This article does not go deep into the randomization algorithm; rather, it provides an introduction to OpenCL and parallel execution.
In this article we use the minimal standard random number generator published by Park and Miller.
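For reference, here is a minimal sketch of the Park-Miller recurrence in plain C. The helper name next_random and the 64-bit intermediate (which avoids 32-bit overflow) are our additions; the kernel below sticks to 32-bit arithmetic:

#include <stdint.h>

/* Park-Miller "minimal standard" generator:
 * seed(n+1) = 16807 * seed(n) mod (2^31 - 1) */
static uint32_t next_random(uint32_t seed)
{
    return (uint32_t)(((uint64_t)seed * 16807u) % 2147483647u);
}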
OpenCL C looks similar to plain C, and code that works on a CPU can usually be ported to OpenCL with little effort:

__kernel void miller_generator(uint seed, uchar start, uchar end, uint len, __global uint *res_g)
{
    unsigned int a = 16807;        /* Park-Miller multiplier */
    unsigned int m = 2147483647;   /* Park-Miller modulus, 2^31 - 1 */
    int gid = get_global_id(0);
    unsigned int final, val;

    seed = seed * (gid + 1);       /* make the seed unique per work-item */
    res_g = res_g + gid*len/4;     /* each work-item writes its own slice */

    for (int i = 0; i < len/4; i++) {
        final = 0;
        seed = (a * seed) % m;
        /* Map each byte of the 32-bit random value into [start, end]
         * and pack four result characters into one uint */
        for (int j = 0; j < 4; j++) {
            val = ((seed & (0xff << j*8)) >> j*8) % (end - start + 1) + start;
            final = final | (val << j*8);
        }
        res_g[i] = final;
    }
}
The __kernel qualifier declares a function that is visible to host code so that it can be enqueued. Kernels can also call other functions.
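As a minimal illustration (the names square and squares are hypothetical), a kernel may call an ordinary helper function, while only the kernel itself can be enqueued by the host:

/* A plain helper function; not visible to the host */
uint square(uint x)
{
    return x * x;
}

/* The kernel is visible to the host and may call square() */
__kernel void squares(__global uint *out)
{
    int gid = get_global_id(0);
    out[gid] = square(gid);
}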
There are four address space qualifiers:
__global, __local, __constant, __private
Each qualifier binds an object to a specific address space.
For example, __global marks objects that live in global memory, __local objects in local memory, and __private objects in private memory. The picture below illustrates the difference between global, local, and private memory:

Global memory is visible to all work-groups whose work-items are running on the compute units. Local memory is typically the shared memory of the work-items within one work-group. One compute unit can run several work-groups, and each work-group may get only a fraction of the total local memory. Local memory is usually a few kilobytes per compute unit, while private memory amounts to a few words per work-item.
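A small illustrative sketch (the kernel name and logic are hypothetical) that touches all three address spaces:

__kernel void spaces_demo(__global const uint *in, __global uint *out,
                          __local uint *scratch)
{
    uint gid = get_global_id(0);
    uint lid = get_local_id(0);
    uint tmp = in[gid];            /* tmp lives in private memory */
    scratch[lid] = tmp;            /* visible to the whole work-group */
    barrier(CLK_LOCAL_MEM_FENCE);  /* synchronize the work-group */
    out[gid] = scratch[lid];       /* write back to global memory */
}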
To achieve good performance, the developer should keep the characteristics of this memory hierarchy in mind. For example, if a work-item allocates more private memory than is available per work-item, fewer work-items can run at the same time, harming performance. Private memory should be thought of as the compute unit's register file divided among the resident work-items.
OpenCL provides eight built-in functions that a kernel can call to query its position and the dimensions of the NDRange (global size, local size, and so on):

uint   get_work_dim()                  Returns the number of dimensions in use (1, 2 or 3)
size_t get_global_id(uint dimidx)      Returns the global ID of the current work-item in dimension dimidx
size_t get_global_size(uint dimidx)    Returns the total number of work-items in dimension dimidx
size_t get_global_offset(uint dimidx)  Returns the offset specified in the clEnqueueNDRangeKernel call in dimension dimidx
size_t get_group_id(uint dimidx)       Returns the ID of the current work-group in dimension dimidx
size_t get_local_id(uint dimidx)       Returns the ID of the work-item within its work-group in dimension dimidx
size_t get_local_size(uint dimidx)     Returns the number of work-items per work-group in dimension dimidx
size_t get_num_groups(uint dimidx)     Returns the total number of work-groups in dimension dimidx
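As a quick sketch (a hypothetical kernel), in one dimension these functions are related by global_id = global_offset + group_id * local_size + local_id:

__kernel void id_demo(__global uint *out)
{
    /* Recompute the global ID from its components */
    size_t gid = get_global_offset(0)
               + get_group_id(0) * get_local_size(0)
               + get_local_id(0);
    /* gid always equals get_global_id(0) */
    out[get_global_id(0) - get_global_offset(0)] = (uint)gid;
}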
Our kernel function is defined as follows:
__kernel void miller_generator(uint seed, uchar start, uchar end, uint len, __global uint *res_g);
The purpose of the function is to generate a random string of length len containing characters in the range [start, end].
seed is the value used to compute the random numbers. The final string is stored in the global buffer res_g.
Since the kernel is executed by every work-item, the function must adapt seed (it must differ between work-items, otherwise every work-item would produce the same random values) as well as the res_g pointer, which must start at a per-work-item offset.
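In our kernel this is handled by the first two statements:

seed = seed * (gid + 1);        /* make the seed unique per work-item */
res_g = res_g + gid*len/4;      /* each work-item writes its own slice */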
Finally, the host program builds the OpenCL code and enqueues the kernel with the desired number of work-items to be executed in parallel:

#define MAX_WORK_ITEMS 2000 // Total 2304 cores on AMD GPU
#define MAX_STR_LEN 32
#define MAX_STR_NUM (size_t)10000000

// Create the compute program from the source buffer
program = clCreateProgramWithSource(context, 1, (const char **)&programBuffer, &programSize, &err);

// Build the program
clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

// Create the compute kernel in the program we wish to run
kernel = clCreateKernel(program, "miller_generator", &err);

// Create the output array in device memory
output = clCreateBuffer(context, CL_MEM_WRITE_ONLY, MAX_STR_NUM*MAX_STR_LEN, NULL, NULL);

// Set the arguments to our compute kernel
time_t t;
time(&t);
unsigned int seed = (unsigned int)t; // pass exactly sizeof(uint) bytes as the seed
clSetKernelArg(kernel, 0, sizeof(unsigned int), &seed);
unsigned char start = 0x21; // U+0021 EXCLAMATION MARK
clSetKernelArg(kernel, 1, sizeof(unsigned char), &start);
unsigned char end = 0x7e; // U+007E TILDE
clSetKernelArg(kernel, 2, sizeof(unsigned char), &end);
unsigned int size = MAX_STR_NUM*MAX_STR_LEN/MAX_WORK_ITEMS;
clSetKernelArg(kernel, 3, sizeof(unsigned int), &size);
clSetKernelArg(kernel, 4, sizeof(output), &output);

// Get the maximum work group size for executing the kernel
clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_WORK_GROUP_SIZE, sizeof(local), &local, NULL);

// Execute the kernel over the entire range of our 1D data set
global = MAX_WORK_ITEMS;
clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, &local, 0, NULL, NULL);

// Wait for the enqueued commands to be serviced before reading back the result
clFinish(commands);

// Read back the results from the device for verification
clEnqueueReadBuffer(commands, output, CL_TRUE, 0, MAX_STR_NUM*MAX_STR_LEN, result, 0, NULL, NULL);
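In a real program every OpenCL call should be checked for errors. As a sketch, if clBuildProgram fails, the compiler output can be retrieved with clGetProgramBuildInfo (reusing program and device_id from above):

if (clBuildProgram(program, 0, NULL, NULL, NULL, NULL) != CL_SUCCESS) {
    size_t log_size;
    // First query the size of the build log, then fetch it
    clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
    char *log = malloc(log_size);
    clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
    fprintf(stderr, "Build log:\n%s\n", log);
    free(log);
}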
The test generates 320 million characters in total, i.e. 10 million strings of 32 characters each:

evg@evg:~/projects/test/opencl_random$ ./gpu_random
num_platforms = 2
Create a compute context
Create a command commands
Create the compute program from the source buffer
Build the program
Create the compute kernel in the program we wish to run
Create the input and output arrays in device memory
Set the arguments to our compute kernel
Get the maximum work group size for executing the kernel
Exec time: 74836 microsecs
The same test on an AMD CPU (1 core):

evg@evg:~/projects/test/opencl_random$ ./cpu_random
Test for 10000000 reps
Exec time: 473614 microsecs
It is worth mentioning that in the GPU test one of the most expensive operations is copying memory from VRAM to DRAM, which hurts overall performance: the computation itself takes only 32 microseconds on the GPU, while in the CPU test the computation alone takes 269 milliseconds!
Conclusion: when using the GPU, developers should avoid copying large buffers back to DRAM.
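One way to follow this advice is to keep the data on the device and hand the buffer to the next processing step instead of reading it back (a sketch, assuming a hypothetical consume_kernel that operates on the generated strings):

/* Instead of clEnqueueReadBuffer(...), reuse the device buffer */
clSetKernelArg(consume_kernel, 0, sizeof(output), &output);
clEnqueueNDRangeKernel(commands, consume_kernel, 1, NULL, &global, &local, 0, NULL, NULL);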
Please check out the code used in this article here.
References:
S. K. Park and K. W. Miller, "Random Number Generators: Good Ones Are Hard to Find," Communications of the ACM, vol. 31, no. 10, pp. 1192-1201, 1988.