Random string generation on GPU (intro to OpenCL)

The article doesn't go deep into randomization algorithm rather then providing an introduction into OpenCL and parallel execution.

In the article we used the function published by Park and Miller for minimal standard random number generation.

OpenCL code looks similar to C code and usually if it works on CPU can be easily ported to OpenCL:

__kernel void miller_generator(uint seed, uchar start, uchar end,
uint len, __global uint *res_g)
{
  unsigned int a = 16807;
  unsigned int m = 2147483647;
  int gid = get_global_id(0);

  seed = seed * (gid + 1);

  unsigned int final, val;
  res_g = res_g + gid*len/4;
  for (int i = 0; i < len/4; i++) {
    final = 0;
    seed = (a * seed) % m;
    for (int j = 0; j < 4; j++) {
        val = ((seed & (0xff << j*8)) >> j*8) % (end - start + 1) + start;
        final = final | (val << j*8);
    }
    res_g[i] = final;
  }
}

__kernel qualifier declares a function that is visible to host code so it can be en-queued. Kernels can call other functions.

There are 4 address space qualifiers:

__global, __local, __constant, __private

All qualifiers are used to mark objects to specific address space.

For example __global is used to mark objects that belong to global memory, __local to local memory, __private to private memory. Below picture illustrate the difference beetween global, local and private memory:

Global memory is memory visible for all work-groups that have certain number of work-items running on compute units. Local memory is typically shared memory of work-items in a work-group. One compute unit can run several work-groups and only a fraction of the total local memory size may be available to each work-group. Usually local memory has size of several KBytes per compute unit. Private memory is available per each work-item in size of a few words.

In order to achieve good performance the developer shall keep in mind memory hierarchy characteristics.  For example, if work-item allocates more private memory that's available for one work-item it would turn into reduced number of work-items that can be run at the same time harming the performance. The private memory shall be treated as a number of registers available on compute unit divided among other work-items.

OpenCL contains eight built-in functions that can be used in the kernel to query global size, local size:

uint get_work_dim()
Returns the number of dimension (There are 1D, 2D and 3D dimensions)
ssize_t get_global_id(uint dimidx)
Returns the ID of the current work-item in dimension dimidx
size_t get_global_size(uint dimidx)
Returns the total number of work-items in dimension dimidx
size_t get_global_offset(uint dimidx)
Returns the offset as specified in the enqueueNDRangeKernel API in dimension dimidx
size_t get_group_id(uint dimidx)
Returns the ID of the current work-group in dimension dimidx
size_t get_local_id(uint dimidx)
Returns the ID of the work-item within the work-group in dimension dimidx
size_t get_local_size (uint dimidx)
Returns the number of work-items per work-group in dimension dimidx
size_t get_num_groups(uint dimidx)
Returns the total number of work-groups in dimension dimidx

Our kernel function defined as follows:

__kernel void miller_generator(uint seed, uchar start, uchar end,
uint len, __global uint *res_g);

The purpose of the function is to generate a random string which contains charachters in the range of [start,end] with the length of len.

Seed is value used for random value computation. Final string will be stored to global res_g.

Since the kernel will be called by each work-item the function shall adapt seed (seed shall be different for each work-item otherwise the random value will be the same for all work-items) as well as reg_g pointer - shall start with the offset for each work-item.

Finally the host program shall build OpenCL code and enque the kernel with desired number of work-items to be executed in parallel:

#define MAX_WORK_ITEMS  2000 // Total 2304 cores on AMD GPU
#define MAX_STR_LEN     32
#define MAX_STR_NUM     (size_t)10000000

// Create the compute program from the source buffer
program = clCreateProgramWithSource(context, 1,
           (const char **) &programBuffer, &programSize, &err);
// Build the program
clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
// Create the compute kernel in the program we wish to run
kernel = clCreateKernel(program, "miller_generator", &err);
// Create the input and output arrays in device memory\
output = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
           MAX_STR_NUM*MAX_STR_LEN, NULL, NULL);

// Set the arguments to our compute kernel
time_t t; time(&t);
clSetKernelArg(kernel, 0, sizeof(unsigned int), &t);
unsigned char start = 0x21; // U+0021 EXCLAMATION MARK
clSetKernelArg(kernel, 1, sizeof(unsigned char), &start);
unsigned char end = 0x7e; // U+007E TILDE
clSetKernelArg(kernel, 2, sizeof(unsigned char), &end);
unsigned int size = MAX_STR_NUM*MAX_STR_LEN/MAX_WORK_ITEMS;
clSetKernelArg(kernel, 3, sizeof(unsigned int), &size);
clSetKernelArg(kernel, 4, sizeof(output), &output);
// Get the maximum work group size for executing the kernel
clGetKernelWorkGroupInfo(kernel, device_id,
      CL_KERNEL_WORK_GROUP_SIZE, sizeof(local), &local, NULL);

// Execute the kernel over the entire range of our 1d input data set
global = MAX_WORK_ITEMS;
clEnqueueNDRangeKernel(commands, kernel, 1, NULL,
            &global, &local, 0, NULL, NULL);
// Wait for the command commands to get serviced before reading back result
clFinish(commands);
// Read back the results from the device for verification
clEnqueueReadBuffer( commands, output, CL_TRUE,
            0, MAX_STR_NUM*MAX_STR_LEN, result, 0, NULL, NULL );

Generation of 320 million character string or 10 million strings containing 32 charachters test:

evg@evg:~/projects/test/opencl_random$ ./gpu_random
num_platforms = 2
Create a compute context
Create a command commands
Create the compute program from the source buffer
Build the program
Create the compute kernel in the program we wish to run
Create the input and output arrays in device memory
Set the arguments to our compute kernel
Get the maximum work group size for executing the kernel
Exec time: 74836 microsecs

The same test on AMD CPU (1 Core):

evg@evg:~/projects/test/opencl_random$ ./cpu_random
Test for 10000000 reps
Exec time: 473614 microsecs

Worth mentioning that for the GPU test the one of the most expensive operations is memory copying from VRAM to DRAM which affects performance negatively. It takes only 32 microseconds to do the computation work on GPU, while for the CPU test it takes 269 milliseconds just for the computation to be finished!

Conclusion: Developers shall avoid big buffers copying back to DRAM while using GPU.

Please check out the code used in this article here.

References:

Random Number Generators: Good Ones are hard to find

OpenCL: A Hands-on Introduction