GPU Computing Resources and Community at the University of Sheffield
by Dr Paul Richmond (University of Sheffield)
Get the Starting code from github by cloning the master branch of the CUDALab04 repository from the RSE-Sheffield github account. E.g.
git clone https://github.com/RSE-Sheffield/CUDALab04.git
This will check out all the starting code for you to work with.
In order to find the maximum mark from a set of records, we will need to apply a parallel reduction. Rather than summing elements, the reduction will propagate the maximum value of two elements. This will allow us to exploit the GPUs parallelism to find a single value. For this first exercise, we will do this using a CUDA implementation of a reduction. Some code has already been provided for you to complete. Examine the function
maximumMark_CUDA and the associated kernel
maximumMark_CUDA_kernelis for each thread block to reduce
THREADS_PER_BLOCKvalues to a single value. Each thread within the block loads a single element into shared memory, reduce the values from shared memory (to avoid global memory reads) and finally the first thread in the block will write the reduced value to the results array (
d_reduce_marks). The result of the executed kernel will be an array with a single reduced value for each thread block.
To reduce the shared memory values, you must modify the provided loop which has
log2(THREADS_PER_BLOCK) iterations as each iteration divides a stride value in half (
stride>>= 1). You should ensure that only stride threads perform each iteration of the loop and that each participating thread reads two values, makes a comparison to find the maximum value and then writes this to a suitable shared memory location which allows the next iteration to function correctly. An example is shown in the below figure which demonstrates how this algorithms should work for a simplified case where the thread block size is
8 and the number of iterations is
maximumMark_CUDAfunction so that you read back the results of the kernel and reduce the final number of marks with a
float. You should pass through the pointer to the host data array and a pointer to the end of the host data array as arguments to the
thrust::reducecall to find the maximum value by passing the following arguments: an iterator to the start of your thrust device vector, an iterator to the end of your thrust device vector, an initial float value of zero and a binary function operator (in this case
thrust::maximum<float>()). Note: A custom
my_maximumstructure has also been provided so that you can see how an
operatorcan be defined to be used as a custom thrust binary operator.
Using the same
marks.cu file, we will now use thrust to find the number of marks which are greater than
90%. There are a number of ways in which we can do this by applying different algorithms. We will explore two options.
90%by sorting all the marks and then finding the index of the first mark which is greater than 90. To implement this, complete the following in the
thrust::sortalgorithm to sort the data in place
thrust::find_ifalgorithm to return an iterator at the location of the first mark of greater than
90%. Using the
my_maximumstructure example you will need to implement your own binary operator with the following function signature:
__host__ __device__ bool operator()(float x);
thrust::distanceto find the number of elements between the start of your thrust device vector and the returned iterator. Use this to return the correct value.
2.2 An alternative approach to finding marks greater than
90% is to use a prefix sum to partition and re-reorder the data. Thrust provided an algorithms for doing this called
thrust::partition. This algorithm will partition the data into the original array with all records meeting the condition at the front of the array and all records failing the condition at the end of the array. The partition function returns an iterator which points to the end of the data records which pass the condition. Modify the partition_Thrust function to use the
thrust::partition function to find the number of marks which exceed
90%. Hint: you can re-use the binary operator from the previous task.
NUMBER_LOOPSso that you can get more accurate timing values to compare the two approaches.
The exercise solutions are available from the solution branch of the repository. To check these out either clone the repository using the branch command to a new directory as follows;
git clone -b solutions https://github.com/RSE-Sheffield/CUDALab04.git
Alternately commit your changes and switch branch:
git commit -m “my local changes to src files” git checkout solutions
You will need to commit your local changes to avoid overwriting your changes when switching to the solutions branch. You can then return to your modified versions by returning to the master branch.