GPU Computing Resources and Community at the University of Sheffield
by Dr Paul Richmond (University of Sheffield)
Get the starting code from GitHub by cloning the master branch of the CUDALab03 repository from the RSE-Sheffield GitHub account, e.g.
git clone https://github.com/RSE-Sheffield/CUDALab03.git
This will check out all the starting code for you to work with.
For exercise one, we are going to modify an implementation of matrix multiplication (provided in matrixmul.cu). The implementation provided multiplies a matrix A by a matrix B to produce a matrix C. The widths and heights of the matrices can be modified but, for simplicity, must be a multiple of BLOCK_SIZE (which has been predefined as a macro in the code). The implementation is currently very inefficient as it performs A_WIDTH + B_HEIGHT global memory loads to compute each value of matrix C. To improve this, we will implement a blocked matrix multiply which uses CUDA's shared memory to reduce the number of memory reads by a factor of BLOCK_SIZE. First note the performance of the original version, then modify the existing code to perform a blocked matrix multiply.
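For reference, the unoptimised kernel has roughly the following shape. This is a sketch, not the lab's actual code: the kernel name, parameter list, and example sizes are assumptions, while A_WIDTH and BLOCK_SIZE follow the handout's naming.

```
// Sketch of a naive matrix multiply kernel, assuming row-major storage.
// Each thread computes one element of C and reads a whole row of A plus a
// whole column of B from global memory (A_WIDTH + B_HEIGHT loads per value).
#define A_WIDTH 1024   // assumption: example sizes, multiples of BLOCK_SIZE
#define B_WIDTH 1024

__global__ void matrixMulNaive(const float *A, const float *B, float *C)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column of C
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row of C

    float sum = 0.0f;
    for (int k = 0; k < A_WIDTH; ++k) {
        // Two global memory loads per iteration: one from A, one from B
        sum += A[y * A_WIDTH + k] * B[k * B_WIDTH + x];
    }
    C[y * B_WIDTH + x] = sum;                        // one value of matrix C
}
```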
To implement a blocked matrix multiply, we must load NUM_SUBS square sub matrices of matrix A and matrix B into shared memory and accumulate the intermediate results of the sub matrix products. In the example figure (above), where NUM_SUBS is equal to two, the sub matrix C(1,1) can be calculated by a square thread block of BLOCK_SIZE x BLOCK_SIZE threads, where each thread (at location (tx, ty) in the thread block) performs the following steps, which require two stages of loading matrix tiles into shared memory (a sketch follows the list):

1. Load one element of the first sub matrix of A and one element of the first sub matrix of B into shared memory, then synchronise the thread block so that both tiles are fully loaded.
2. Iterate from 0 to BLOCK_SIZE to multiply row ty of the A tile (from shared memory) by column tx of the B tile (from shared memory), accumulating the sub matrix product value in a local variable.
3. Synchronise again, then repeat steps 1 and 2 for the second pair of sub matrices.
4. Store the accumulated sum of the sub matrix products at position (x, y) of matrix C.
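As a concrete illustration, a blocked multiply along these lines might look as follows. This is a sketch of the standard shared memory technique, not the lab's exact solution; the kernel name, parameter list, and example macro values are assumptions.

```
#define BLOCK_SIZE 16                      // assumption: value predefined in the lab code
#define A_WIDTH    1024                    // example sizes, multiples of BLOCK_SIZE
#define B_WIDTH    1024
#define NUM_SUBS   (A_WIDTH / BLOCK_SIZE)  // number of tile pairs per C element

__global__ void matrixMulShared(const float *A, const float *B, float *C)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];   // tile of A
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];   // tile of B

    int tx = threadIdx.x, ty = threadIdx.y;
    int x = blockIdx.x * BLOCK_SIZE + tx;          // column of C
    int y = blockIdx.y * BLOCK_SIZE + ty;          // row of C

    float sum = 0.0f;
    for (int m = 0; m < NUM_SUBS; ++m) {
        // Stage 1: each thread loads one element of the current A and B tiles
        As[ty][tx] = A[y * A_WIDTH + (m * BLOCK_SIZE + tx)];
        Bs[ty][tx] = B[(m * BLOCK_SIZE + ty) * B_WIDTH + x];
        __syncthreads();                           // wait until both tiles are loaded

        // Stage 2: row ty of the A tile times column tx of the B tile
        for (int k = 0; k < BLOCK_SIZE; ++k)
            sum += As[ty][k] * Bs[k][tx];
        __syncthreads();                           // wait before tiles are overwritten
    }
    C[y * B_WIDTH + x] = sum;                      // position (x, y) of matrix C
}
```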
Following the approach described above for the example in the figure, modify the code where it is marked TODO to support the general case (any sizes of A and B which are a multiple of BLOCK_SIZE). Test and benchmark your code against the original version.
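If the starting code does not already time the kernel for you, CUDA events are the usual way to benchmark; a minimal sketch (the kernel and launch configuration names stand in for whatever you are measuring):

```
// Inside your host code, wrapped around the kernel launch being benchmarked.
cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
matrixMulShared<<<grid, block>>>(d_A, d_B, d_C);
cudaEventRecord(stop);
cudaEventSynchronize(stop);                 // wait for the kernel to finish

cudaEventElapsedTime(&ms, start, stop);     // elapsed time in milliseconds
printf("Kernel took %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```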
Note: when building in release mode the CUDA compiler often fuses a multiply and an add into a single instruction, which improves performance but causes a small loss of accuracy. To ensure that your test passes, you should tell the nvcc compiler to avoid fused multiply-add by adding --fmad=false to your compilation options.
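For example (assuming a plain nvcc build; adjust the file and output names to match your setup):

nvcc -O3 --fmad=false matrixmul.cu -o matrixmul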
For this exercise, we are going to optimise a piece of code which implements a simple ray tracer, exploring how the use of different types of GPU memory affects performance. The ray tracer is a simple ray casting algorithm which casts a ray for each pixel into a scene consisting of sphere objects. Each ray checks for intersections with the spheres; where there is an intersection, a colour value for the pixel is generated based on the intersection position of the ray on the sphere (giving an impression of forward-facing lighting). For more information on the ray tracing technique, read Chapter 6 of the CUDA by Example book, on which this exercise is based. Try compiling and executing the starting code raytracer.cu and examining the output image.

Assuming you have started an interactive session on a CPU worker node with qrshx, and started your ssh session with the -X argument (in the PuTTY configuration), you can use X forwarding to view the image using the viewraytrace.py Python file provided. You will need to ensure that you have an X server running on your local (not ShARC) machine; if you are using a CiCS managed desktop machine then run Xming. First run the following command from your interactive session to load the necessary Python imaging libraries:

module load apps/python/anaconda3-4.2.0

Next, run the viewraytrace.py script, which will open a graphical window on your local machine displaying the image from the remote (ShARC) machine.
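For orientation, the per-sphere intersection test in this style of ray caster (following the CUDA by Example version the exercise is based on) has roughly this shape; the exact struct layout and names in raytracer.cu may differ:

```
#include <math.h>

// Sketch of a sphere and its ray hit test, after CUDA by Example; rays are
// cast along z from pixel position (ox, oy). Names are illustrative.
struct Sphere {
    float r, g, b;        // colour of the sphere
    float radius;
    float x, y, z;        // centre of the sphere

    // Returns the z value where the ray from pixel (ox, oy) hits the sphere,
    // or -INFINITY on a miss. *n is set to a 0..1 shading factor based on
    // how face-on the hit point is (the forward-facing lighting effect).
    __device__ float hit(float ox, float oy, float *n) {
        float dx = ox - x;
        float dy = oy - y;
        if (dx * dx + dy * dy < radius * radius) {
            float dz = sqrtf(radius * radius - dx * dx - dy * dy);
            *n = dz / radius;
            return dz + z;
        }
        return -INFINITY;
    }
};
```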
The initial code places the spheres in GPU global memory. There are two good options for improving this: constant memory and the read-only data cache. Implement the following changes (a sketch of both variants follows the list):

1. Create a modified version of the ray tracing kernel (ray_trace_read_only) which reads the spheres through the read-only data cache. You should implement this by using the const and __restrict__ qualifiers on the function arguments, and you will need to also create a modified version of the sphere intersect function. Complete the kernel and kernel call, and calculate the execution time of the new version alongside the old version so that they can be directly compared.
2. Create a modified version of the ray tracing kernel (ray_trace_const) which holds the spheres in constant memory. You will need to complete the kernel and kernel call. Calculate the execution time of the new version alongside the other two versions so that they can be directly compared.
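As a rough illustration of the two variants (a sketch assuming the Sphere struct from raytracer.cu and that the kernels take an output image buffer; the real parameter lists and sphere count macro may differ):

```
#define SPHERES 16   // assumption: sphere count macro from the lab code

// Read-only cache variant: const + __restrict__ lets the compiler route the
// sphere loads through the read-only data cache.
__global__ void ray_trace_read_only(uchar4 *image,
                                    const Sphere * __restrict__ spheres)
{
    /* ... as the original kernel, but using 'spheres' and the read-only
       version of the sphere intersect function ... */
}

// Constant memory variant: the sphere array lives in __constant__ memory,
// so it is no longer passed as a kernel argument.
__constant__ Sphere d_const_spheres[SPHERES];

__global__ void ray_trace_const(uchar4 *image)
{
    /* ... as the original kernel, but reading d_const_spheres ... */
}

// Host side: copy the host sphere data into the constant memory symbol, e.g.
// cudaMemcpyToSymbol(d_const_spheres, h_spheres, sizeof(Sphere) * SPHERES);
```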
Record your execution times for each version in a table such as:

| Sphere Count | Normal | Read-only cache | Constant cache |
| ------------ | ------ | --------------- | -------------- |
The exercise solutions are available from the solutions branch of the repository. To check these out, either clone the repository into a new directory using the -b (branch) option as follows:
git clone -b solutions https://github.com/RSE-Sheffield/CUDALab03.git
Alternatively, commit your changes and then switch branch:

git commit -a -m "my local changes to src files"
git checkout solutions
You will need to commit your local changes to avoid losing them when switching to the solutions branch. You can then return to your modified versions by checking out the master branch.