GPU Computing Resources and Community at the University of Sheffield

*by Dr Paul Richmond (University of Sheffield)*

Get the starting code from GitHub by cloning the master branch of the CUDALab03 repository from the RSE-Sheffield GitHub account, e.g.

```
git clone https://github.com/RSE-Sheffield/CUDALab03.git
```

This will check out all the starting code for you to work with.

For exercise one, we are going to modify an implementation of matrix multiply (provided in `matrixmul.cu`). The implementation provided multiplies a matrix A by a matrix B to produce a matrix C. The widths and heights of the matrices can be modified but for simplicity must be a multiple of `BLOCK_SIZE` (which has been predefined as a macro in the code). The implementation is currently very inefficient as it performs `A_WIDTH × B_HEIGHT` memory loads to compute each product (value in matrix C). To improve this, we will implement a blocked matrix multiply which uses CUDA's shared memory to reduce the number of memory reads by a factor of `BLOCK_SIZE`. First note the performance of the original version and then modify the existing code to perform a blocked matrix multiply.

To implement a blocked matrix multiply, it is required that we load `NUM_SUBS` square sub-matrices of matrix A and matrix B into shared memory so that we can compute the intermediate results of the sub-matrix products. In the example figure (above), where `NUM_SUBS` is equal to two, the sub-matrix `C(1,1)` can be calculated by a square thread block of `BLOCK_SIZE x BLOCK_SIZE`, where each thread (with location `tx`, `ty` in the square thread block) performs the following steps, which require two stages of loading matrix tiles into shared memory.

- Load the two sub-matrices (one from A and one from B) into shared memory. For these sub-matrices each thread should:
  - Load an element of the A sub-matrix into shared memory from matrix A at position `(ty+BLOCK_SIZE, tx)`.
  - Load an element of the B sub-matrix into shared memory from matrix B at position `(ty, tx+BLOCK_SIZE)`.
  - Synchronise to ensure all threads have completed loading the sub-matrix values to shared memory.
- Compute the dot product of each row in the A sub-matrix with each column in the B sub-matrix, storing the result in a local variable. This is achieved through the following steps:
  - Iterate from `0` to `BLOCK_SIZE` to multiply row `ty` of the A sub-matrix (from shared memory) by column `tx` of the B sub-matrix (from shared memory) to calculate the sub-matrix product value.
  - Store the sub-matrix product value in a thread-local variable.
  - Synchronise to ensure that all threads have finished reading from shared memory.
- Repeat steps 1 & 2 for the next sub-matrix (or matrices in the general case), adding each new dot product result to the previous result. For the example in the figure, there is only one more iteration required to load the final two sub-matrices. E.g.
  - Load an element of the A sub-matrix into shared memory from matrix A at position `(ty+BLOCK_SIZE, tx+BLOCK_SIZE)`.
  - Load an element of the B sub-matrix into shared memory from matrix B at position `(ty+BLOCK_SIZE, tx+BLOCK_SIZE)`.
  - Synchronise, then iterate from `0` to `BLOCK_SIZE` to multiply row `ty` of the A sub-matrix (from shared memory) by column `tx` of the B sub-matrix (from shared memory), adding this sub-matrix product value to the one calculated for the previous sub-matrices.
- Store the sum of the sub-matrix dot products into global memory at location `(x, y)` of matrix C.

Following the approach described above for the example in the figure, modify the code where it is marked `TODO` to support the general case (any sizes of A and B which are a multiple of `BLOCK_SIZE`). Test and benchmark your code compared to the original version.
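The steps above can be sketched as a CUDA kernel. This is a minimal illustration only: the names `matrixMulBlocked`, `d_A`, `d_B`, `d_C`, `A_WIDTH` and `B_WIDTH` are assumptions, and the actual structure of the `TODO` sections in `matrixmul.cu` may differ.

```cuda
// Minimal sketch of a blocked (tiled) matrix multiply, assuming row-major
// storage and dimensions that are exact multiples of BLOCK_SIZE. All names
// here are illustrative and may differ from the lab code.
#define BLOCK_SIZE 16

__global__ void matrixMulBlocked(const float *d_A, const float *d_B, float *d_C,
                                 int A_WIDTH, int B_WIDTH)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    const int tx = threadIdx.x, ty = threadIdx.y;
    const int x = blockIdx.x * BLOCK_SIZE + tx;   // column of C
    const int y = blockIdx.y * BLOCK_SIZE + ty;   // row of C

    float sum = 0.0f;
    const int NUM_SUBS = A_WIDTH / BLOCK_SIZE;    // number of tile pairs

    for (int sub = 0; sub < NUM_SUBS; ++sub) {
        // Step 1: each thread loads one element of the current A and B tiles.
        As[ty][tx] = d_A[y * A_WIDTH + (sub * BLOCK_SIZE + tx)];
        Bs[ty][tx] = d_B[(sub * BLOCK_SIZE + ty) * B_WIDTH + x];
        __syncthreads();  // all tile loads must complete before use

        // Step 2: accumulate the partial dot product for this tile pair.
        for (int k = 0; k < BLOCK_SIZE; ++k)
            sum += As[ty][k] * Bs[k][tx];
        __syncthreads();  // all reads must finish before tiles are overwritten
    }

    // Step 3: write the accumulated result to global memory.
    d_C[y * B_WIDTH + x] = sum;
}
```

Each element of the tiles is loaded from global memory once per thread block rather than once per thread, which is where the `BLOCK_SIZE` reduction in memory reads comes from.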

*Note: When building in release mode the CUDA compiler often fuses a multiply and an add instruction, which improves performance but causes a small loss of accuracy. To ensure that your test passes, you should either:*

- Implement a small epsilon value and check that the difference between the host and device versions is within this acceptable range.
- Modify the `nvcc` compiler flags to avoid fused multiply-add by adding `-fmad=false` to your compilation options.
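As a sketch of the first option, a host-side epsilon comparison might look like the following. The function name, array names and `EPSILON` value are illustrative, not taken from the lab code.

```c
// Illustrative host-side check: compare device results against a host
// reference within a small tolerance rather than testing exact equality,
// so that fused multiply-add rounding differences do not fail the test.
#include <math.h>
#include <stdio.h>

#define EPSILON 1e-3f   /* illustrative tolerance; tune to your problem size */

int compareResults(const float *host_C, const float *device_C, int n)
{
    int errors = 0;
    for (int i = 0; i < n; ++i) {
        if (fabsf(host_C[i] - device_C[i]) > EPSILON)
            ++errors;
    }
    if (errors)
        printf("Test failed: %d values differ by more than %f\n", errors, EPSILON);
    return errors;
}
```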

Assuming you have started an interactive session on a CPU worker node with `qrshx`, and started your `ssh` session with the `-X` argument (or the equivalent X11 forwarding option in the PuTTY configuration), you can use X forwarding to view the image using the `viewraytrace.py` Python file provided. You will need to ensure that you have an X server running on your local (not ShARC) machine. If you are using a CiCS-managed desktop machine then run `Xming`. First run the following command from your interactive session to load the necessary Python imaging libraries:

```
module load apps/python/anaconda3-4.2.0
```

Next you can run the following which will open a graphical window on your local machine displaying the image from the remote (ShARC) machine.

```
python viewraytrace.py
```

For this exercise, we are going to optimise a piece of code which implements a simple ray tracer. We will explore how the different types of memory affect performance. The ray tracer is a simple ray casting algorithm which casts a ray for each pixel into a scene consisting of sphere objects. The ray checks for intersections with the spheres; where there is an intersection, a colour value for the pixel is generated based on the intersection position of the ray on the sphere (giving an impression of forward-facing lighting). For more information on the ray tracing technique, read Chapter 6 of the CUDA by Example book, on which this exercise is based. Try compiling and executing the starting code `raytracer.cu` and examining the output image (`output.ppm`).
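For orientation, the core of this style of ray caster is a per-sphere hit test performed by each pixel's ray. A minimal sketch, following the approach of CUDA by Example, Chapter 6, is shown below; the struct layout and function name are illustrative and will differ from the lab's actual `sphere_intersect` implementation.

```cuda
// Minimal sketch of a ray-sphere hit test for rays cast parallel to the
// z axis (one ray per pixel). Names and layout are illustrative only.
struct Sphere {
    float x, y, z;   // centre of the sphere
    float radius;
};

// Returns the z depth of the hit point for the ray through pixel (ox, oy),
// or a large negative sentinel if the ray misses. 'n' receives a 0..1
// shading factor that fades towards the sphere's silhouette edge, giving
// the impression of forward-facing lighting.
__device__ float sphere_hit(const Sphere *s, float ox, float oy, float *n)
{
    float dx = ox - s->x;
    float dy = oy - s->y;
    if (dx * dx + dy * dy < s->radius * s->radius) {
        float dz = sqrtf(s->radius * s->radius - dx * dx - dy * dy);
        *n = dz / s->radius;   // lighting term from intersection position
        return dz + s->z;      // depth of the intersection
    }
    return -2e10f;             // miss sentinel
}
```

The kernel keeps, for each pixel, the hit with the greatest depth across all spheres and colours the pixel accordingly.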

The initial code places the spheres in GPU global memory. We know that there are two good options for improving this: constant memory and the read-only data cache. Implement the following changes.

- 2.1 Create a modified version of the ray tracing kernel which uses the read-only data cache (`ray_trace_read_only`). You should implement this by using the `const` and `__restrict__` qualifiers on the function arguments. You will need to complete the kernel and the kernel call. Calculate the execution time of the new version alongside the old version so that they can be directly compared. You will also need to create a modified version of the sphere intersect function (`sphere_intersect_read_only`).
- 2.2 Create a modified version of the ray tracing kernel which uses the constant data cache (`ray_trace_const`). You will need to complete the kernel and the kernel call. Calculate the execution time of the new version alongside the other two versions so that they can be directly compared.
- 2.3 How does the performance compare? Is this what you expected, and why? Modify the number of spheres to complete the following table. For an extra challenge, try to do this automatically so that you loop over all the sphere counts in the table and record the timing results in a 2D array.

| Sphere Count | Normal | Read-only cache | Constant cache |
|---|---|---|---|
| 16 | | | |
| 32 | | | |
| 64 | | | |
| 128 | | | |
| 256 | | | |
| 1024 | | | |
| 2048 | | | |
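The qualifier changes for 2.1 and 2.2 can be sketched as follows. These declarations are illustrative assumptions only: the `Sphere` struct, sphere count and kernel signatures in `raytracer.cu` will differ.

```cuda
// Illustrative sketches of the two memory variants; all names are
// assumptions and may differ from the lab code.
#define MAX_SPHERES 2048

struct Sphere { float x, y, z, radius, r, g, b; };

// 2.1: read-only data cache. The const and __restrict__ qualifiers promise
// the compiler that the data is never written through this (or any aliased)
// pointer, allowing loads to be routed through the read-only cache.
__global__ void ray_trace_read_only(uchar4 *image,
                                    const Sphere * __restrict__ spheres,
                                    int num_spheres)
{
    /* ... per-pixel ray cast reading spheres[i] ... */
}

// 2.2: constant data cache. The sphere array lives in __constant__ memory,
// which is cached and broadcast efficiently when all threads in a warp
// read the same element (as they do when looping over the spheres).
__constant__ Sphere const_spheres[MAX_SPHERES];

__global__ void ray_trace_const(uchar4 *image, int num_spheres)
{
    /* ... per-pixel ray cast reading const_spheres[i] ... */
}

// Host side: constant memory is populated with cudaMemcpyToSymbol, e.g.
// cudaMemcpyToSymbol(const_spheres, h_spheres, num_spheres * sizeof(Sphere));
```

Execution times for the three versions can be compared by bracketing each kernel launch with `cudaEventRecord` and reading the result with `cudaEventElapsedTime`.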

The exercise solutions are available from the solutions branch of the repository. To check these out, either clone the solutions branch of the repository to a new directory as follows:

```
git clone -b solutions https://github.com/RSE-Sheffield/CUDALab03.git
```

Alternatively, commit your changes and switch branches:

```
git commit -am "my local changes to src files"
git checkout solutions
```

You will need to commit your local changes to avoid losing them when switching to the solutions branch. You can then return to your modified versions by checking out the master branch.
