PA4 - Tiled Matrix Multiplication

Objective

Implement a tiled dense matrix multiplication routine using shared memory (local memory, declared with __local, in OpenCL terms). Note: you will not need to transpose any matrix for PA4.

Instructions

Edit the code in the code tab to perform the following:

Allocate device memory
Copy host memory to device
Initialize work group dimensions
Invoke OpenCL kernel
Copy results from device to host
Deallocate device memory
Implement the matrix-matrix multiplication routine using shared memory and tiling
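
To make the tiling idea in the last step concrete, here is a minimal CPU-only sketch in plain C, not the OpenCL kernel itself and not the reference solution. The names TILE, num_tiles, and tiled_matmul are hypothetical, the tile width of 4 is only illustrative, and row-major storage is an assumption. The two outer loops stand in for work-groups, the small tile arrays stand in for local memory, and the comments mark where a kernel would need barriers.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 4 /* illustrative tile width, not a value from the assignment */

/* Number of TILE-wide tiles needed to cover n elements (ceiling division). */
size_t num_tiles(size_t n) {
    return (n + TILE - 1) / TILE;
}

/*
 * C (m x n) = A (m x k) * B (k x n), all row-major (an assumption).
 * Each (tr, tc) pair corresponds to one work-group computing one tile of C.
 */
void tiled_matmul(const float *A, const float *B, float *C,
                  size_t m, size_t k, size_t n) {
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t tr = 0; tr < num_tiles(m); ++tr) {
        for (size_t tc = 0; tc < num_tiles(n); ++tc) {
            float acc[TILE][TILE] = {{0}};
            for (size_t t = 0; t < num_tiles(k); ++t) {
                float a_tile[TILE][TILE];
                float b_tile[TILE][TILE];
                /* Load phase: each work-item would load one element of A and
                 * one of B; out-of-range elements are padded with zero. */
                for (size_t r = 0; r < TILE; ++r) {
                    for (size_t c = 0; c < TILE; ++c) {
                        size_t a_row = tr * TILE + r, a_col = t * TILE + c;
                        size_t b_row = t * TILE + r, b_col = tc * TILE + c;
                        a_tile[r][c] = (a_row < m && a_col < k)
                                           ? A[a_row * k + a_col] : 0.0f;
                        b_tile[r][c] = (b_row < k && b_col < n)
                                           ? B[b_row * n + b_col] : 0.0f;
                    }
                }
                /* A kernel needs a barrier here so both tiles are complete. */
                for (size_t r = 0; r < TILE; ++r)
                    for (size_t c = 0; c < TILE; ++c)
                        for (size_t i = 0; i < TILE; ++i)
                            acc[r][c] += a_tile[r][i] * b_tile[i][c];
                /* A kernel needs a second barrier before the next load. */
            }
            /* Store phase: write only the elements that lie inside C. */
            for (size_t r = 0; r < TILE; ++r) {
                for (size_t c = 0; c < TILE; ++c) {
                    size_t row = tr * TILE + r, col = tc * TILE + c;
                    if (row < m && col < n)
                        C[row * n + col] = acc[r][c];
                }
            }
        }
    }
}
```

The same ceiling division is one common way to choose work group dimensions on the host: the global size in each dimension is num_tiles(n) * TILE, so it is a multiple of the local size, and the bounds checks handle the padding.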

Files and Directories

Makefile: Automates the compilation and execution process for the project.
main.c: Source code for the tiled matrix multiplication using OpenCL.
Dataset/dataset_generator.py: Source code for generating random matrices.
kernel.cl: OpenCL kernel file for performing tiled matrix multiplication.

How to Compile

The main.c and kernel.cl files contain the code for the programming assignment. Build and run it by typing make from the PA4 folder. This generates a solution output file.

How to Test

Use the make run command to test your program. There are a total of 11 tests on which your program will be evaluated for (functional) correctness. We will use the last test case (testcase 10) to verify that your program meets the speedup requirement that you should get from using shared memory. The timing requirements are only strict enough to ensure that students cannot submit PA3’s solution for PA4 and get credit. Use the make time command to see timing details for your kernel. Your kernel must run in less than 43 ms.

Dataset Generation (Optional)

The dataset required to test the program is already generated. If you are interested in how the dataset is generated please refer to the dataset_generator.py file in the Dataset folder. To recreate the dataset, run the dataset_generator.py file.

Submission

Submit the main.c and kernel.cl files on Gradescope. Keep the file names and the kernel name unchanged, because the kernel name is used to identify and time the kernel code. Gradescope will only accept one submission per hour. Please do not use Gradescope to time your code.

Grading

You will be graded on correctness (95 pts) and on your kernel runtime on a GTX 1080 Ti. Time penalties are subtracted from your correctness score:

$\text{kernel runtime} \geq 43 \text{ ms} : -95 \text{pts}$
$40 \text{ ms} \leq \text{kernel runtime} < 43 \text{ ms} : -10 \text{pts}$
$15 \text{ ms} \leq \text{kernel runtime} < 40 \text{ ms} : -5 \text{pts}$
$\text{kernel runtime} < 15 \text{ ms} : -0 \text{pts}$
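
As a quick way to check which bracket a measured time falls into, the penalty table above can be written as a small C function. This is only an illustrative sketch; time_penalty is a hypothetical name and is not part of the grading scripts.

```c
#include <assert.h>

/* Points subtracted from the correctness score for a kernel runtime in ms,
 * following the four brackets listed above. */
int time_penalty(double runtime_ms) {
    assert(runtime_ms >= 0.0);
    if (runtime_ms >= 43.0) return 95;
    if (runtime_ms >= 40.0) return 10;
    if (runtime_ms >= 15.0) return 5;
    return 0;
}
```

For example, a kernel that is fully correct and runs in 20 ms would lose 5 points under this table, while one that runs in 43 ms or more would lose all 95.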

These times will be shown on the leaderboard.

Optional

Like the last PA, this PA has an optional section for bragging rights. A major point of using OpenCL is applying the same kernel to many devices. So the leaderboard also includes three other devices (a CPU, a Google Pixel Fold’s GPU, and a Thundercomm Rubik Pi’s GPU):

Platform	Device
Intel® OpenCL	Intel® Xeon® Platinum 8275CL
Portable Computing Language	NVIDIA GeForce GTX 1080 Ti
Google Tensor	Mali-G710 r0p0
Qualcomm Dragonwing™	Qualcomm Adreno™ 643

Good implementations may optimize for one device; great implementations will optimize for all of the devices you are targeting.

Hint 0: If you search for “//@@ Hint”, you will find how to identify the platform and device name, in case that is useful for optimizing per device.

Hint 1: You will not have access to the Rubik Pis and phones, and the rate limit still applies on Gradescope. What kind of optimizations can you make without device access but with publicly available information about the device? How can you reduce trial and error here?
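
One possible starting point for Hint 1, sketched in plain C under stated assumptions: a device's local memory size and maximum work-group size (the values OpenCL reports as CL_DEVICE_LOCAL_MEM_SIZE and CL_DEVICE_MAX_WORK_GROUP_SIZE) are often published, so candidate tile widths can be ruled out before anything is submitted. The helper names tile_fits and largest_tile are hypothetical. The sketch also assumes one work-item per output element and two float tiles held in local memory per work-group, which may not match your kernel.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns 1 if a square tile x tile work-group fits the given device limits,
 * assuming two float tiles (one of A, one of B) live in local memory and
 * each work-item computes one output element. Returns 0 otherwise.
 */
int tile_fits(size_t tile, size_t local_mem_bytes, size_t max_wg_size) {
    assert(tile > 0);
    size_t work_items = tile * tile;
    size_t local_bytes = 2 * work_items * sizeof(float);
    return work_items <= max_wg_size && local_bytes <= local_mem_bytes;
}

/* Largest power-of-two tile width (64 or smaller) that fits, or 0 if none. */
size_t largest_tile(size_t local_mem_bytes, size_t max_wg_size) {
    for (size_t tile = 64; tile >= 1; tile /= 2) {
        if (tile_fits(tile, local_mem_bytes, max_wg_size))
            return tile;
    }
    return 0;
}
```

The largest tile that fits is not necessarily the fastest one. This kind of check only narrows the set of candidates worth spending a rate-limited submission on.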