CSC 430 Programming Parallel Computing Systems

Weber/Spring, 2014

Announcements

Old announcements

Assignments

A4: due 3/19
Write a CUDA program, modeled after minimum.cu, that calculates the minimum value in an array of floats, both on the GPU and on the CPU. Your code should demonstrate a speedup of the GPU over the CPU of around 5 for n = 1,000,000 data points. Turn in your CUDA source program via email.
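For the CPU half of the comparison, a plain loop is enough. The sketch below is host-side C only and is not taken from minimum.cu; the names cpu_minimum and speedup are hypothetical, and the GPU kernel and the timing calls are left out.

```c
#include <stddef.h>
#include <float.h>

/* Hypothetical CPU reference: smallest value in data[0..n-1].
 * Returns FLT_MAX when n == 0, so an empty array is easy to detect. */
float cpu_minimum(const float *data, size_t n)
{
    float m = FLT_MAX;
    for (size_t i = 0; i < n; i++) {
        if (data[i] < m) {
            m = data[i];
        }
    }
    return m;
}

/* Speedup of the GPU over the CPU: how many times faster the GPU run was.
 * Both times must be in the same unit, and gpu_time must be nonzero. */
double speedup(double cpu_time, double gpu_time)
{
    return cpu_time / gpu_time;
}
```

As an illustration with made-up timings, a CPU time of 10 ms against a GPU time of 2 ms gives speedup(10.0, 2.0) = 5, which is the target for this assignment. Comparing the value from cpu_minimum with the value your kernel produces is also a quick correctness check.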
A3: due 2/28
Using histogram3.cu as a starting point, change the gpuHistogram kernel so that each thread has its own private array of 256 bins, and each of the t = 256 * numSMs threads examines approximately n / t of the slots of the data array. The tricky part is adding all the per-thread bin values together to produce the final totals in the device global array gpuBins. Start by simply using atomicAdd; we will come up with a better way to do it before you turn it in.
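The work split and the final merge can be tried out on the CPU before writing the kernel. The following is a sequential C simulation, not the kernel itself and not the better merge mentioned above. It assumes the data are unsigned char values (one bin per byte value), gives each simulated thread one contiguous chunk, and uses the hypothetical names thread_range and histogram_private_bins.

```c
#include <stddef.h>
#include <string.h>

#define NUM_BINS 256

/* Hypothetical helper: half-open range [*begin, *end) of data slots that
 * simulated thread `tid` (0 <= tid < t, t > 0) examines. The first n % t
 * threads get one extra slot, so every thread sees about n / t slots. */
static void thread_range(size_t n, size_t t, size_t tid,
                         size_t *begin, size_t *end)
{
    size_t base = n / t;
    size_t extra = n % t;
    *begin = tid * base + (tid < extra ? tid : extra);
    *end = *begin + base + (tid < extra ? 1 : 0);
}

/* Sequential stand-in for the kernel: each simulated thread fills its own
 * private bins, then its counts are added into the shared totals. */
void histogram_private_bins(const unsigned char *data, size_t n, size_t t,
                            unsigned int bins[NUM_BINS])
{
    memset(bins, 0, NUM_BINS * sizeof bins[0]);
    for (size_t tid = 0; tid < t; tid++) {
        unsigned int local[NUM_BINS] = {0};   /* this thread's private bins */
        size_t begin, end;
        thread_range(n, t, tid, &begin, &end);
        for (size_t i = begin; i < end; i++) {
            local[data[i]]++;
        }
        /* Merge step: on the GPU many threads reach this point at once. */
        for (size_t b = 0; b < NUM_BINS; b++) {
            bins[b] += local[b];
        }
    }
}
```

On the device, the merge loop is where concurrent threads collide on the same global bin, which is why the kernel needs atomicAdd (or some other combining scheme) at that step. The contiguous chunking is only one possible layout; a strided assignment of slots to threads covers the same data.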

The way to do it is as follows:

A2: due 2/10
Experiment with this vectorRecurrence.cu program to answer some basic questions about device shared memory.  Send me, via email, the answers to the questions below and the modified vectorRecurrence.cu program.
  1. Compile the program without changing it. Be sure to use the -arch=compute_20 option.
  2. Which value of k is initially specified?
  3. Run the program with an input value of n = 512 * 14 = 7168.  Which device kernel is faster, the one not using shared memory or the one that does use shared memory?
  4. Run the program several times with the same values for n and k.  Is there much variation in the times reported?
  5. Change the program so that the value of k is 5.  Which device kernel is faster for k = 5, when you give an input size of n = 7168?
  6. Run the program for n = 7168 and each of k = 1, k = 2, k = 3, k = 4, and k = 5, and fill in the table below with the times for each value of k for each of the two kernels. Does there seem to be any pattern to the times? If so, try to describe the pattern.

                                        k = 1   k = 2   k = 3   k = 4   k = 5
      Times for gpuLinearRecur          _____   _____   _____   _____   _____
      Times for gpuSharedLinearRecur    _____   _____   _____   _____   _____

  7. What unit of measure are the times reported in?
  8. Now try to change the gpuSharedLinearRecur kernel so that it executes a bit faster, no matter what value of k is given.  Give a second copy of the table with the new times filled in and describe any patterns you see.
A1: due 1/31
Rework the thread example given below so that it creates 32 threads. Do the same with the process example below, creating 32 processes instead of threads. Try to figure out the clever way to do it. Turn in your source code only (not the executable) for both programs.
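For the process half, one observation worth testing is that every call to fork doubles the number of running processes, so a short loop grows the count quickly. The sketch below is only one possible approach, is not based on the course's process example, and uses the hypothetical name fork_tree. It counts the processes by having each one report the size of its subtree through its exit status.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: every live process forks once per level, so after
 * `levels` rounds there are 2^levels processes, counting the original.
 * Only the original process returns; it returns the total count.
 * Descendants report their subtree size via the exit status, which holds
 * values up to 255, so keep levels at 8 or below. */
int fork_tree(int levels)
{
    pid_t root = getpid();
    int children = 0;               /* children created by this process */

    for (int i = 0; i < levels; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            children = 0;           /* a new process has no children yet */
        } else if (pid > 0) {
            children++;
        }                           /* pid < 0: fork failed, nothing to count */
    }

    int total = 1;                  /* count this process itself */
    while (children-- > 0) {
        int status;
        if (wait(&status) > 0 && WIFEXITED(status)) {
            total += WEXITSTATUS(status);
        }
    }

    if (getpid() != root) {
        _exit(total);               /* descendants stop here */
    }
    return total;
}
```

Calling fork_tree(5) in the original process returns 32 when every fork succeeds, since 2^5 = 32. Whether the original process should count as one of the 32 is a question to settle against the assignment statement.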

Examples

Resources

Project

The course project is to convert aestable.c into a CUDA program.

The first thing to do is to register your team by email. As stated in the syllabus, exactly one team will have one or three members; all other teams will have two members. Whoever bids first for the odd-sized team will get the option.

I will be happy to assist in the formation of teams.

Project details: