blockDim.x to manipulate one-dimensional arrays in device global memory.
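For reference, a minimal sketch of that indexing pattern is shown below; the kernel and variable names are illustrative, not taken from the assignment code.

```
// Each thread computes a global index into a 1-D array in device global
// memory; the grid-stride loop lets a fixed number of threads cover an
// array of any length n.
__global__ void scaleArray(float *a, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (; i < n; i += stride)
        a[i] *= factor;
}
```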
gpuHistogram kernel so that each individual thread has its own array of 256 bins, and each of the t = 256 * numSMs threads looks at approximately n / t of the slots of the data array. The tricky part is to add all the individual threads' bin values together to produce the final totals needed in the device global array gpuBins. Start by just using atomicAdd, but we will come up with a better way to do it before you turn it in.
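A rough sketch of that first atomicAdd-based version is given below. The kernel signature, the name of the data array, and the use of numSMs in the launch configuration are assumptions here, since the starter code is not reproduced in this handout; gpuBins is assumed to be zeroed before the launch.

```
#define NUM_BINS 256

// Sketch of the per-thread-histogram idea: each thread keeps its own
// private array of 256 bins, walks its share of the data with a
// grid-stride loop, then folds its bins into the device global array
// with atomicAdd.  Launch with <<<numSMs, 256>>> so t = 256 * numSMs.
__global__ void gpuHistogram(const unsigned char *data, int n,
                             unsigned int *gpuBins) {
    unsigned int myBins[NUM_BINS];
    for (int b = 0; b < NUM_BINS; b++)
        myBins[b] = 0;

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;          // stride == t
    for (; i < n; i += stride)
        myBins[data[i]]++;

    // Combine this thread's private histogram into the global totals.
    for (int b = 0; b < NUM_BINS; b++)
        if (myBins[b] != 0)
            atomicAdd(&gpuBins[b], myBins[b]);
}
```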
| k = 1 | k = 2 | k = 3 | k = 4 | k = 5 |
|-------|-------|-------|-------|-------|
|       |       |       |       |       |
gpuSharedLinearRecur kernel so that it executes a bit faster, no matter what value of k is given. Give a second copy of the table with the new times filled in and describe any patterns you see.
-arch=compute_20 option since it uses double-precision floating-point numbers.
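For example, a compile line along these lines should work (the .cu file name here is only a placeholder):

```
nvcc -arch=compute_20 linearRecur.cu -o linearRecur
```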
g++ threadExample.cpp -pthread.
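The contents of threadExample.cpp are not reproduced in this section; a minimal program of the kind that needs the -pthread flag looks roughly like this:

```
#include <pthread.h>
#include <cstdio>

// Worker run by each spawned thread; the argument carries the thread's id.
void *worker(void *arg) {
    long id = (long) arg;
    printf("hello from thread %ld\n", id);
    return nullptr;
}

int main() {
    const int numThreads = 4;
    pthread_t threads[numThreads];

    // Create the threads, then join them so main waits for all to finish.
    for (long i = 0; i < numThreads; i++)
        pthread_create(&threads[i], nullptr, worker, (void *) i);
    for (int i = 0; i < numThreads; i++)
        pthread_join(threads[i], nullptr);

    return 0;
}
```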