CUDA 03 GPU Algorithms

There are three fundamental algorithms: reduce, scan and histogram.

A reduction in CUDA is a parallel algorithm often used to sum, multiply, or combine elements of an array to produce a single result (e.g., finding the sum of an array). It involves operations where threads cooperate to perform partial computations, typically using shared memory to aggregate results efficiently.

Reduction typically uses an iterative approach within CUDA kernels where operations are done in stages, with each stage reducing the data set size until one result is produced.

Serial reduction compared to parallel reduction you can see the efficiency of parallel reduction, the steps is log(n), massively reduced the steps needed.

for example, write simple codes on serial reduction, first declare it’s a cuda on kernal

__global__ void global_reduce_kernel(float * d_out, float * d_in)

Then thread identification

int myId = threadIdx.x + blockDim.x * blockIdx.x;  // Global thread ID
int tid = threadIdx.x;                             // Local thread ID within block

then reduction loop

for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
  if (tid < s) {
    d_in[myId] += d_in[myId + s];
}

__syncthreads();

if (tid == 0) {
d_out[blockIdx.x] = d_in[myId];
}

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.