There are three fundamental algorithms: reduce, scan and histogram.
A reduction in CUDA is a parallel algorithm often used to sum, multiply, or combine elements of an array to produce a single result (e.g., finding the sum of an array). It involves operations where threads cooperate to perform partial computations, typically using shared memory to aggregate results efficiently.
Reduction typically uses an iterative approach within CUDA kernels where operations are done in stages, with each stage reducing the data set size until one result is produced.
Serial reduction compared to parallel reduction you can see the efficiency of parallel reduction, the steps is log(n), massively reduced the steps needed.
for example, write simple codes on serial reduction, first declare it’s a cuda on kernal
__global__ void global_reduce_kernel(float * d_out, float * d_in)
Then thread identification
int myId = threadIdx.x + blockDim.x * blockIdx.x; // Global thread ID
int tid = threadIdx.x; // Local thread ID within block
then reduction loop
for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
if (tid < s) {
d_in[myId] += d_in[myId + s];
}
__syncthreads();
if (tid == 0) {
d_out[blockIdx.x] = d_in[myId];
}