Why GPUs, and Why Learn CUDA?

Due to the rapid growth of artificial intelligence, demand for Graphics Processing Units (GPUs) has increased significantly. NVIDIA GPUs are programmed through CUDA, a parallel computing platform and programming model, rather than a separate hardware-level language unique to each card. While an in-depth understanding of CUDA is not strictly necessary, a foundational grasp of its principles helps when applying AI methods effectively. A good starting point is writing a vector-addition function that runs on the GPU.

#include <iostream>
#include <cuda_runtime.h>

#define N 512  // Number of elements in the vectors

// Kernel function for vector addition
__global__ void vectorAdd(int *A, int *B, int *C, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n) {
        C[index] = A[index] + B[index];
    }
}

int main() {
    int size = N * sizeof(int);  // Total size of the arrays in bytes

    // Host memory allocation
    int *h_A = (int*)malloc(size);
    int *h_B = (int*)malloc(size);
    int *h_C = (int*)malloc(size);

    // Initialize host arrays
    for (int i = 0; i < N; i++) {
        h_A[i] = i;
        h_B[i] = i * 2;
    }

    // Device memory allocation
    int *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    // Copy data from host to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch the kernel: pick a block size and compute how many blocks are needed
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock; // Round up to ensure all elements are covered
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy result from device to host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Verify the result
    for (int i = 0; i < N; i++) {
        if (h_C[i] != h_A[i] + h_B[i]) {
            std::cerr << "Error at index " << i << ": " << h_C[i] << " != " << h_A[i] + h_B[i] << std::endl;
            return -1;
        }
    }

    std::cout << "Vector addition completed successfully!" << std::endl;

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);

    return 0;
}
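Assuming the program is saved as vector_add.cu (a file name chosen here for illustration), it can be compiled with NVIDIA's nvcc compiler, e.g. nvcc vector_add.cu -o vector_add, and then run like any ordinary executable.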

Note: __global__ declares a kernel, a function that runs on the GPU but is launched from the CPU. The program follows the standard CUDA workflow:

- Allocate memory for vectors A, B, and C on the CPU (host) with malloc: int *h_A = (int*)malloc(size); and likewise for h_B and h_C.
- Allocate the corresponding buffers on the GPU (device) with cudaMalloc: cudaMalloc((void**)&d_A, size); and likewise for d_B and d_C.
- Copy the input data from host to device with cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice); and the same for d_B.
- Launch the kernel: vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);.
- Copy the result back to the host with cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);.
- Free device memory with cudaFree and host memory with free.
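The listing above omits error handling for brevity. As a minimal sketch, the wrapper below (the name CUDA_CHECK is an illustrative choice, not part of the CUDA API) shows one common way to check the return codes of the runtime calls and of the kernel launch:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative error-checking helper (not part of the CUDA API itself)
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            std::fprintf(stderr, "CUDA error at %s:%d: %s\n",         \
                         __FILE__, __LINE__, cudaGetErrorString(err));\
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Usage around the same steps as above:
//   CUDA_CHECK(cudaMalloc((void**)&d_A, size));
//   CUDA_CHECK(cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice));
//   vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
//   CUDA_CHECK(cudaGetLastError());        // catches launch-configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // catches errors raised while the kernel runs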

In the context of GPUs and parallel programming frameworks such as CUDA or OpenCL, a kernel is a function written by the programmer that runs on the GPU. It is the code that gets executed by thousands of GPU threads in parallel.
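To make the idea of many threads running the same code concrete, here is a minimal, self-contained sketch (separate from the vector-addition program above) in which each thread reports its own indices using device-side printf:

#include <cstdio>
#include <cuda_runtime.h>

// Every thread executes the same kernel body; only its indices differ.
__global__ void whoAmI() {
    int globalIdx = threadIdx.x + blockIdx.x * blockDim.x;
    printf("block %d, thread %d -> global index %d\n",
           blockIdx.x, threadIdx.x, globalIdx);
}

int main() {
    // 2 blocks of 4 threads each: 8 threads run the kernel in parallel
    whoAmI<<<2, 4>>>();
    cudaDeviceSynchronize();  // wait for the GPU so the output is flushed
    return 0;
}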
