Why GPUs, and Why Learn CUDA?

Due to the rapid growth of artificial intelligence, demand for Graphics Processing Units (GPUs) has increased significantly. NVIDIA GPUs are programmed through CUDA, a parallel computing platform and programming model, rather than a separate hardware-level language unique to each card. While an in-depth understanding of CUDA is not strictly necessary, a foundational grasp of its principles helps when applying AI methods effectively. A good starting point is writing a vector-addition function that runs on the GPU.

#include <iostream>
#include <cuda_runtime.h>

#define N 512  // Number of elements in the vectors

// Kernel function for vector addition
__global__ void vectorAdd(int *A, int *B, int *C, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n) {
        C[index] = A[index] + B[index];
    }
}

int main() {
    int size = N * sizeof(int);  // Total size of the arrays in bytes

    // Host memory allocation
    int *h_A = (int*)malloc(size);
    int *h_B = (int*)malloc(size);
    int *h_C = (int*)malloc(size);

    // Initialize host arrays
    for (int i = 0; i < N; i++) {
        h_A[i] = i;
        h_B[i] = i * 2;
    }

    // Device memory allocation
    int *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    // Copy data from host to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch the kernel: pick a block size and compute how many blocks are needed
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock; // Round up to ensure all elements are covered
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy result from device to host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Verify the result
    for (int i = 0; i < N; i++) {
        if (h_C[i] != h_A[i] + h_B[i]) {
            std::cerr << "Error at index " << i << ": " << h_C[i] << " != " << h_A[i] + h_B[i] << std::endl;
            return -1;
        }
    }

    std::cout << "Vector addition completed successfully!" << std::endl;

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);

    return 0;
}
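Assuming the program is saved as vector_add.cu (a file name chosen here for illustration), it can be compiled with NVIDIA's nvcc compiler, e.g. nvcc vector_add.cu -o vector_add, and then run like any ordinary executable.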

Note: __global__ declares a kernel, a function that runs on the GPU but is launched from the CPU. The program follows the standard CUDA workflow:

- Allocate memory for vectors A, B, and C on the CPU (host) with malloc: int *h_A = (int*)malloc(size); and likewise for h_B and h_C.
- Allocate the corresponding buffers on the GPU (device) with cudaMalloc: cudaMalloc((void**)&d_A, size); and likewise for d_B and d_C.
- Copy the input data from host to device with cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice); and the same for d_B.
- Launch the kernel: vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);.
- Copy the result back to the host with cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);.
- Free device memory with cudaFree and host memory with free.
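The listing above omits error handling for brevity. As a minimal sketch, the wrapper below (the name CUDA_CHECK is an illustrative choice, not part of the CUDA API) shows one common way to check the return codes of the runtime calls and of the kernel launch:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative error-checking helper (not part of the CUDA API itself)
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            std::fprintf(stderr, "CUDA error at %s:%d: %s\n",         \
                         __FILE__, __LINE__, cudaGetErrorString(err));\
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Usage around the same steps as above:
//   CUDA_CHECK(cudaMalloc((void**)&d_A, size));
//   CUDA_CHECK(cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice));
//   vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
//   CUDA_CHECK(cudaGetLastError());        // catches launch-configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // catches errors raised while the kernel runs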

In the context of GPUs and parallel programming frameworks such as CUDA or OpenCL, a kernel is a function written by the programmer that runs on the GPU. It is the code that gets executed by thousands of GPU threads in parallel.
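To make the idea of many threads running the same code concrete, here is a minimal, self-contained sketch (separate from the vector-addition program above) in which each thread reports its own indices using device-side printf:

#include <cstdio>
#include <cuda_runtime.h>

// Every thread executes the same kernel body; only its indices differ.
__global__ void whoAmI() {
    int globalIdx = threadIdx.x + blockIdx.x * blockDim.x;
    printf("block %d, thread %d -> global index %d\n",
           blockIdx.x, threadIdx.x, globalIdx);
}

int main() {
    // 2 blocks of 4 threads each: 8 threads run the kernel in parallel
    whoAmI<<<2, 4>>>();
    cudaDeviceSynchronize();  // wait for the GPU so the output is flushed
    return 0;
}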
