The rapid growth of artificial intelligence has driven a significant increase in demand for Graphics Processing Units (GPUs). NVIDIA GPUs are programmed through CUDA, a parallel computing platform and C++ language extension (not a hardware-level language unique to each GPU). While an in-depth understanding of CUDA is not strictly necessary, a foundational grasp of its principles is essential for applying AI methods effectively. A good starting point is writing a function that performs vector addition on the GPU.
#include <iostream>
#include <cstdlib>          // malloc, free
#include <cuda_runtime.h>

#define N 512  // Number of elements in the vectors

// Kernel function for vector addition
__global__ void vectorAdd(int *A, int *B, int *C, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n) {
        C[index] = A[index] + B[index];
    }
}

int main() {
    int size = N * sizeof(int);  // Total size of the arrays in bytes

    // Host memory allocation
    int *h_A = (int*)malloc(size);
    int *h_B = (int*)malloc(size);
    int *h_C = (int*)malloc(size);

    // Initialize host arrays
    for (int i = 0; i < N; i++) {
        h_A[i] = i;
        h_B[i] = i * 2;
    }

    // Device memory allocation
    int *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    // Copy data from host to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch kernel with enough blocks to cover all N elements
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock; // Round up so every element is covered
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy result from device to host (this call also waits for the kernel to finish)
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Verify the result
    for (int i = 0; i < N; i++) {
        if (h_C[i] != h_A[i] + h_B[i]) {
            std::cerr << "Error at index " << i << ": " << h_C[i] << " != " << h_A[i] + h_B[i] << std::endl;
            return -1;
        }
    }
    std::cout << "Vector addition completed successfully!" << std::endl;

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);

    return 0;
}
Note: __global__ declares a kernel, a function that runs on the GPU and is launched from the CPU. The calls int *h_A = (int*)malloc(size); int *h_B = (int*)malloc(size); int *h_C = (int*)malloc(size); allocate memory for vectors A, B, and C on the CPU (host), while cudaMalloc((void**)&d_A, size); cudaMalloc((void**)&d_B, size); cudaMalloc((void**)&d_C, size); allocate the corresponding buffers on the GPU (device). cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice); and cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice); copy the input data from host to device. The kernel is then launched with vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N); the result is copied back to the host with cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost); and finally memory is released with cudaFree(d_A); cudaFree(d_B); cudaFree(d_C); on the device and free(h_A); free(h_B); free(h_C); on the host.
In the context of GPUs (Graphics Processing Units) and parallel programming, particularly with frameworks like CUDA or OpenCL, a kernel is a function written by the programmer that runs on the GPU. It represents the code that gets executed by thousands of GPU threads in parallel.