Versatile “Kernels”

The concept of a kernel is everywhere. Here's a comparison of the various contexts in which the term "kernel" is used, organized in a table:

| Context        | Definition                                  | Functionality                                                 | Common Characteristics                                |
| GPU Computing  | A function executed in parallel on the GPU. | Performs data-processing tasks efficiently using parallelism. | Written in CUDA/OpenCL, optimized for large datasets. |
| Linear Algebra | The set of vectors mapped to zero by …      |                                                               |                                                       |

… Continue reading Versatile “Kernels”
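To make the linear-algebra sense of "kernel" concrete, here is a minimal sketch (the matrix and the use of numpy are illustrative assumptions, not from the post): the kernel (null space) of a matrix A is the set of vectors x with A·x = 0, and it can be read off from the SVD.

```python
import numpy as np

# The kernel (null space) of A is the set of vectors x with A @ x = 0.
# Illustrative matrix: rank 1, so its kernel in R^2 is 1-dimensional.
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])

# Rows of Vt whose singular values are (numerically) zero span the kernel.
U, s, Vt = np.linalg.svd(A)
null_mask = s < 1e-10
kernel_basis = Vt[null_mask].T   # columns form a basis of the kernel

# Verify: A maps every kernel basis vector to (numerically) zero.
print(np.allclose(A @ kernel_basis, 0))  # → True
```

The same "things mapped to zero" idea is what distinguishes this usage from the GPU sense of kernel, where the word just means a function launched across many threads.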

LLM.c Codes from Andrej’s github

/* GPT-2 Transformer Neural Net training loop. See README.md for usage. */
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdarg.h>
#include <string>
#include <string_view>
#include <sys/stat.h>
#include <sys/types.h>
// ----------- CPU utilities -----------
// defines: fopenCheck, freadCheck, fcloseCheck, fseekCheck, mallocCheck
// defines: create_dir_if_not_exists, find_max_step, ends_with_bin
#include "llmc/utils.h"
// defines: tokenizer_init, tokenizer_decode, tokenizer_free
#include "llmc/tokenizer.h"
// … Continue reading LLM.c Codes from Andrej’s github

Reproduce GPT2 (124M) by Andrej Karpathy LLM.c

In his latest talk at the CUDA event, Andrej showcased his work on replicating the GPT-2 LLM using C and CUDA, effectively eliminating reliance on PyTorch and all dependencies except one. The key takeaway is profound: PyTorch, once considered a massive and indispensable package for LLM and AI programming, is essentially a crutch for when … Continue reading Reproduce GPT2 (124M) by Andrej Karpathy LLM.c

Reproduce GPT2 (124M) by Andrej Karpathy 2 Self-Attention Transformer

The key content here comes from the 2017 paper "Attention Is All You Need". So what is attention? Attention is a communication mechanism: it can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum over all nodes that point to them, with data-dependent weights. But … Continue reading Reproduce GPT2 (124M) by Andrej Karpathy 2 Self-Attention Transformer
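The weighted-sum view above can be sketched in a few lines of numpy (the shapes and variable names here are illustrative, not taken from llm.c): each position aggregates the value vectors of the positions that point to it, with weights produced by a softmax over data-dependent scores.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 8                      # sequence length, head dimension (illustrative)
Q = rng.standard_normal((T, d))  # queries
K = rng.standard_normal((T, d))  # keys
V = rng.standard_normal((T, d))  # values

# Data-dependent affinities between positions, scaled by sqrt(d).
scores = Q @ K.T / np.sqrt(d)    # (T, T)

# Causal mask: node t may only aggregate from nodes <= t (the edges pointing to it).
mask = np.tril(np.ones((T, T), dtype=bool))
scores = np.where(mask, scores, -np.inf)

# Row-wise softmax turns affinities into weights that sum to 1.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

out = weights @ V                # each row is a weighted sum of value vectors
print(out.shape)                 # (4, 8)
```

Note that position 0 can only attend to itself, so its weight row is exactly [1, 0, 0, 0] — the "directed graph" structure of the mask is visible directly in the weight matrix.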

Reproduce GPT2 (124M) by Andrej Karpathy 2 Weights and Bias Initials Normalization, BatchNorm and BackProp in Makemore

Diving deeper into the Makemore code to illustrate subtle details that affect the network's output. For example, "dead neurons" arise when the squashing function — say, tanh — squashes too many inputs to the saturation points of -1 and +1, killing the gradients flowing back to the previous layer. By rescaling the initial weights and biases, the … Continue reading Reproduce GPT2 (124M) by Andrej Karpathy 2 Weights and Bias Initials Normalization, BatchNorm and BackProp in Makemore
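The saturation effect is easy to demonstrate numerically (a minimal sketch with made-up layer sizes, not the Makemore code itself): since d tanh(z)/dz = 1 − tanh(z)², any unit pinned near ±1 passes almost no gradient back, and downscaling the initial weights sharply reduces how many units saturate.

```python
import numpy as np

rng = np.random.default_rng(1)
fan_in = 100
x = rng.standard_normal((1000, fan_in))       # incoming activations (illustrative)

def saturated_fraction(scale):
    """Fraction of tanh outputs pinned near ±1 for a given weight-init scale."""
    W = rng.standard_normal((fan_in, 200)) * scale
    h = np.tanh(x @ W)
    # d tanh/dz = 1 - tanh(z)^2 is near 0 when |h| ~ 1: the gradient dies there.
    return np.mean(np.abs(h) > 0.99)

naive = saturated_fraction(1.0)                    # unit-variance init
scaled = saturated_fraction(1.0 / np.sqrt(fan_in)) # downscaled (Xavier-style) init
print(naive, scaled)   # most units saturated vs. almost none
```

With unit-variance weights the pre-activations have standard deviation ≈ √fan_in = 10, so tanh is almost always pinned; dividing by √fan_in brings the pre-activations back to unit scale and the saturation fraction collapses.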

Reproduce GPT2 (124M) by Andrej Karpathy 3 Tokenization

Tokenization is the process of breaking text down into smaller units (tokens) such as words, subwords, or characters. Different tokenization methods are used depending on the task, the language, and the requirements of the model: word-based tokenization; character-based tokenization, which greatly increases computation cost; and subword-based tokenization such as Byte Pair Encoding (BPE), WordPiece (used by BERT), and SentencePiece. Sentence … Continue reading Reproduce GPT2 (124M) by Andrej Karpathy 3 Tokenization
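The core move of BPE can be sketched in a few lines (a toy corpus and helper names of my own, not GPT-2's actual tokenizer): start from raw bytes, find the most frequent adjacent pair of tokens, and merge it into a new token id, shortening the sequence.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the most common adjacent (token, token) pair in the sequence."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with the single token `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "aaabdaaabac"
ids = list(text.encode("utf-8"))   # start from raw bytes, as GPT-2's BPE does
pair = most_frequent_pair(ids)     # the byte pair ('a', 'a') occurs most often
ids = merge(ids, pair, 256)        # 256 = first id beyond the 0..255 byte range
print(len(text), len(ids))         # 11 bytes -> 9 tokens after one merge
```

Repeating this merge step builds up the vocabulary; subword tokenizers like this sit between the character-based extreme (long sequences, high compute) and the word-based extreme (huge vocabulary, no handling of unseen words).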