What is a residual pathway? A residual pathway (or residual connection) is a mechanism in neural networks that allows the original input of a layer to be added directly to its output. It was introduced in ResNet (Residual Networks) and has since become a fundamental component in modern architectures like Transformers. In a Transformer block, the residual pathway is used after components like self-attention and feed-forward layers.
Why use a residual pathway? Let's look at an example. Say GPT is generating text and the input sequence is "The cat sat". As this sequence flows through many stacked blocks, the residual pathway keeps the original token information available at every layer and gives gradients a direct route back to earlier layers, which makes deep networks much easier to train.
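To make this concrete, here is a minimal sketch of one Transformer block with residual connections (the layer sizes, the pre-norm arrangement, and the random embeddings are illustrative assumptions, not code from GPT itself):

import torch
import torch.nn as nn

# Toy dimensions for the "The cat sat" example (illustrative values)
d_model = 8      # embedding size
T = 3            # three tokens: "The", "cat", "sat"

x = torch.randn(1, T, d_model)   # input embeddings for the sequence

attn = nn.MultiheadAttention(d_model, num_heads=2, batch_first=True)
ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
ln1, ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

# Residual pathway: the block's original input x is added back to each
# sub-layer's output, so the block only has to learn a refinement of its input.
attn_out, _ = attn(ln1(x), ln1(x), ln1(x))
x = x + attn_out          # residual connection 1 (around self-attention)
x = x + ffn(ln2(x))       # residual connection 2 (around the feed-forward layer)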

What is Multi-Head Attention? Multi-head attention allows the model to focus on different aspects of the input sequence simultaneously, extracting nuanced relationships between words. Each “head” in multi-head attention looks at the sequence from a different perspective, capturing patterns like:
- How “cat” relates to “sat” (subject-action).
- How “cat” relates to “The” (determiner-subject).
- How “sat” relates to “The” (overall sentence structure).
Without multi-head attention, GPT would struggle to grasp complex relationships and patterns, limiting its ability to generate coherent and contextually relevant text.
Using the same "The cat sat" example, the sketch below shows how each head computes its own attention pattern over the three tokens.
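A minimal sketch of that idea (the dimensions and random projection matrices are illustrative assumptions; a trained model would learn them):

import torch

torch.manual_seed(0)
tokens = ["The", "cat", "sat"]
d_model, n_head = 8, 2
head_dim = d_model // n_head

x = torch.randn(len(tokens), d_model)               # one embedding per token

# Each head gets its own slice of the Q/K projections
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
q = (x @ W_q).view(len(tokens), n_head, head_dim)   # (T, n_head, head_dim)
k = (x @ W_k).view(len(tokens), n_head, head_dim)

for h in range(n_head):
    scores = q[:, h, :] @ k[:, h, :].T / head_dim ** 0.5
    weights = torch.softmax(scores, dim=-1)          # (T, T) attention pattern
    print(f"head {h} attention pattern:\n{weights}")
# Each head produces a different T x T pattern, i.e. a different "perspective"
# on how "The", "cat" and "sat" relate to each other.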

How are the weight matrices W_Q, W_K and W_V calculated? The training steps are outlined below, with a minimal code sketch after the list.
- Initialization:
- W_Q, W_K, and W_V are initialized randomly (e.g., using Xavier or He initialization).
- Forward Pass:
- The input embeddings (X) are transformed into Q, K, and V using the weight matrices.
- Loss Computation:
- The model computes a loss based on its predictions and the ground truth (e.g., cross-entropy loss for text generation tasks).
- Backpropagation:
- Gradients of the loss with respect to W_Q, W_K, and W_V are computed.
- Weight Update:
- Using an optimization algorithm (e.g., Adam), W_Q, W_K, and W_V are updated to minimize the loss.
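Here is a minimal training sketch tying the five steps together, with nn.Linear layers standing in for W_Q, W_K and W_V (the dummy data, sizes, and toy prediction head are assumptions for illustration; causal masking is omitted for brevity):

import torch
import torch.nn as nn

d_model, vocab_size, T = 16, 27, 8

# Initialization: nn.Linear initializes its weight matrix for us
# (Xavier/He-style schemes can be applied explicitly with nn.init if desired).
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)
head = nn.Linear(d_model, vocab_size)        # toy prediction head
embed = nn.Embedding(vocab_size, d_model)

params = list(embed.parameters()) + list(W_q.parameters()) + \
         list(W_k.parameters()) + list(W_v.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

tokens = torch.randint(0, vocab_size, (4, T))    # dummy batch of token ids
targets = torch.randint(0, vocab_size, (4, T))   # dummy ground-truth next tokens

for step in range(100):
    x = embed(tokens)                            # (B, T, d_model)
    q, k, v = W_q(x), W_k(x), W_v(x)             # forward pass: X -> Q, K, V
    attn = torch.softmax(q @ k.transpose(-2, -1) / d_model ** 0.5, dim=-1)
    out = attn @ v
    logits = head(out)                           # (B, T, vocab_size)
    # loss computation: cross-entropy against the ground-truth tokens
    loss = nn.functional.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
    optimizer.zero_grad()
    loss.backward()                              # backpropagation: gradients w.r.t. W_Q, W_K, W_V
    optimizer.step()                             # weight update with Adam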

What is Causal Self-Attention? Causal self-attention is a variation of the self-attention mechanism designed to respect the temporal or sequential nature of input data. It ensures that each token in the input sequence only attends to itself and the tokens before it, not the tokens that come after. This property is crucial for tasks like language modeling, where future tokens should not influence predictions for the current token.
Intuitively, you can see why a lower-triangular mask matrix like the one below is used to realize causal self-attention:
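For a sequence of length 4, for example, the mask is simply a lower-triangular matrix of ones:

import torch

T = 4
mask = torch.tril(torch.ones(T, T))
print(mask)
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])
# Row i has 1s only up to column i: token i may attend to itself
# and to earlier tokens, never to later ones.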

import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, n_head, block_size):
        super().__init__()
        assert d_model % n_head == 0
        self.n_head = n_head
        self.d_model = d_model
        self.block_size = block_size
        # one linear layer produces Q, K and V for all heads at once
        self.qkv_proj = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # lower-triangular causal mask, stored as a non-trainable buffer
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.size()  # batch size, sequence length, embedding size
        qkv = self.qkv_proj(x)          # (B, T, 3 * C)
        q, k, v = qkv.chunk(3, dim=-1)  # each (B, T, C)
        # split the embedding dimension into n_head heads of size C // n_head
        head_dim = C // self.n_head
        q = q.view(B, T, self.n_head, head_dim).transpose(1, 2)  # (B, n_head, T, head_dim)
        k = k.view(B, T, self.n_head, head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, head_dim).transpose(1, 2)
        # scaled dot-product attention scores
        attn_scores = (q @ k.transpose(-2, -1)) / (head_dim ** 0.5)  # (B, n_head, T, T)
        # apply the causal mask: future positions get -inf, so softmax gives them zero weight
        attn_scores = attn_scores.masked_fill(self.mask[:T, :T] == 0, float('-inf'))
        attn_probs = torch.softmax(attn_scores, dim=-1)
        # weighted sum of values, then merge the heads back together
        out = attn_probs @ v                                   # (B, n_head, T, head_dim)
        out = out.transpose(1, 2).contiguous().view(B, T, C)   # (B, T, C)
        return self.out_proj(out)
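A quick usage check (the shapes here are illustrative):

x = torch.randn(4, 10, 64)                                     # (B=4, T=10, C=64)
attn = CausalSelfAttention(d_model=64, n_head=4, block_size=128)
print(attn(x).shape)                                           # torch.Size([4, 10, 64])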
In sequence processing and attention mechanisms, B, T, and C are often used to describe the dimensions of a tensor (multi-dimensional array).
- Batch Size (B): Affects the throughput of training. Larger batches improve parallelism but require more memory.
- Sequence Length (T): Determines the amount of context a model processes. Longer sequences provide more information but are computationally expensive.
- Embedding Size (C): Encodes the richness of features. Higher C captures more details but increases model size.
So in Andrej's names.txt example, he uses characters to predict the next letter in a name. With a bigram setup and "emma" as the example, the pairs .e, em, mm, ma give a batch size B = 4 and a sequence length T = 2. The feature dimension C refers to the representation size of each token: if each character is mapped to a one-hot vector of size 27 (26 letters plus 1 special token for .), then C = 27; if an embedding table maps characters to a smaller embedding (e.g., of size 2), then C = 2.
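A small sketch of those shapes (not Andrej's actual code, just an illustration of the tensors involved):

import torch
import torch.nn as nn

# Four bigram examples built from "emma": ".e", "em", "mm", "ma"
stoi = {ch: i for i, ch in enumerate(".abcdefghijklmnopqrstuvwxyz")}
pairs = [".e", "em", "mm", "ma"]
xs = torch.tensor([[stoi[a], stoi[b]] for a, b in pairs])    # shape (B=4, T=2)

one_hot = nn.functional.one_hot(xs, num_classes=27).float()  # shape (4, 2, 27) -> C = 27
emb = nn.Embedding(27, 2)                                    # embedding table of size 2
embedded = emb(xs)                                           # shape (4, 2, 2)  -> C = 2

print(xs.shape, one_hot.shape, embedded.shape)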