
Building a Mini-GPT Model from Scratch in PyTorch



Somendu Patel | May 29, 2025

Ever wondered how models like ChatGPT actually work under the hood? In this blog, we’ll take a friendly but technically detailed tour through the inner workings of a GPT-style model — yes, the same kind of architecture that powers some of today’s most advanced AI systems.

We won’t just talk about theory; we’ll also show how to implement a simplified version from scratch using PyTorch. Whether you’re a machine learning enthusiast, a researcher trying to demystify transformers, or just curious about how these large language models are put together — this guide is for you.

Let’s break down complex ideas into understandable pieces and build our very own tiny GPT.

To process natural language, the text must be broken down into smaller units or tokens. This is handled using a tokenizer. In this implementation, torchtext’s basic_english tokenizer splits the text into lowercase words and punctuation.

A vocabulary is then built from these tokens, assigning a unique integer index to each one. Special tokens such as <unk>, <sos>, and <eos> are included to handle unknown words and mark sequence boundaries.

tokenizer = get_tokenizer("basic_english")
# build_vocab_from_iterator expects an iterator over lists of tokens
vocab = build_vocab_from_iterator([tokenizer(text)], specials=['<unk>', '<sos>', '<eos>'])
vocab.set_default_index(vocab['<unk>'])  # out-of-vocabulary tokens map to <unk>

Only at this point do we convert the textual data into numerical format (input IDs) that the model can understand. These IDs are reshaped into a matrix of shape (batch_size, sequence_length) to facilitate batch processing during training. The vocabulary also provides OOV handling and token-to-id transformation.

A Tokenizer class is designed to:

  • Tokenize each line of input text
  • Build a vocabulary using token frequency
  • Transform tokens into numerical IDs
  • Reshape the resulting token stream into a batch-friendly format

This forms the essential preprocessing pipeline for feeding data into a transformer.
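To make the pipeline concrete, here is a minimal sketch of such a Tokenizer class. It is illustrative rather than the author's exact code, and it assumes the <unk>, <sos>, and <eos> special tokens introduced above:

import torch
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

class Tokenizer:
    def __init__(self, lines, batch_size):
        self.batch_size = batch_size
        self.tokenizer = get_tokenizer("basic_english")
        # Build a vocabulary from the tokenized lines, with specials for OOV words and boundaries
        self.vocab = build_vocab_from_iterator(
            (self.tokenizer(line) for line in lines),
            specials=['<unk>', '<sos>', '<eos>'],
        )
        self.vocab.set_default_index(self.vocab['<unk>'])  # unseen tokens map to <unk>

    def encode(self, lines):
        # Tokenize each line and transform its tokens into numerical IDs
        ids = [self.vocab[tok] for line in lines for tok in self.tokenizer(line)]
        # Reshape the flat ID stream into a batch-friendly (batch_size, sequence_length) matrix
        seq_len = len(ids) // self.batch_size
        data = torch.tensor(ids[: seq_len * self.batch_size])
        return data.view(self.batch_size, seq_len)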

Language models are trained to predict the next word in a sequence. To do this, we:

  • Add <sos> at the beginning and <eos> at the end of each input sequence
  • Create labels by shifting the input one position, so that each label is the next token

We also apply masking so that positions without a valid target (such as padding) are ignored during loss calculation:

y = x.clone()           # start the labels as a copy of the inputs
y[:,0:-1] = x[:,1:]     # each label is the next input token
y[:,-1] = -100          # no next token for the final position; -100 is ignored by the loss

This shifted-label setup is a fundamental aspect of training autoregressive models like GPT. A second, separate kind of masking is just as vital: a causal mask applied to the attention weights prevents the model from peeking at future tokens during training.
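For reference, such a causal mask is just a lower-triangular matrix (a minimal sketch; seq_len is illustrative):

import torch

seq_len = 8  # illustrative sequence length
# Entry (i, j) is 1 if position i may attend to position j (j <= i), else 0
causal_mask = torch.tril(torch.ones(seq_len, seq_len))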

The core component of GPT is masked multi-head attention. This module projects input embeddings into query (Q), key (K), and value (V) vectors. The attention weights are computed by taking the dot product of Q and K, scaling by the square root of the key dimension, and applying a softmax after masking.

Attention allows the model to weigh the relevance of other tokens in the sequence when predicting a token. In GPT, we use masked self-attention to ensure the model doesn't peek ahead at future tokens.

This is done by matrix multiplication of queries and keys, followed by scaling, masking, and a softmax:

scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(dk)   # scaled dot-product scores
scores = scores.masked_fill(mask == 0, -1e9)                    # hide future positions before the softmax

Each head captures a different contextual relationship. The masked scores are normalized with a softmax to obtain the attention weights:

attn = F.softmax(scores, dim=-1)   # attention weights over visible (past and present) tokens

Each attention head operates independently; their outputs are concatenated and passed through a final linear projection, which lets the model attend to different subspaces of information in parallel. The mask ensures causal (left-to-right) processing, so the softmax distributes attention only over visible (past and present) tokens.

Key design choices:

  • Linear projections for Q, K, V for each head
  • Scaled dot-product attention with masking
  • Shared output projection to combine head outputs
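Combining these choices, a masked multi-head attention module might look like the following sketch. The names and shapes are illustrative assumptions, not the author's exact implementation:

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedMultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        # Linear projections for Q, K, V (all heads at once) plus the shared output projection
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, _ = x.shape
        # Project and split into heads: (B, n_heads, T, d_k)
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product attention with a causal (lower-triangular) mask
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        mask = torch.tril(torch.ones(T, T, device=x.device))
        scores = scores.masked_fill(mask == 0, -1e9)
        attn = F.softmax(scores, dim=-1)
        # Concatenate the heads and apply the shared output projection
        out = torch.matmul(attn, v).transpose(1, 2).contiguous().view(B, T, -1)
        return self.out_proj(out)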

Each transformer block includes a position-wise FFN, which applies two linear transformations with a ReLU activation in between. This adds depth and non-linearity to the model, allowing it to model complex functions over the attention outputs.

Parameters:

  • d_model: Dimension of embeddings
  • d_ff: Usually 4x d_model for increased capacity
hidden = F.relu(self.fc1(x))   # expand to d_ff and apply the non-linearity
output = self.fc2(hidden)      # project back down to d_model

This adds expressiveness to the model beyond what attention can capture alone.
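A minimal sketch of the feedforward sub-layer, reusing the fc1/fc2 names from the snippet above (illustrative, not the author's exact code):

import torch.nn as nn
import torch.nn.functional as F

class PositionwiseFFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)   # expand; d_ff is usually 4x d_model
        self.fc2 = nn.Linear(d_ff, d_model)   # project back to the embedding dimension

    def forward(self, x):
        hidden = F.relu(self.fc1(x))
        return self.fc2(hidden)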

Since transformers do not have recurrence or convolution, they need a way to capture the order of words. So we add positional encodings to the input embeddings to provide a sense of token order. The encoding uses a fixed sinusoidal pattern based on the token’s position and dimension index.

This encoding is added to the input embeddings and helps the model distinguish between tokens in different positions.

position = torch.arange(seq_len).unsqueeze(1)   # token positions, shape (seq_len, 1)
encoding = torch.sin(position * freq_tensor)    # freq_tensor holds the per-dimension frequencies
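A complete sinusoidal encoding alternates sine and cosine across the embedding dimensions. The following is a minimal sketch under that assumption, not necessarily the author's exact module:

import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)
        freq_tensor = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * freq_tensor)   # even dimensions: sine
        pe[:, 1::2] = torch.cos(position * freq_tensor)   # odd dimensions: cosine
        self.register_buffer("pe", pe)

    def forward(self, x):
        # x: (batch_size, seq_len, d_model); add the encoding for the first seq_len positions
        return x + self.pe[: x.size(1)]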

The model consists of a stack of decoder layers, each containing:

  • Multi-head masked attention
  • Add & LayerNorm
  • Feedforward network
  • Add & LayerNorm

Each decoder layer processes its input through self-attention followed by FFN, with residual connections and normalization applied at each stage.

Each layer refines token representations by blending contextual information from previous tokens and updating the representation using learned weights. Layer normalization and residual connections ensure stable training.

x = self.layer_norm1(x + self.mhma(x))   # masked multi-head attention with a residual connection
x = self.layer_norm2(x + self.ffn(x))    # position-wise FFN with a residual connection

This stacking allows the model to build increasingly abstract features at each level, learning contextual representations that evolve across the layers.
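Assuming the MaskedMultiHeadAttention and PositionwiseFFN sketches from earlier, a decoder layer and a stack of such layers could look like this illustrative sketch:

import torch.nn as nn

class DecoderLayer(nn.Module):
    # Reuses the MaskedMultiHeadAttention and PositionwiseFFN sketches above
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.mhma = MaskedMultiHeadAttention(d_model, n_heads)
        self.ffn = PositionwiseFFN(d_model, d_ff)
        self.layer_norm1 = nn.LayerNorm(d_model)
        self.layer_norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.layer_norm1(x + self.mhma(x))   # attention sub-layer with residual connection
        x = self.layer_norm2(x + self.ffn(x))    # feedforward sub-layer with residual connection
        return x

class Decoder(nn.Module):
    def __init__(self, n_layers, d_model, n_heads, d_ff):
        super().__init__()
        self.layers = nn.ModuleList(
            DecoderLayer(d_model, n_heads, d_ff) for _ in range(n_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x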

The input token IDs are first passed through an nn.Embedding layer, which maps each token index to a learnable dense vector. Positional encoding is then added to retain order information.

The output of the last decoder layer is fed into a linear layer (PredictionHead) that maps the hidden states back to vocabulary space, producing logits for next-token prediction.

logits = self.pred(self.decoder(self.embed(x)))   # embed, decode, then project to vocabulary logits

These logits are used during training with a cross-entropy loss function.
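Tying the pieces together, a minimal top-level model could look like the sketch below. It reuses the PositionalEncoding and Decoder sketches above; the hyperparameter defaults are illustrative assumptions:

import torch.nn as nn

class MiniGPT(nn.Module):
    def __init__(self, vocab_size, d_model=128, n_heads=4, d_ff=512, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)       # token IDs -> dense vectors
        self.pos_enc = PositionalEncoding(d_model)           # add order information
        self.decoder = Decoder(n_layers, d_model, n_heads, d_ff)
        self.pred = nn.Linear(d_model, vocab_size)           # prediction head: hidden states -> logits

    def forward(self, x):
        # x: (batch_size, seq_len) of token IDs
        h = self.pos_enc(self.embed(x))
        return self.pred(self.decoder(h))                    # (batch_size, seq_len, vocab_size)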

We use CrossEntropyLoss, which works directly with logits and handles token-level classification. Padding tokens are masked using the ignore index:

loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
loss = loss_fn(logits.view(-1, vocab_size), targets.view(-1))
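Putting this into a training loop might look roughly like the following. The model, optimizer settings, and the batches iterable are illustrative assumptions rather than the author's exact setup:

import torch
import torch.nn as nn

model = MiniGPT(vocab_size=len(vocab))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

for epoch in range(10):
    for x, y in batches:                          # x, y: (batch_size, seq_len) inputs and shifted labels
        logits = model(x)                         # (batch_size, seq_len, vocab_size)
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()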


