A visual, hands-on journey through tokenization — from splitting raw text all the way to positional embeddings ready for a transformer.
```python
import re

text = "Hello, world!"

# Pattern: runs of word chars OR any single non-word, non-whitespace
# character (i.e., punctuation)
tokens = re.findall(r'\w+|[^\w\s]', text)
# → ['Hello', ',', 'world', '!']
```
A naive `.split()` leaves punctuation glued to its neighbors: `"Hello,"` and `"world!"` each come out as a single token, so the tokenizer can never treat `world` and `world!` as the same word. The regex pattern captures words and punctuation marks as separate tokens, preserving all the information in the string. Note that it also splits contractions such as `"don't"` apart at the apostrophe, as the comparison below shows.
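To make the contrast concrete, here is a quick side-by-side of the two approaches on a sentence with a contraction (the example sentence is illustrative; expected output is shown in comments):

```python
import re

text = "Don't stop, it's fine!"

# Naive whitespace split: punctuation stays glued to words.
print(text.split())
# → ["Don't", 'stop,', "it's", 'fine!']

# Regex tokenization: punctuation becomes separate tokens,
# and contractions are split at the apostrophe.
print(re.findall(r'\w+|[^\w\s]', text))
# → ['Don', "'", 't', 'stop', ',', 'it', "'", 's', 'fine', '!']
```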
```python
# token_ids: the full encoded corpus; max_length: the context window;
# stride: how far each window slides (stride < max_length → overlap).
for i in range(0, len(token_ids) - max_length, stride):
    input_chunk = token_ids[i : i + max_length]
    # Target is the input shifted right by 1 position: at every
    # offset the model learns to predict the next token.
    target_chunk = token_ids[i + 1 : i + max_length + 1]
```
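One common way to package these chunks for training is a small PyTorch `Dataset` wrapped in a `DataLoader`. The sketch below is illustrative, not from the original text: the class name `SlidingWindowDataset` and the toy token IDs are assumptions.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SlidingWindowDataset(Dataset):
    """Illustrative sliding-window dataset over a flat list of token IDs."""

    def __init__(self, token_ids, max_length, stride):
        self.inputs, self.targets = [], []
        for i in range(0, len(token_ids) - max_length, stride):
            self.inputs.append(torch.tensor(token_ids[i : i + max_length]))
            self.targets.append(torch.tensor(token_ids[i + 1 : i + max_length + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

# Toy corpus of 100 token IDs; stride == max_length gives
# non-overlapping windows.
dataset = SlidingWindowDataset(list(range(100)), max_length=4, stride=4)
loader = DataLoader(dataset, batch_size=8, shuffle=True)
```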
```python
# Broadcasting: [8, 4, 256] + [4, 256] → [8, 4, 256]
input_embeddings = token_embeddings + positional_embeddings

print(input_embeddings.shape)  # torch.Size([8, 4, 256])
```
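For context, here is a minimal sketch of where those two tensors come from, assuming learned `nn.Embedding` tables and the shapes printed above. The 50,257-entry vocabulary is GPT-2's, used here only as an example.

```python
import torch

vocab_size, context_length, d_model = 50257, 4, 256  # example sizes

token_embedding_layer = torch.nn.Embedding(vocab_size, d_model)
pos_embedding_layer = torch.nn.Embedding(context_length, d_model)

inputs = torch.randint(0, vocab_size, (8, context_length))  # [8, 4] token IDs
token_embeddings = token_embedding_layer(inputs)            # [8, 4, 256]
positional_embeddings = pos_embedding_layer(
    torch.arange(context_length)                            # positions 0..3
)                                                           # [4, 256]

input_embeddings = token_embeddings + positional_embeddings  # → [8, 4, 256]
```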
The complete pipeline, from raw text to transformer input:

| Step | What happens | Output |
|---|---|---|
| Raw text | Any raw text from your corpus: Wikipedia, code, books, anything. | `str` |
| Tokenize | Split text into subword pieces, map each to an integer using the vocabulary. | `List[int]` |
| Batch | Create overlapping (input, target) pairs with fixed context length. Wrap in a DataLoader. | `[batch, seq_len]` |
| Token embedding | Look up each token ID in a learned embedding table. Integers become floating-point vectors. | `[batch, seq_len, d_model]` |
| Positional embedding | Add learned position vectors so each token carries both identity and position information. | `[batch, seq_len, d_model]` |
| Transformer | The final tensor is fed into the first transformer layer. Self-attention and beyond. | `[batch, seq_len, d_model]` |
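Putting it all together, here is a compact end-to-end sketch. It assumes a GPT-2 BPE tokenizer via `tiktoken` (no specific tokenizer is named above) and reuses the shapes from the table.

```python
import tiktoken
import torch

text = "Hello, world! This is a tiny corpus for the demo."

# 1. Tokenize: str → List[int]
tokenizer = tiktoken.get_encoding("gpt2")
token_ids = tokenizer.encode(text)

# 2. Batch: one (input, target) pair with context length 4
max_length = 4
inputs = torch.tensor([token_ids[:max_length]])               # [1, 4]
targets = torch.tensor([token_ids[1 : max_length + 1]])       # [1, 4]

# 3–4. Embed tokens and positions, then add
d_model = 256
tok_emb = torch.nn.Embedding(tokenizer.n_vocab, d_model)
pos_emb = torch.nn.Embedding(max_length, d_model)
x = tok_emb(inputs) + pos_emb(torch.arange(max_length))       # [1, 4, 256]
# x is now ready to feed into the first transformer layer.
```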