NLP Pipeline · Interactive Explainer

From Words to Tensors

A visual, hands-on journey through tokenization — from splitting raw text all the way to positional embeddings ready for a transformer.

SCROLL TO EXPLORE
01
Regex Tokenization
The simplest approach — split text on punctuation and whitespace using a regular expression pattern.
// Try it — type any sentence below
Tokens
Token Chips (hover for index)
// What the code does
import re

text = "Hello, world!"

# Pattern: word chars OR any non-whitespace
tokens = re.findall(
  r'\w+|[^\w\s]', text
)
# → ['Hello', ',', 'world', '!']
// Why not just split on spaces?

A naive .split() only breaks on whitespace, so punctuation stays glued to its word ("Hello," stays one token) and contractions like "don't" are never separated. The regex pattern captures words and punctuation as individual tokens, preserving all the information in the text.

❌ "don't" → ["don't"] ✓ "don't" → ["don", "'", "t"]
02
Vocabulary Building
Map every unique token to an integer. Add special tokens for unknowns and document boundaries.
// Build vocabulary from your text
Live counters: vocab size · total tokens (updated as you type)
Special tokens always added:
<|unk|> <|endoftext|>
Type a word not in corpus:
// Vocabulary map (token → id)
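A rough sketch of what the demo builds under the hood, reusing the regex from step 01. The corpus string and the sorted ordering are illustrative; only the two special tokens come from this page.
// Sketch: building the token → id map
import re

def build_vocab(corpus):
    # Same pattern as step 01: whole words OR single punctuation marks.
    tokens = re.findall(r'\w+|[^\w\s]', corpus)
    # Stable ordering for unique tokens, then the special tokens at the end.
    unique = sorted(set(tokens))
    unique += ["<|unk|>", "<|endoftext|>"]
    return {tok: idx for idx, tok in enumerate(unique)}

vocab = build_vocab("Hello, world! Hello again.")
print(len(vocab))                            # vocab size (incl. special tokens)
print(vocab.get("xyzzy", vocab["<|unk|>"]))  # unseen word falls back to <|unk|>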
03
Byte Pair Encoding
Learn subword tokens by iteratively merging the most frequent adjacent pairs — solving the out-of-vocabulary problem.
📖
Deep dive on BPE
I've covered BPE in full detail — the OOV problem, merge loop, Python implementation, and an interactive visualizer — in a dedicated post.
Read: How Machines Learn to Read →
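One way to call a pretrained BPE tokenizer in code is the tiktoken library. A small usage sketch, not this page's implementation; the exact IDs depend on GPT-2's pretrained merges.
// Using GPT-2's pretrained BPE via tiktoken
import tiktoken

enc = tiktoken.get_encoding("gpt2")     # pretrained vocabulary + merge rules

ids = enc.encode("Hello, world!")       # subword token IDs, e.g. [15496, 11, 995, 0]
text = enc.decode(ids)                  # lossless round trip back to the string
print(ids, text)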
04
Sliding Window
Create (input, target) training pairs by sliding a fixed-length context window across the token sequence.
// Token sequence from your text
Input window
Target (next token)
INPUT IDs
TARGET IDs (shifted +1)
// Code
for i in range(0, len(token_ids) - max_length, stride):
    input_chunk  = token_ids[i : i + max_length]
    target_chunk = token_ids[i+1 : i + max_length + 1]
    # target is input shifted right by 1 position
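These (input, target) windows are typically wrapped in a PyTorch Dataset so a DataLoader can batch them, as the pipeline diagram below notes. A sketch under that assumption; the class name GPTDataset and the toy token_ids are placeholders.
// Sketch: windows → Dataset → DataLoader
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDataset(Dataset):
    def __init__(self, token_ids, max_length=4, stride=2):
        self.inputs, self.targets = [], []
        # Slide a fixed-length window; the target is the input shifted right by 1.
        for i in range(0, len(token_ids) - max_length, stride):
            self.inputs.append(torch.tensor(token_ids[i:i + max_length]))
            self.targets.append(torch.tensor(token_ids[i + 1:i + max_length + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

token_ids = list(range(20))                         # stand-in for real token IDs
loader = DataLoader(GPTDataset(token_ids), batch_size=8, shuffle=True)
x, y = next(iter(loader))
print(x.shape, y.shape)                             # torch.Size([8, 4]) for both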
05
Token Embeddings
Project discrete token IDs into a continuous vector space. Each token becomes a learned dense vector.
// Embedding lookup table — hover cells to inspect
Each row = one token's 16-dim vector (truncated for display). Real GPT-2 uses 768 dims.
// 2D PCA projection (similarity map)
Similar tokens cluster together. Distances carry semantic meaning.
// Shape walkthrough
input batch
[8, 4]
batch × seq_len
embedding layer
Embedding(50257, 256)
vocab_size × d_model
output
[8, 4, 256]
batch × seq_len × d_model
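The shape walkthrough above as a runnable sketch. The 50257 × 256 dimensions mirror the diagram; the random batch of IDs is just for illustration.
// Sketch: token IDs → dense vectors
import torch
import torch.nn as nn

vocab_size, d_model = 50257, 256
token_embedding = nn.Embedding(vocab_size, d_model)   # learned lookup table, one row per token

inputs = torch.randint(0, vocab_size, (8, 4))         # [batch, seq_len] of token IDs
token_embeddings = token_embedding(inputs)            # each ID is replaced by its row
print(token_embeddings.shape)                         # torch.Size([8, 4, 256])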
06
Positional Embeddings
Add a learned position signal to each token vector so the transformer knows where each token sits in the sequence.
// Why position matters
❌ Without position
"cat sat on mat"
= "mat on sat cat"
identical embeddings!
✓ With position
"cat"pos=0 ≠ "cat"pos=2
order is encoded!
// Position embedding heatmap (context_length × d_model)
Each row = position 0, 1, 2... Each column = one embedding dimension. Patterns emerge during training.
// The final addition
Token Embeddings
[8, 4, 256]
what the token IS
+
Positional Embeddings
[4, 256]
where the token IS
=
Input Embeddings
[8, 4, 256]
→ feed to transformer
input_embeddings = token_embeddings + positional_embeddings
# Broadcasting: [8,4,256] + [4,256] → [8,4,256]
print(input_embeddings.shape)  # torch.Size([8, 4, 256])
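Where positional_embeddings comes from, sketched with a learned GPT-style position table (context_length = 4 to match the shapes above):
// Sketch: building the [4, 256] position table
import torch
import torch.nn as nn

context_length, d_model = 4, 256
pos_embedding = nn.Embedding(context_length, d_model)   # one learned row per position

positions = torch.arange(context_length)                # tensor([0, 1, 2, 3])
positional_embeddings = pos_embedding(positions)        # shape [4, 256]
# Adding this to token_embeddings of shape [8, 4, 256] broadcasts over the batch dim.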
// The Complete Pipeline
📝
Raw Text

Input String

Any raw text from your corpus. Could be Wikipedia, code, books — anything.

str
✂️
Tokenizer

Regex / BPE → Token IDs

Split text into subword pieces, map each to an integer using the vocabulary.

List[int]
🪟
Windowing

Sliding Window → Batches

Create overlapping (input, target) pairs with fixed context length. Wrap in a DataLoader.

[batch, seq_len]
🔢
Token Embed

nn.Embedding → Dense Vectors

Look up each token ID in a learned embedding table. Integers become floating-point vectors.

[batch, seq_len, d_model]
📍
Pos Embed

+ Positional Embeddings

Add learned position vectors so each token carries both identity and position information.

[batch, seq_len, d_model]
Transformer

Input Embeddings → Attention

The final tensor is fed into the first transformer layer. Self-attention and beyond.

[batch, seq_len, d_model]
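To close the loop, here is the whole pipeline in one small sketch. The text, vocabulary, and hyperparameters are illustrative; the shapes follow the diagram above.
// Sketch: raw text → input embeddings, end to end
import re
import torch
import torch.nn as nn

# 01-02: tokenize and build a vocabulary
text = "the cat sat on the mat and the dog sat on the rug today again"
tokens = re.findall(r'\w+|[^\w\s]', text)
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[t] for t in tokens]

# 04: sliding-window inputs (targets would be the same windows shifted by 1)
max_length, stride = 4, 2
inputs = torch.stack([torch.tensor(token_ids[i:i + max_length])
                      for i in range(0, len(token_ids) - max_length, stride)])

# 05-06: token embeddings + positional embeddings
d_model = 256
tok_emb = nn.Embedding(len(vocab), d_model)
pos_emb = nn.Embedding(max_length, d_model)
input_embeddings = tok_emb(inputs) + pos_emb(torch.arange(max_length))
print(input_embeddings.shape)   # [num_windows, 4, 256], ready for the transformer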