A visual, hands-on journey through tokenization — from splitting raw text all the way to positional embeddings ready for a transformer.
```python
import re

text = "Hello, world!"

# Pattern: runs of word chars OR any single non-word, non-whitespace
# character (i.e., punctuation)
tokens = re.findall(r'\w+|[^\w\s]', text)
# → ['Hello', ',', 'world', '!']
```
A naive `.split()` leaves punctuation glued to its neighbors: `"Hello,"` and `"world!"` each come out as a single token, so the tokenizer can never treat `world` and `world!` as the same word. The regex pattern captures words and punctuation marks as separate tokens, preserving all the information in the string. Note that it also splits contractions such as `"don't"` apart at the apostrophe, as the comparison below shows.
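To make the contrast concrete, here is a quick side-by-side of the two approaches on a sentence with a contraction (the example sentence is illustrative; expected output is shown in comments):

```python
import re

text = "Don't stop, it's fine!"

# Naive whitespace split: punctuation stays glued to words.
print(text.split())
# → ["Don't", 'stop,', "it's", 'fine!']

# Regex tokenization: punctuation becomes separate tokens,
# and contractions are split at the apostrophe.
print(re.findall(r'\w+|[^\w\s]', text))
# → ['Don', "'", 't', 'stop', ',', 'it', "'", 's', 'fine', '!']
```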
```python
# token_ids: the full encoded corpus; max_length: the context window;
# stride: how far each window slides (stride < max_length → overlap).
for i in range(0, len(token_ids) - max_length, stride):
    input_chunk = token_ids[i : i + max_length]
    # Target is the input shifted right by 1 position: at every
    # offset the model learns to predict the next token.
    target_chunk = token_ids[i + 1 : i + max_length + 1]
```
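One common way to package these chunks for training is a small PyTorch `Dataset` wrapped in a `DataLoader`. The sketch below is illustrative, not from the original text: the class name `SlidingWindowDataset` and the toy token IDs are assumptions.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SlidingWindowDataset(Dataset):
    """Illustrative sliding-window dataset over a flat list of token IDs."""

    def __init__(self, token_ids, max_length, stride):
        self.inputs, self.targets = [], []
        for i in range(0, len(token_ids) - max_length, stride):
            self.inputs.append(torch.tensor(token_ids[i : i + max_length]))
            self.targets.append(torch.tensor(token_ids[i + 1 : i + max_length + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

# Toy corpus of 100 token IDs; stride == max_length gives
# non-overlapping windows.
dataset = SlidingWindowDataset(list(range(100)), max_length=4, stride=4)
loader = DataLoader(dataset, batch_size=8, shuffle=True)
```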
```python
# Broadcasting: [8, 4, 256] + [4, 256] → [8, 4, 256]
input_embeddings = token_embeddings + positional_embeddings

print(input_embeddings.shape)  # torch.Size([8, 4, 256])
```
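For context, here is a minimal sketch of where those two tensors come from, assuming learned `nn.Embedding` tables and the shapes printed above. The 50,257-entry vocabulary is GPT-2's, used here only as an example.

```python
import torch

vocab_size, context_length, d_model = 50257, 4, 256  # example sizes

token_embedding_layer = torch.nn.Embedding(vocab_size, d_model)
pos_embedding_layer = torch.nn.Embedding(context_length, d_model)

inputs = torch.randint(0, vocab_size, (8, context_length))  # [8, 4] token IDs
token_embeddings = token_embedding_layer(inputs)            # [8, 4, 256]
positional_embeddings = pos_embedding_layer(
    torch.arange(context_length)                            # positions 0..3
)                                                           # [4, 256]

input_embeddings = token_embeddings + positional_embeddings  # → [8, 4, 256]
```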
The complete pipeline, from raw text to transformer input:

| Step | What happens | Output |
|---|---|---|
| Raw text | Any raw text from your corpus: Wikipedia, code, books, anything. | `str` |
| Tokenize | Split text into subword pieces, map each to an integer using the vocabulary. | `List[int]` |
| Batch | Create overlapping (input, target) pairs with fixed context length. Wrap in a DataLoader. | `[batch, seq_len]` |
| Token embedding | Look up each token ID in a learned embedding table. Integers become floating-point vectors. | `[batch, seq_len, d_model]` |
| Positional embedding | Add learned position vectors so each token carries both identity and position information. | `[batch, seq_len, d_model]` |
| Transformer | The final tensor is fed into the first transformer layer. Self-attention and beyond. | `[batch, seq_len, d_model]` |
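Putting it all together, here is a compact end-to-end sketch. It assumes a GPT-2 BPE tokenizer via `tiktoken` (no specific tokenizer is named above) and reuses the shapes from the table.

```python
import tiktoken
import torch

text = "Hello, world! This is a tiny corpus for the demo."

# 1. Tokenize: str → List[int]
tokenizer = tiktoken.get_encoding("gpt2")
token_ids = tokenizer.encode(text)

# 2. Batch: one (input, target) pair with context length 4
max_length = 4
inputs = torch.tensor([token_ids[:max_length]])               # [1, 4]
targets = torch.tensor([token_ids[1 : max_length + 1]])       # [1, 4]

# 3–4. Embed tokens and positions, then add
d_model = 256
tok_emb = torch.nn.Embedding(tokenizer.n_vocab, d_model)
pos_emb = torch.nn.Embedding(max_length, d_model)
x = tok_emb(inputs) + pos_emb(torch.arange(max_length))       # [1, 4, 256]
# x is now ready to feed into the first transformer layer.
```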