Introduction

The first layer of a Transformer model is the input layer, and it typically consists of two major components: tokenization and embeddings. Before the model can perform attention, reasoning, or prediction, it must first understand the input in a structured numerical form. Today, we’ll focus on the first step—tokenization.

Tokenization is the process of breaking down raw text into smaller pieces called tokens, which are then mapped to numerical IDs that the model can work with. But it’s not just simple word splitting—modern tokenization involves sophisticated algorithms like WordPiece, Byte Pair Encoding (BPE), and Unigram, each optimized for different architectures and data distributions.

Why does this matter? Because how you tokenize text affects everything downstream—from model accuracy and efficiency to how well it handles unknown words, rare characters, or multilingual input. A well-designed tokenizer helps the model learn faster and generalize better; a poor one can bottleneck performance.

In this post, we’ll dive into:

How tokenization actually works in practice
The different types of tokenization algorithms
Their pros and cons
Which models (like GPT, BERT, LLaMA, and Mistral) use which tokenizers

By the end, you'll have a clear understanding of how text becomes tokens, and why that transformation is critical to every NLP model you’ve ever used.

General Steps in Tokenization

Before a Transformer model can even begin to process input, the text undergoes several preprocessing steps to ensure it’s in a clean and consistent format. Let’s walk through each stage involved in preparing text for tokenization:

1. Corpus Collection

The first step in building a tokenizer is to gather a large text corpus. This corpus often includes:

Web pages
Books
Articles
User-generated content
This diverse data helps the tokenizer learn a vocabulary that covers a wide range of language usage.

2. Normalization

Once text is collected, it must be normalized. This involves cleaning and standardising the input:

Lower-casing text (e.g., "I Know" → "i know")
Removing accents (e.g., "résumé" → "resume")
Collapsing multiple whitespaces (e.g., "I know" → "I know")
Stripping HTML tags
Handling punctuation or special characters consistently

After normalization:
"Hmm.., I know I Don't know" → "hmm.., i know i don't know"

3. Pre-tokenization

Traditionally, splitting text by whitespace was considered tokenization. However, in modern NLP pipelines, this step is treated as pre-tokenization when used with subword tokenizers.

Pre-tokenization splits text into rough word-like units:

Pre-tokenized:
"hmm.., i know i don't know" → ["hmm..,", "i", "know", "i", "don't", "know"]

This prepares the text for the next, more sophisticated step.

4. Subword Tokenization

Subword tokenization is where modern tokenizers (like BPE, WordPiece, Unigram) come into play. These algorithms:

Break tokens into smaller sub-units based on frequency in the corpus
Handle unknown or rare words more efficiently
Reduce the vocabulary size significantly

After subword tokenization:
["hmm..,", "I", "know", "i", "don't", "know"] → ["hmm..,", "I", "know", "i", "do", "#n't", "know"]

The final output is a sequence of subword units that can be mapped to numerical IDs and passed to the embedding layer.

5. Vocabulary Learning (Training)

Once the subword units are generated, the tokenizer learns a vocabulary from the training corpus. This process involves:

Counting token/subword frequencies
Applying compression or segmentation algorithms
Creating a fixed-size vocabulary used during model training/inference

How is Vocabulary Learned in Tokenization?

After collecting a large text corpus—consisting of real-world sentences from websites, books, or conversations—the first step is preprocessing. This includes normalizing the text: lowercasing, removing extra spaces, standardizing punctuation, and cleaning up special characters.

Once the text is clean, the next critical step is to learn the vocabulary.

What Does "Learning the Vocabulary" Mean?

At this stage, the system scans through the entire preprocessed corpus to identify frequently occurring words and subword patterns. These patterns are not hardcoded or predefined—they emerge naturally from the data, based on how often certain fragments appear in the text.
This fragmentation can happen in different ways, such as:

Splitting by whitespace to treat full words as vocabulary items (e.g., "I", "enjoyed", "movie")
Splitting by characters to form vocabulary at a more granular level (e.g., "e", "n", "j", "o", "y")

The vocabulary is constructed from these frequent units so that any future sentence can be tokenized using combinations of these known pieces.

Example:

Let’s take the sentence:
"I enjoyed the movie"

After preprocessing (e.g., lowercasing and standardizing), it might become:
"i enjoyed the movie"

Now, using whitespace-based splitting, the system treats each word as a candidate vocabulary item. As it scans the entire corpus, it identifies which whole words appear most frequently. For example:

"i" is a very common pronoun
"enjoyed" appears frequently in reviews or narratives
"the" is one of the most common English words
"movie" is frequently used in entertainment-related texts

These full words are added to the vocabulary set V.

So, when a new sentence like "They watched the movie" is encountered, the tokenizer simply splits it by whitespace:

"they watched the movie" → ["they", "watched", "the", "movie"]

If all these words exist in V, the model can directly map them to token IDs.
Otherwise, unknown words (e.g., "watched") may be replaced with a special [UNK] token—unless sub-word strategies are used instead.

Once we understand that vocabulary learning is central to how Transformers read and process text, the next step is to explore the different tokenization techniques used to build that vocabulary. These techniques define how the input text is split and what kind of units (tokens) are included in the vocabulary set.

Each technique approaches tokenization differently, with its own way of breaking down sentences, handling unknown words, and optimizing vocabulary size.

1. Whitespace Tokenization

Whitespace tokenization is the most basic and intuitive method of breaking text into tokens. As the name suggests, this technique simply splits text based on spaces. Each space-separated segment is treated as a distinct token, and the resulting vocabulary is composed of entire words exactly as they appear in the text.

For example, the sentence "They watched the movie" would be tokenized into ["They", "watched", "the", "movie"]. This method is straightforward, requires minimal preprocessing, and is computationally efficient.

Some Questions about Whitespace Tokenization ?

A.Is splitting the input text into words using whitespace a good approach?

It depends on the use case, but generally, it's too simplistic for modern NLP models. Splitting by whitespace works for basic applications and for languages like English that use clear word boundaries. However, this method fails to handle rare words, typos, and word variations, and it completely breaks down for languages that don’t use spaces at all, like Chinese or Japanese. Moreover, it treats every new word form as a new token, which can lead to an explosion in vocabulary size and poor generalization to unseen inputs.

B.Do we treat the words "love" and "loved" as separate tokens?

Yes—if using whitespace tokenization. But this is often not ideal. In whitespace-based tokenization, each unique word form is treated as a separate token. So "love" and "loved" would be distinct entries in the vocabulary. This leads to data sparsity, where the model must learn similar representations for variations of the same root word. Subword-based tokenization helps solve this by splitting "love" into reusable components like "love" and "#ed", improving efficiency and parameter sharing across related tokens.

C.What about languages like Japanese, which do not use any word delimiters like space?

Whitespace-based tokenization doesn’t work for such languages. Specialized techniques are needed. Languages like Japanese, Chinese, and Thai do not use spaces to separate words. In these cases, whitespace tokenization fails entirely. Instead, character-based or subword-based approaches are preferred. These methods can break down text into linguistically meaningful units (like radicals or characters) or into statistically frequent subword chunks, without relying on explicit spacing.

D.Why not treat each individual character in a language as a vocabulary?

It's possible—and sometimes used—but comes with trade-offs. Which is our next topic.

2.Character-Level Tokenization

Character-level tokenization is a simple yet powerful technique where the text is broken down into individual characters, rather than words or subwords. In this method, the vocabulary consists of all possible characters in the corpus, including letters, digits, punctuation marks, and special symbols (e.g., "a", "b", "1", ".", "#").

One of the biggest advantages of this approach is that it creates a very small and universal vocabulary. Since every possible character is part of the vocabulary, the model never encounters unknown tokens, making this method especially robust for handling typos, rare words, or even code and emojis.

However, character-level tokenization comes with its challenges. Because each word is split into many small units, the resulting input sequences become significantly longer. For example, the word "movie" would be tokenized as ["m", "o", "v", "i", "e"]. As a result, the model must process more tokens per sentence and work harder to capture meaningful patterns and long-range dependencies. This often leads to slower training and higher computational cost, and can make learning semantic relationships more difficult compared to word or subword-level tokenization.

Despite these limitations, character-level tokenization can be useful in specific applications like morphologically rich languages, noisy text, or low-resource settings, where other tokenization methods may fail.

Challenges in Building a Vocabulary for Tokenization

Designing an effective vocabulary is not as straightforward as it might seem. It involves a delicate balance between coverage, efficiency, and model performance. Here are some of the major challenges faced when building vocabularies for Transformer models:

A. What Should Be the Size of the Vocabulary?

Choosing the right vocabulary size is a trade-off. A larger vocabulary gives the model access to more complete word representations, reducing the number of unknown or broken-down tokens. However, it comes at a cost:

A larger vocabulary means a bigger embedding matrix, which increases memory usage.
It also adds computational overhead during the softmax operation in the output layer.

On the other hand, a smaller vocabulary saves space and speeds up computation but increases the risk of splitting words too aggressively or encountering unknown words.

Key Question:
What is the optimal vocabulary size that balances memory, speed, and accuracy?

B. Out-of-Vocabulary (OOV) Words

When the vocabulary is restricted—say, from 250,000 tokens down to 50,000—there will inevitably be words in the input text that are not in the vocabulary. These are known as out-of-vocabulary (OOV) words.

If the tokenizer cannot recognize a word, it may replace it with a generic [UNK] (unknown) token. This leads to loss of information, especially in tasks like sentiment analysis or translation where specific words carry important meaning.

Solution:
Modern models use subword tokenization, which breaks unknown words into smaller, known parts, ensuring no word is truly “unknown.”

C. Handling Misspelled Words in the Corpus

Text corpora are often scraped from the web, and naturally, they include spelling mistakes, typos, or non-standard usage. If these mistakes are treated as valid vocabulary entries, they unnecessarily increase the vocabulary size and reduce its quality.

For example, treating "enjooyed" (a misspelled form of “enjoyed”) as a unique token is unhelpful and inefficient.

Solution:
Preprocessing should include spell correction or normalization steps. Additionally, tokenizers that operate at subword or character level are better equipped to handle such noisy inputs.

D. The Open Vocabulary Problem

Languages are constantly evolving, and new words are coined regularly—especially in agglutinative languages (like Turkish or Finnish), where new words can be formed by combining roots and suffixes.

This means that in theory, the number of possible words in a language is infinite. A fixed vocabulary can never fully cover this. This is known as the open vocabulary problem.

Solution:
Subword-based approaches can adapt to this scenario by constructing new words from known pieces. This makes models more flexible and better suited for tasks like machine translation, where word diversity is high.

Final Thought

The challenges in vocabulary design underscore why tokenization is such a critical part of NLP pipelines. The better we handle these edge cases, the more robust and scalable our language models become.

Now, let’s explore subword tokenization — a powerful technique designed to address these vocabulary challenges effectively.

Subword Tokenization: The Best of Both Worlds

Subword tokenization strikes a powerful balance between word-level and character-level tokenization. Instead of treating every word as a whole or breaking everything down into single characters, subword tokenizers split text into frequent and meaningful fragments, known as subwords.

The vocabulary in this method is moderately sized, carefully built by scanning the entire corpus for frequently occurring patterns. Common words are preserved as full tokens (e.g., "know", "movie"), while rare or complex words are broken into smaller, reusable subunits (e.g., "don’t" becomes "do" and "n’t"). This allows the model to represent virtually any word, including those it has never seen before, without resorting to an [UNK] token.

For example, in the sentence:

"Hmm.. I know, I don't know"

The subword tokenizer may rely on a vocabulary like:
V = {Hmm.., I, know, do, n't}
and tokenize the sentence as:
[Hmm.., I, know, I, do, n't, know]

Even though "don't" wasn’t stored as a single token, the tokenizer successfully reconstructs it using known parts. This enables flexibility, compact vocabulary, and robust handling of unseen words—one reason why subword tokenization is the preferred method in modern LLMs like BERT, GPT, and LLaMA.

Subword tokenization is not a one-size-fits-all solution. While its core goal remains the same—balancing vocabulary size with coverage and flexibility—different algorithms approach this goal in different ways. Each method has its own strategy for identifying meaningful subword units based on frequency patterns, data statistics, or probabilistic modeling.

In modern NLP, three subword tokenization techniques stand out as the most widely used:

Byte Pair Encoding (BPE)
WordPiece Tokenization
SentencePiece (Unigram) Tokenization

In the sections that follow, we'll explore how each of these methods works and why they’re favored in popular models like GPT, BERT, and T5.

Byte Pair Encoding (BPE)

Byte Pair Encoding (BPE) is one of the earliest and most popular subword tokenization algorithms used in modern NLP. Originally developed for data compression, BPE was adapted for tokenization to help models handle rare and unseen words more effectively without bloating the vocabulary.

How BPE Works

Initialize the Vocabulary: Each word is split into characters, and each word ends with a special end-of-word symbol like </w>.
Count Pairs: Count all adjacent symbol pairs across the vocabulary.
Merge Most Frequent Pair: Find the most frequent pair and merge it into a new symbol.
Repeat: Continue merging for a fixed number of iterations or until no more merges are possible.

Byte Pair Encoding (BPE) — Step-by-Step Algorithm

Here's how the BPE algorithm works in practice:

Step-by-Step Breakdown:

Initialize the Vocabulary
Start with a dictionary where each key is a word and the value is its frequency (how often it appears in the corpus).
Mark Word Boundaries
Append a special token </w> at the end of each word to indicate word boundaries. This helps distinguish between subwords across different words (e.g., low vs. lower).
Set Merge Count
Decide how many merges to perform — this number is a hyperparameter (num_merges) that controls the final vocabulary size.
Build Symbol Pairs Table
Break each word into individual characters and track the frequency of every adjacent symbol pair (like l o, o w, etc.).
Find Most Frequent Pair
Identify the symbol pair that appears most frequently across all words in the vocabulary.
Merge the Pair
Replace every occurrence of that symbol pair with a new merged token (e.g., l o becomes lo).
Repeat
Go back to Step 4 and repeat the process for the number of times specified in the merge count.

Python Implementation

import re, collections

def get_stats(vocab):
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

# Initial vocabulary: words split into characters
vocab = {
    'l o w </w>': 5,
    'l o w e r </w>': 2,
    'n e w e s t </w>': 6,
    'w i d e s t </w>': 3
}

num_merges = 10

for i in range(num_merges):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    print(f"Step {i+1}: Merging pair {best}")
    vocab = merge_vocab(best, vocab)

Sample Output

Step 1: Merging pair ('e', 's')
Step 2: Merging pair ('es', 't')
Step 3: Merging pair ('n', 'e')
...

To understand BPE in action, let’s use the following sentence as our example:

"knowing the name of something is different from knowing something. knowing something about everything isn't bad"

Step 1: Count Word Frequencies

The very first step is to tokenize the sentence into words and count their occurrences. Each word is represented as a sequence of characters, with a special </w> symbol added at the end to mark the word boundary.

Word Frequency Table

Word	Frequency
`k n o w i n g </w>`	3
`t h e </w>`	1
`n a m e </w>`	1
`o f </w>`	1
`s o m e t h i n g </w>`	2
`i s </w>`	1
`d i f f e r e n t </w>`	1
`f r o m </w>`	1
`s o m e t h i n g . </w>`	1
`a b o u t </w>`	1
`e v e r y t h i n g </w>`	1
`i s n ' t </w>`	1
`b a d </w>`	1

Step 2: Compute Initial Token Frequencies

After splitting each word into characters, we calculate how frequently each character (symbol) appears in the entire vocabulary. This helps us understand which character pairs are most common — the core idea behind Byte Pair Encoding.

Character Frequency Table

Character	Frequency	Explanation
`k`	3	Appears in 3 instances of "knowing"
`n`	13	Appears in "knowing", "name", "something", "isn't", "everything", etc.
`o`	9	Found in "knowing", "of", "something", etc.
`i`	10	Common in "knowing", "is", "different", etc.
`g`	7	Seen in "knowing", "something", "everything"
`</w>`	16	One per word (13 words) + punctuation (like `"something.</w>"`)
...	...	Other characters similarly counted

Initial Vocabulary Size:
Total number of unique individual symbols (characters) = 22

Step 3: Count Frequency of Symbol Pairs (Byte-Pairs)

After building the initial vocabulary of characters, the next step in the BPE algorithm is to count how often adjacent character pairs (or byte-pairs) occur across all words. These are candidates for merging.

Byte-Pair Frequency Table

Symbol Pair	Frequency	Explanation
(`k`, `n`)	3	From "knowing" ×3
(`n`, `o`)	3	From "knowing", "something", "not"
(`o`, `w`)	3	From "knowing", "something", etc.
(`w`, `i`)	3	Appears in “knowing”
(`i`, `n`)	7	Appears in many words like “knowing”, “isn’t”, “something”
(`n`, `g`)	7	Often ends “-ing”
(`g`, `</w>`)	6	Word-ending in “knowing”, “something”, etc.
(`t`, `h`)	5	From “the”, “something”, “everything”
...	...	And so on...

The most frequent byte-pair here is (i, n) with a count of 7. This means "in" appears a lot and should be merged into a single token in the next step.

4.Merge the Most Frequent Byte-Pair `('i', 'n')`

The most frequent pair was 'i' and 'n' with a frequency of 7. So we merge all instances of 'i' followed by 'n' into a single token 'in'.

Result After First Merge

Updated Words

All instances of 'i' + 'n' are replaced with 'in':

Word Before	Word After
`k n o w i n g </w>`	`k n o w in g </w>`
`s o m e t h i n g </w>`	`s o m e t h in g </w>`
`e v e r y t h i n g </w>`	`e v e r y t h in g </w>`
`i s n ' t </w>`	`i s in ' t </w>`
...	...

Vocabulary Update

Token	Frequency
`in`	7
`i`	10 - 7 = 3
`n`	13 - 7 = 6
(others remain unchanged)

Table for Blog (Post-Merge State)

Word List After 1st Merge

Word	Frequency
`k n o w in g </w>`	3
`t h e </w>`	1
`n a m e </w>`	1
`o f </w>`	1
`s o m e t h in g </w>`	2
`i s </w>`	1
`d i f f e r e n t </w>`	1
`f r o m </w>`	1
`s o m e t h in g . </w>`	1
`a b o u t </w>`	1
`e v e r y t h in g </w>`	1
`i s in ' t </w>`	1
`b a d </w>`	1

Repeat these steps until either the maximum number of merges is reached or there are no more frequent pairs left to merge.

After 45 merges this is what the vocabulary looks like.

Tokens
'k'
'n'
'o'
'i'
'</w>'

….

…
'in'
'ing'

……

……
'knowing</w>'
'the</w>'
'name</w>'
'of</w>'
'something</w>'
'is</w>'
'different</w>'
'from</w>'
'something.'</w>'
'about</w>'
'everyth'
'isn't</w>'
'had</w>'

2.WordPiece Tokenization: Smarter Subwords for NLP

After Byte Pair Encoding (BPE), another widely used subword tokenization algorithm is WordPiece — the one originally used by BERT and other Transformer-based models from Google.

While WordPiece shares some similarity with BPE, it introduces a few key differences that make it more linguistically and statistically robust.

It breaks down words into the most likely combination of known subword units — even if the full word isn’t in the vocabulary.

How WordPiece Tokenization works:

In Byte Pair Encoding (BPE), we aim to merge the pair of tokens that occurs most frequently in the current vocabulary. This iterative process helps in building subword units that are common in the text corpus, making tokenization more efficient.

But what happens when multiple token pairs occur with the same frequency?

The Tie-Breaker Problem

Let’s consider this example from the table:

Pair ('i', 'n') occurs 7 times
Pair ('n', 'g') also occurs 7 times

So, which pair should we merge first?

The Solution: Frequency-Aware Scoring

To resolve ties, BPE introduces a scoring function that takes into account not just the frequency of the pair, but also the individual frequencies of the tokens involved.

The score is calculated as:

$$\text{score} = \frac{\text{count}(\alpha, \beta)}{\text{count}(\alpha) \cdot \text{count}(\beta)}$$

Where:

$$\begin{aligned} \alpha, \beta & \quad \text{are the tokens in the pair} \\ \text{count}(\alpha, \beta) & \quad \text{is the frequency of the token pair} \\ \text{count}(\alpha),\ \text{count}(\beta) & \quad \text{are the frequencies of the individual tokens} \end{aligned}$$

Why This Works

This method favors merging rare tokens, under the assumption that rare token pairs form more meaningful new units. If both tokens in a pair are common, their product count(α)⋅count(β)\text{count}(\alpha) \cdot \text{count}(\beta)count(α)⋅count(β) will be high, which lowers the score.

Hence, we select the pair with the highest score—which often means the most contextually relevant and least redundant merge.

Example Table

Token Pair	Frequency
('i', 'n')	7
('n', 'g')	7
('t', 'h')	5
('k', 'n')	3
('n', 'o')	3
...	...

Even though ('i', 'n') and ('n', 'g') have the same frequency, the one with the lower individual token frequencies will be prioritized.

Example Table from Corpus

Here is a sample score table from a BPE step:

Token Pair	Pair Freq	Freq(α)	Freq(β)	Score
('k', 'n')	3	3	13	0.076
('n', 'o')	3	13	9	0.020
('o', 'w')	3	9	3	0.111
('w', 'i')	3	–	–	–
('i', 'n')	7	10	13	0.050
('n', 'g')	7	13	7	0.076
('g', '.')	1	–	–	–
('t', 'h')	5	8	5	0.125
('h', 'e')	1	–	–	–
('e', '</w>')	2	–	–	–
('a', 'd')	1	3	2	0.16

Merge Decision: The pair ('a', 'd') has the highest score (0.16) and is chosen for the next merge.

3.SentencePiece Tokenizer: A Modern Subword Tokenizer

Tokenization is a foundational step in natural language processing (NLP), where text is broken into smaller units—typically words or subwords. Traditional approaches rely on whitespace and language-specific rules, but they struggle with multilingual texts, typos, or unseen words.

SentencePiece offers a flexible and language-independent solution that works directly on raw text, supporting both BPE and Unigram algorithms.

Motivation: Why SentencePiece?

Unlike standard tokenizers that split on spaces, SentencePiece treats input as a raw character sequence. This lets it work for languages without spaces (e.g., Japanese, Chinese), and for tasks where space is unreliable (e.g., OCR, code, user queries).

It also supports probabilistic tokenization, helping generate multiple tokenization paths—a huge advantage in training robust models.

A Word Can Have Many Subword Segmentations

Let’s explore how SentencePiece (or BPE) handles the word "hello" using a fixed vocabulary.

Assume our vocabulary is:

$$\mathcal{V} = \{ h, e, l, o, he, el, lo, ll, hell \}$$

There are multiple valid segmentations for the word "hello":

$$\begin{aligned} \mathbf{x_1} &= \text{he},\ \text{ll},\ \text{o} \\ \mathbf{x_2} &= \text{h},\ \text{el},\ \text{lo} \\ \mathbf{x_3} &= \text{he},\ \text{l},\ \text{lo} \\ \mathbf{x_4} &= \text{hell},\ \text{o} \end{aligned}$$

Even though all are valid, BPE is greedy and deterministic, so it will pick just one—typically the leftmost or longest match first.
Output (deterministic):

he, l, lo

Greedy vs. Probabilistic

SentencePiece can go beyond greedy selection. In probabilistic mode, it samples among possible subword segmentations.

Greedy:

Always returns the same subword split.
Fast, reproducible.
Good for inference.

Probabilistic:

Returns different valid segmentations each time.
Useful for data augmentation during training.
Enabled via BPE-Dropout or Unigram LM sampling.

Probabilistic Tokenization Objective

The goal is to find the best segmentation x* from all possible segmentations { x1, x2, ..., xk } that maximizes the likelihood of generating the observed word X:

$$\mathbf{x^*} = \arg\max_{\mathbf{x}} \Pr(\mathbf{x} \mid X)$$

Where:

X: the word (i.e., a sequence of characters)
x: a possible segmentation of X into subwords

The tokenizer models this process as a Hidden Markov Model (HMM), where the observed word X can correspond to many possible hidden tokenization paths.

SentencePiece’s Unigram Language Model is particularly powerful in capturing this uncertainty during training.

Explanation:

Let x denote a subword sequence of length n:

$$x = (x_1, x_2, \ldots, x_n)$$

The probability of this subword sequence under a Unigram Language Model is computed as the product of individual subword probabilities:

$$P(x) = \prod_{i=1}^{n} P(x_i)$$

And since it's a probability distribution, it must satisfy:

$$\sum_{x \in V} P(x) = 1$$

Objective: Find the Most Likely Segmentation

Given an input sequence X, the objective is to find the best subword segmentation x* from all possible segmentations S(X) that maximizes the sequence probability:

$$x^* = \arg\max_{x \in S(X)} P(x)$$

This is a classic maximum likelihood problem. To solve it efficiently, SentencePiece uses Viterbi decoding to select the best possible path from the segmentation space.

Dataset-Wide Likelihood Function

To train the model, we want to maximize the log-likelihood over an entire dataset D. That is:

$$L = \sum_{s=1}^{|D|} \log P(X_s)$$

But since each Xₛ has many valid subword segmentations, we marginalize over all of them:

$$L = \sum_{s=1}^{|D|} \log \left( \sum_{x \in S(X_s)} P(x) \right)$$

This introduces hidden variables—the true subword paths that generated each observed word are unknown.

EM Algorithm to the Rescue

Since we’re working with hidden subword paths, SentencePiece applies the Expectation-Maximization (EM) algorithm:

E-Step: Estimate the contribution (responsibility) of each possible subword segmentation.
M-Step: Update the model to better fit the expected probabilities.

This iterative process continues until the model converges, leading to highly optimized subword vocabularies.

Example

Word	Frequency
'knowing'	3
'the'	1
'name'	1
'of'	1
'something'	2
'is'	1
'different'	1
'from'	1
'something.'	1
'about'	1
'everything'	1
"isn't"	1
'bad'	1

Let’s say we want to tokenize the word:

$$X = \text{"knowing"}$$

Now, suppose the model provides the following segmentation candidates:

$$S(X) = \left\{ \{ \text{"k"}, \text{"now"}, \text{"ing"} \},\ \{ \text{"know"}, \text{"ing"} \},\ \{ \text{"knowing"} \} \right\}$$

These are three different ways the word "knowing" could be split into subwords.

Segment Probabilities from Vocabulary

Based on the model’s learned vocabulary we can assign probabilities to each token. The probability of a full segmentation is the product of its subwords' probabilities:

Option 1:

$$x_1 = \{ \text{"k"}, \text{"now"}, \text{"ing"} \}$$

$$P(x_1) = P(\text{"k"}) \times P(\text{"now"}) \times P(\text{"ing"}) = \frac{3}{16} \times \frac{3}{16} \times \frac{7}{16} = \frac{63}{4096}$$

Option 2:

$$x_2 = \{ \text{"know"}, \text{"ing"} \}$$

$$P(x_1) = P(\text{"k"}) \times P(\text{"now"}) \times P(\text{"ing"}) = \frac{3}{16} \times \frac{3}{16} \times \frac{7}{16} = \frac{63}{4096}$$

Option 3:

$$x_3 = \{ \text{"knowing"} \}$$

$$P(x_3) = P(\text{"knowing"}) = \frac{768}{4096}$$

Best Segmentation (Maximum Likelihood)

Comparing the three probabilities:

x₁: 63 / 4096
x₂: 336 / 4096
x₃: 768 / 4096 ← highest

Thus, the model selects:

$$x^* = \arg\max_{x \in S(X)} P(x) = x_3$$

Which means SentencePiece will tokenize "knowing" as a single subword "knowing" based on the learned probabilities.

Algorithm

The algorithm works using Expectation-Maximization (EM) steps, combined with a vocabulary pruning step to iteratively reduce subwords that contribute the least to the model likelihood.

Step-by-Step Algorithm

Initialize Vocabulary
Construct a reasonably large seed vocabulary using methods like Byte-Pair Encoding (BPE) or Extended Suffix Array. This gives you an overcomplete set of subwords to start with.
E-Step (Expectation Step)
Estimate the probability of each subword/token in the current vocabulary.
This is typically done using frequency counts from the training corpus.
M-Step (Maximization Step)
Use the Viterbi algorithm to segment the corpus.
Find the best subword sequence for each word that maximizes the (log) likelihood under the current subword probabilities.
Compute Likelihoods
For each subword, compute its contribution to the total corpus likelihood based on how often it appears in the optimal segmentations.
Prune Vocabulary
Identify and remove the top x% of subwords with the smallest likelihood contribution.
This reduces the vocabulary size gradually while preserving the most informative subwords.
Repeat
Go back to Step 2 and repeat the process until the vocabulary size reaches the desired target.

Pseudocode

initialize_vocab = build_seed_vocab(method="BPE", size=large)
while len(initialize_vocab) > target_vocab_size:
    # E-step
    probabilities = estimate_token_probabilities(initialize_vocab, corpus)

    # M-step
    segmentations = viterbi_segment(corpus, probabilities)

    # Compute likelihoods
    likelihoods = compute_likelihoods(segmentations)

    # Prune least useful subwords
    initialize_vocab = prune_vocab(initialize_vocab, likelihoods, top_x_percent_to_remove)

Tokenization Techniques Comparison Table

Tokenizer Type	Used In Models	Pros	Cons
Word-level	None in modern LLMs	- Simple and intuitive- Easy to implement	- Cannot handle OOV words- Large vocab size
Character-level	Some RNNs, early models	- Tiny vocab- No OOV- Handles misspellings	- Very long sequences- Hard to learn meaning
BPE	GPT-2, GPT-3, RoBERTa, LLaMA	- Efficient- Reuses frequent subwords- No `[UNK]` token	- Greedy merges- Not probabilistic
WordPiece	BERT, DistilBERT	- Good balance of vocab vs coverage- Handles rare words well	- Slightly slower due to scoring- Needs pretraining
SentencePiece	T5, ALBERT, mT5	- No whitespace dependence- Supports multilingual corpora- Probabilistic	- Slightly complex training- May split too aggressively

Quick Notes:

Subword tokenizers like BPE, WordPiece, and SentencePiece are preferred in modern Transformers due to their balance of efficiency and generalization.
Pure word-level and character-level tokenization are mostly outdated, though they still have niche applications.

Command Palette

Introduction

General Steps in Tokenization

1. Corpus Collection

2. Normalization

3. Pre-tokenization

4. Subword Tokenization

5. Vocabulary Learning (Training)

How is Vocabulary Learned in Tokenization?

What Does "Learning the Vocabulary" Mean?

Example:

1. Whitespace Tokenization

Some Questions about Whitespace Tokenization ?

A.Is splitting the input text into words using whitespace a good approach?

B.Do we treat the words "love" and "loved" as separate tokens?

C.What about languages like Japanese, which do not use any word delimiters like space?

D.Why not treat each individual character in a language as a vocabulary?

2.Character-Level Tokenization

Challenges in Building a Vocabulary for Tokenization

A. What Should Be the Size of the Vocabulary?

B. Out-of-Vocabulary (OOV) Words

C. Handling Misspelled Words in the Corpus

D. The Open Vocabulary Problem

Final Thought

Subword Tokenization: The Best of Both Worlds

Byte Pair Encoding (BPE)

How BPE Works

Byte Pair Encoding (BPE) — Step-by-Step Algorithm

Step-by-Step Breakdown:

Python Implementation

Step 1: Count Word Frequencies

Word Frequency Table

Step 2: Compute Initial Token Frequencies

Character Frequency Table

Step 3: Count Frequency of Symbol Pairs (Byte-Pairs)

Byte-Pair Frequency Table

4.Merge the Most Frequent Byte-Pair ('i', 'n')

Result After First Merge

Updated Words

Vocabulary Update

Table for Blog (Post-Merge State)

Word List After 1st Merge

2.WordPiece Tokenization: Smarter Subwords for NLP

How WordPiece Tokenization works:

The Tie-Breaker Problem

The Solution: Frequency-Aware Scoring

Example Table

Example Table from Corpus

3.SentencePiece Tokenizer: A Modern Subword Tokenizer

Motivation: Why SentencePiece?

A Word Can Have Many Subword Segmentations

Greedy vs. Probabilistic

Greedy:

Probabilistic:

Probabilistic Tokenization Objective

Explanation:

Objective: Find the Most Likely Segmentation

Dataset-Wide Likelihood Function

EM Algorithm to the Rescue

Example

Segment Probabilities from Vocabulary

Option 1:

Option 3:

Best Segmentation (Maximum Likelihood)

Algorithm

Step-by-Step Algorithm

Pseudocode

Tokenization Techniques Comparison Table

Comments

4.Merge the Most Frequent Byte-Pair `('i', 'n')`