<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[What Happens Before Attention?]]></title><description><![CDATA[What Happens Before Attention?]]></description><link>https://what-happens-before-attention.hashnode.dev</link><generator>RSS for Node</generator><lastBuildDate>Mon, 22 Jun 2026 16:47:12 GMT</lastBuildDate><atom:link href="https://what-happens-before-attention.hashnode.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Understanding the Transformer Input Layer: Tokenization Explained.]]></title><description><![CDATA[Introduction
The first layer of a Transformer model is the input layer, and it typically consists of two major components: tokenization and embeddings. Before the model can perform attention, reasoning, or prediction, it must first understand the inp...]]></description><link>https://what-happens-before-attention.hashnode.dev/understanding-the-transformer-input-layer-tokenization-explained</link><guid isPermaLink="true">https://what-happens-before-attention.hashnode.dev/understanding-the-transformer-input-layer-tokenization-explained</guid><category><![CDATA[Tokenization]]></category><category><![CDATA[transformers]]></category><category><![CDATA[nlp]]></category><dc:creator><![CDATA[Tanayendu Bari]]></dc:creator><pubDate>Thu, 19 Jun 2025 00:15:16 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/vruAZdZzQR0/upload/10101cf9bebe5f420b552665f3f4de39.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction"><em>Introduction</em></h1>
<p>The first layer of a Transformer model is the <strong>input layer</strong>, and it typically consists of <strong>two major components</strong>: <strong>tokenization</strong> and <strong>embeddings</strong>. Before the model can perform attention, reasoning, or prediction, it must first understand the input in a structured numerical form. Today, we’ll focus on the first step—<strong>tokenization</strong>.</p>
<p>Tokenization is the process of breaking down raw text into smaller pieces called <strong>tokens</strong>, which are then mapped to numerical IDs that the model can work with. But it’s not just simple word splitting—modern tokenization involves sophisticated algorithms like <strong>WordPiece</strong>, <strong>Byte Pair Encoding (BPE)</strong>, and <strong>Unigram</strong>, each optimized for different architectures and data distributions.</p>
<p><img src="https://imgs.search.brave.com/9pDLn_K1n6Er5qnKy7OAN5Q-E7S2TC205CyZ2fEL44E/rs:fit:860:0:0:0/g:ce/aHR0cHM6Ly9tZWRp/YS5iZWVoaWl2LmNv/bS9jZG4tY2dpL2lt/YWdlL2ZpdD1zY2Fs/ZS1kb3duLGZvcm1h/dD1hdXRvLG9uZXJy/b3I9cmVkaXJlY3Qs/cXVhbGl0eT04MC91/cGxvYWRzL2Fzc2V0/L2ZpbGUvMzk2MmE1/MDgtZDgyMy00ZjMy/LThjNTYtODZjMzBl/MDdmMjRkL3dlX2xv/dmVfbmxwX18xXy5w/bmc" alt /></p>
<p>Why does this matter? Because <strong>how you tokenize text affects everything downstream</strong>—from model accuracy and efficiency to how well it handles unknown words, rare characters, or multilingual input. A well-designed tokenizer helps the model learn faster and generalize better; a poor one can bottleneck performance.</p>
<p>In this post, we’ll dive into:</p>
<ul>
<li><p>How tokenization actually works in practice</p>
</li>
<li><p>The different types of tokenization algorithms</p>
</li>
<li><p>Their pros and cons</p>
</li>
<li><p>Which models (like GPT, BERT, LLaMA, and Mistral) use which tokenizers</p>
</li>
</ul>
<p>By the end, you'll have a clear understanding of how text becomes tokens, and why that transformation is critical to every NLP model you’ve ever used.</p>
<h1 id="heading-general-steps-in-tokenization"><em>General Steps in Tokenization</em></h1>
<p>Before a Transformer model can even begin to process input, the text undergoes several <strong>preprocessing steps</strong> to ensure it’s in a clean and consistent format. Let’s walk through each stage involved in preparing text for tokenization:</p>
<hr />
<h3 id="heading-1-corpus-collection">1. <strong>Corpus Collection</strong></h3>
<p>The first step in building a tokenizer is to gather a <strong>large text corpus</strong>. This corpus often includes:</p>
<ul>
<li><p>Web pages</p>
</li>
<li><p>Books</p>
</li>
<li><p>Articles</p>
</li>
<li><p>User-generated content<br />  This diverse data helps the tokenizer learn a vocabulary that covers a wide range of language usage.</p>
</li>
</ul>
<hr />
<h3 id="heading-2-normalization">2. <strong>Normalization</strong></h3>
<p>Once text is collected, it must be normalized. This involves cleaning and standardising the input:</p>
<ul>
<li><p><strong>Lower-casing text</strong> (e.g., "I Know" → "i know")</p>
</li>
<li><p><strong>Removing accents</strong> (e.g., "résumé" → "resume")</p>
</li>
<li><p><strong>Collapsing repeated whitespace</strong> (e.g., "I   know" → "I know")</p>
</li>
<li><p><strong>Stripping HTML tags</strong></p>
</li>
<li><p><strong>Handling punctuation or special characters consistently</strong></p>
</li>
</ul>
<blockquote>
<p>After normalization:<br /><code>"Hmm.., I know I Don't know"</code> → <code>"hmm.., i know i don't know"</code></p>
</blockquote>
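<p>As a rough illustration, the first three normalization steps can be sketched with the Python standard library alone. The function name <code>normalize</code> is an assumption made for this example, and real tokenizers differ in which of these steps they apply (some keep case and accents untouched):</p>

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, strip accents, and collapse whitespace (illustrative only)."""
    text = text.lower()
    # NFD splits an accented letter into a base letter plus combining marks,
    # so dropping category "Mn" (nonspacing marks) removes the accents.
    decomposed = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Replace any run of whitespace with a single space.
    return re.sub(r"\s+", " ", text).strip()


print(normalize("Hmm..,  I know I Don't know"))  # hmm.., i know i don't know
print(normalize("Résumé"))                       # resume
```

<p>HTML stripping and punctuation handling are left out of this sketch because they depend heavily on the corpus.</p>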
<hr />
<h3 id="heading-3-pre-tokenization">3. <strong>Pre-tokenization</strong></h3>
<p>Traditionally, splitting text by whitespace was considered tokenization. However, in modern NLP pipelines, this step is treated as <strong>pre-tokenization</strong> when used with subword tokenizers.</p>
<ul>
<li><strong>Pre-tokenization</strong> splits text into rough word-like units:</li>
</ul>
<blockquote>
<p>Pre-tokenized:<br /><code>"hmm.., i know i don't know"</code> → <code>["hmm..,", "i", "know", "i", "don't", "know"]</code></p>
</blockquote>
<p>This prepares the text for the next, more sophisticated step.</p>
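<p>For whitespace-based pre-tokenization, a minimal sketch needs nothing more than <code>str.split()</code>. The helper name <code>pre_tokenize</code> is hypothetical, and production pre-tokenizers usually also separate punctuation, which this sketch does not attempt:</p>

```python
def pre_tokenize(text):
    # str.split() with no arguments splits on any run of whitespace
    # and drops leading/trailing whitespace.
    return text.split()


print(pre_tokenize("hmm.., i know i don't know"))
# ['hmm..,', 'i', 'know', 'i', "don't", 'know']
```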
<hr />
<h3 id="heading-4-subword-tokenization">4. <strong>Subword Tokenization</strong></h3>
<p>Subword tokenization is where modern tokenizers (like BPE, WordPiece, Unigram) come into play. These algorithms:</p>
<ul>
<li><p>Break tokens into <strong>smaller sub-units</strong> based on frequency in the corpus</p>
</li>
<li><p>Handle unknown or rare words more efficiently</p>
</li>
<li><p>Reduce the vocabulary size significantly</p>
</li>
</ul>
<blockquote>
<p>After subword tokenization:<br /><code>["hmm..,", "i", "know", "i", "don't", "know"]</code> → <code>["hmm..,", "i", "know", "i", "do", "##n't", "know"]</code></p>
<p>The final output is a sequence of subword units that can be mapped to numerical IDs and passed to the embedding layer.</p>
</blockquote>
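<p>One way to picture this step is a greedy longest-match-first split, the strategy WordPiece-style tokenizers use at inference time. The sketch below assumes a tiny hand-written vocabulary and the <code>##</code> prefix that WordPiece uses for word-internal pieces; the function name and the <code>[UNK]</code> fallback are illustrative choices, not any particular library's API:</p>

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedily take the longest vocabulary entry at each position."""
    pieces = []
    start = 0
    while start != len(word):
        end = len(word)
        match = None
        while end != start:
            candidate = word[start:end]
            if start != 0:
                # Pieces that do not start the word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: give up on the whole word.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, assumed purely for this example.
vocab = {"hmm..,", "i", "know", "do", "##n't"}
words = ["hmm..,", "i", "know", "i", "don't", "know"]
tokens = [piece for word in words for piece in wordpiece_split(word, vocab)]
print(tokens)
# ['hmm..,', 'i', 'know', 'i', 'do', "##n't", 'know']
```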
<hr />
<h3 id="heading-5-vocabulary-learning-training">5. <strong>Vocabulary Learning (Training)</strong></h3>
<p>The subword splits shown above depend on a vocabulary that the tokenizer <strong>learns</strong> from the training corpus before it is applied to new text. This process involves:</p>
<ul>
<li><p>Counting token/subword frequencies</p>
</li>
<li><p>Applying compression or segmentation algorithms</p>
</li>
<li><p>Creating a fixed-size vocabulary used during model training/inference</p>
</li>
</ul>
<h1 id="heading-how-is-vocabulary-learned-in-tokenization"><em>How is Vocabulary Learned in Tokenization?</em></h1>
<p>After collecting a large <strong>text corpus</strong>—consisting of real-world sentences from websites, books, or conversations—the first step is <strong>preprocessing</strong>. This includes normalizing the text: lowercasing, removing extra spaces, standardizing punctuation, and cleaning up special characters.</p>
<p>Once the text is clean, the next critical step is to <strong>learn the vocabulary</strong>.</p>
<hr />
<h3 id="heading-what-does-learning-the-vocabulary-mean">What Does "Learning the Vocabulary" Mean?</h3>
<p>At this stage, the system scans through the entire <strong>preprocessed corpus</strong> to identify <strong>frequently occurring words and subword patterns</strong>. These patterns are not hardcoded or predefined—they <strong>emerge naturally from the data</strong>, based on how often certain fragments appear in the text.<br />This fragmentation can happen in <strong>different ways</strong>, such as:</p>
<ul>
<li><p><strong>Splitting by whitespace</strong> to treat full words as vocabulary items (e.g., <code>"I"</code>, <code>"enjoyed"</code>, <code>"movie"</code>)</p>
</li>
<li><p><strong>Splitting by characters</strong> to form vocabulary at a more granular level (e.g., <code>"e"</code>, <code>"n"</code>, <code>"j"</code>, <code>"o"</code>, <code>"y"</code>)</p>
</li>
</ul>
<p>The vocabulary is constructed from these frequent units so that any future sentence can be tokenized using combinations of these known pieces.</p>
<hr />
<h3 id="heading-example">Example:</h3>
<p>Let’s take the sentence:<br /><code>"I enjoyed the movie"</code></p>
<p>After preprocessing (e.g., lowercasing and standardizing), it might become:<br /><code>"i enjoyed the movie"</code></p>
<p>Now, using <strong>whitespace-based splitting</strong>, the system treats each word as a candidate vocabulary item. As it scans the entire corpus, it identifies which <strong>whole words</strong> appear most frequently. For example:</p>
<ul>
<li><p><code>"i"</code> is a very common pronoun</p>
</li>
<li><p><code>"enjoyed"</code> appears frequently in reviews or narratives</p>
</li>
<li><p><code>"the"</code> is one of the most common English words</p>
</li>
<li><p><code>"movie"</code> is frequently used in entertainment-related texts</p>
</li>
</ul>
<p>These full words are added to the vocabulary set <strong>V</strong>.</p>
<p>So, when a new sentence like <code>"They watched the movie"</code> is encountered, the tokenizer simply splits it by whitespace:</p>
<blockquote>
<p><code>"they watched the movie"</code> → <code>["they", "watched", "the", "movie"]</code></p>
</blockquote>
<p>If all these words exist in <strong>V</strong>, the model can directly map them to token IDs.<br />Otherwise, unknown words (e.g., <code>"watched"</code>) may be replaced with a special <code>[UNK]</code> token—<strong>unless</strong> sub-word strategies are used instead.</p>
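<p>The word-level procedure above can be sketched in a few lines. The corpus, the <code>max_words</code> limit, and the choice to reserve ID 0 for <code>[UNK]</code> are all assumptions made for this example:</p>

```python
from collections import Counter

UNK = "[UNK]"


def learn_vocab(corpus, max_words):
    """Keep the max_words most frequent whitespace-separated words."""
    counts = Counter()
    for sentence in corpus:
        counts.update(sentence.lower().split())
    words = [word for word, _ in counts.most_common(max_words)]
    # ID 0 is reserved for the unknown token.
    return {token: idx for idx, token in enumerate([UNK] + words)}


def encode(sentence, vocab):
    """Map each word to its ID, falling back to the [UNK] ID."""
    return [vocab.get(word, vocab[UNK]) for word in sentence.lower().split()]


corpus = [
    "I enjoyed the movie",
    "They enjoyed the movie",
    "The movie was long",
    "the end",
]
vocab = learn_vocab(corpus, max_words=3)
print(vocab)   # {'[UNK]': 0, 'the': 1, 'movie': 2, 'enjoyed': 3}
print(encode("They watched the movie", vocab))  # [0, 0, 1, 2]
```

<p>With such a small vocabulary both <code>"they"</code> and <code>"watched"</code> collapse to the same ID, which is exactly the information loss that subword strategies try to avoid.</p>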
<p>Once we understand that vocabulary learning is central to how Transformers read and process text, the next step is to explore the <strong>different tokenization techniques</strong> used to build that vocabulary. These techniques define <strong>how the input text is split</strong> and <strong>what kind of units (tokens)</strong> are included in the vocabulary set.</p>
<p>Each technique approaches tokenization differently, with its own way of breaking down sentences, handling unknown words, and optimizing vocabulary size.</p>
<h2 id="heading-1-whitespace-tokenization">1. <strong>Whitespace Tokenization</strong></h2>
<p><strong>Whitespace tokenization</strong> is the most basic and intuitive method of breaking text into tokens. As the name suggests, this technique simply splits text based on spaces. Each space-separated segment is treated as a distinct token, and the resulting <strong>vocabulary</strong> is composed of entire words exactly as they appear in the text.</p>
<p>For example, the sentence <code>"They watched the movie"</code> would be tokenized into <code>["They", "watched", "the", "movie"]</code>. This method is straightforward, requires minimal preprocessing, and is computationally efficient.</p>
<p><img src="https://imgs.search.brave.com/0XoMXxOiLXMoesncdgducFoweTiqiyWjHOiDhJN7u8c/rs:fit:860:0:0:0/g:ce/aHR0cHM6Ly9zbWx0/YXIuY29tL2RpYWdy/YW0tZmlsZXMvdG9r/ZW5pemF0aW9uLWJs/YWNrLWJveC5wbmc" alt="A black box representation of a tokenizer. The text of these three example text fragments has been converted to lowercase and punctuation has been removed before the text is split." /></p>
<h3 id="heading-some-questions-about-whitespace-tokenization"><em>Some Questions About</em> <strong>Whitespace Tokenization</strong></h3>
<h3 id="heading-ais-splitting-the-input-text-into-words-using-whitespace-a-good-approach"><strong>A. Is splitting the input text into words using whitespace a good approach?</strong></h3>
<p>It depends on the use case, but generally, it's too simplistic for modern NLP models. Splitting by whitespace works for basic applications and for languages like English that use clear word boundaries. However, this method <strong>fails to handle rare words, typos, and word variations</strong>, and it completely breaks down for languages that <strong>don’t use spaces at all</strong>, like Chinese or Japanese. Moreover, it treats every new word form as a new token, which can lead to an <strong>explosion in vocabulary size</strong> and poor generalization to unseen inputs.</p>
<h3 id="heading-bdo-we-treat-the-words-love-and-loved-as-separate-tokens">B. Do we treat the words "<em>love</em>" and "<em>loved</em>" as separate tokens?</h3>
<p>Yes, with whitespace tokenization, but this is often not ideal. Each unique word form is treated as a separate token, so <code>"love"</code> and <code>"loved"</code> become distinct entries in the vocabulary. This leads to <strong>data sparsity</strong>: the model must learn similar representations separately for variations of the same root word. Subword-based tokenization helps by splitting <code>"loved"</code> into reusable components like <code>"love"</code> and <code>"##ed"</code>, improving <strong>efficiency</strong> and <strong>parameter sharing</strong> across related tokens.</p>
<h3 id="heading-cwhat-about-languages-like-japanese-which-do-not-use-any-word-delimiters-like-space"><strong>C. What about languages like Japanese, which do not use any word delimiters like space?</strong></h3>
<p>Languages like <strong>Japanese, Chinese, and Thai</strong> do not use spaces to separate words, so whitespace tokenization fails entirely and specialized techniques are needed. <strong>Character-based</strong> or <strong>subword-based</strong> approaches are preferred instead. These methods can break down text into <strong>linguistically meaningful units</strong> (like radicals or characters) or into statistically frequent subword chunks, without relying on explicit spacing.</p>
<h3 id="heading-dwhy-not-treat-each-individual-character-in-a-language-as-a-vocabulary"><strong>D. Why not treat each individual character in a language as a vocabulary item?</strong></h3>
<p>It's possible, and sometimes used, but it comes with trade-offs, which is our next topic.</p>
<h2 id="heading-2character-level-tokenization">2. <strong>Character-Level Tokenization</strong></h2>
<p>Character-level tokenization is a simple yet powerful technique where the text is broken down into <strong>individual characters</strong>, rather than words or subwords. In this method, the vocabulary consists of <strong>all possible characters</strong> in the corpus, including <strong>letters</strong>, <strong>digits</strong>, <strong>punctuation marks</strong>, and <strong>special symbols</strong> (e.g., <code>"a"</code>, <code>"b"</code>, <code>"1"</code>, <code>"."</code>, <code>"#"</code>).</p>
<p>One of the biggest advantages of this approach is that it creates a <strong>very small vocabulary</strong>. As long as every character in the input was also present when the vocabulary was built, the model <strong>never encounters unknown tokens</strong>, making this method especially robust for handling typos, rare words, or even code and emojis.</p>
<p>However, character-level tokenization comes with its challenges. Because each word is split into many small units, the resulting <strong>input sequences become significantly longer</strong>. For example, the word <code>"movie"</code> would be tokenized as <code>["m", "o", "v", "i", "e"]</code>. As a result, the model must process more tokens per sentence and work harder to capture meaningful patterns and long-range dependencies. This often leads to <strong>slower training</strong> and <strong>higher computational cost</strong>, and can make learning <strong>semantic relationships</strong> more difficult compared to word or subword-level tokenization.</p>
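<p>The length difference is easy to see on a short example. This sketch simply compares whitespace splitting with character splitting on one sentence chosen for illustration:</p>

```python
sentence = "they watched the movie"

word_tokens = sentence.split()   # one token per word
char_tokens = list(sentence)     # one token per character, spaces included

print(list("movie"))             # ['m', 'o', 'v', 'i', 'e']
print(len(word_tokens))          # 4
print(len(char_tokens))          # 22
```

<p>The same sentence is more than five times longer at the character level, and since attention compares every token with every other token, that extra length is costly.</p>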
<p>Despite these limitations, character-level tokenization can be useful in specific applications like <strong>morphologically rich languages</strong>, <strong>noisy text</strong>, or <strong>low-resource settings</strong>, where other tokenization methods may fail.</p>
<h2 id="heading-challenges-in-building-a-vocabulary-for-tokenization">Challenges in Building a Vocabulary for Tokenization</h2>
<p>Designing an effective vocabulary is not as straightforward as it might seem. It involves a delicate balance between coverage, efficiency, and model performance. Here are some of the major challenges faced when building vocabularies for Transformer models:</p>
<hr />
<h3 id="heading-a-what-should-be-the-size-of-the-vocabulary">A. What Should Be the Size of the Vocabulary?</h3>
<p>Choosing the <strong>right vocabulary size</strong> is a trade-off. A larger vocabulary gives the model access to more complete word representations, reducing the number of unknown or broken-down tokens. However, it comes at a cost:</p>
<ul>
<li><p>A larger vocabulary means a <strong>bigger embedding matrix</strong>, which increases memory usage.</p>
</li>
<li><p>It also adds <strong>computational overhead</strong> during the softmax operation in the output layer.</p>
</li>
</ul>
<p>On the other hand, a smaller vocabulary saves space and speeds up computation but increases the risk of splitting words too aggressively or encountering unknown words.</p>
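<p>To make the memory side of this trade-off concrete, the embedding matrix has one row per vocabulary entry and one column per embedding dimension. The sketch below assumes an embedding width of 768 and 4-byte (float32) parameters; both numbers are illustrative choices, not properties of any specific model discussed here:</p>

```python
def embedding_parameters(vocab_size, hidden_size):
    # One vector of length hidden_size per vocabulary entry.
    return vocab_size * hidden_size


HIDDEN_SIZE = 768      # assumed embedding width
BYTES_PER_PARAM = 4    # assumed float32 storage

for vocab_size in (50_000, 250_000):
    params = embedding_parameters(vocab_size, HIDDEN_SIZE)
    megabytes = params * BYTES_PER_PARAM / 1e6
    print(f"{vocab_size:,} tokens: {params:,} parameters, about {megabytes:.1f} MB")

# 50,000 tokens: 38,400,000 parameters, about 153.6 MB
# 250,000 tokens: 192,000,000 parameters, about 768.0 MB
```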
<p><strong>Key Question:</strong><br /><mark>What is the optimal vocabulary size that balances memory, speed, and accuracy?</mark></p>
<hr />
<h3 id="heading-b-out-of-vocabulary-oov-words">B. Out-of-Vocabulary (OOV) Words</h3>
<p>When the vocabulary is restricted—say, from 250,000 tokens down to 50,000—there will inevitably be words in the input text that are <strong>not in the vocabulary</strong>. These are known as <strong>out-of-vocabulary (OOV)</strong> words.</p>
<p>If the tokenizer cannot recognize a word, it may replace it with a generic <code>[UNK]</code> (unknown) token. This leads to <strong>loss of information</strong>, especially in tasks like sentiment analysis or translation where specific words carry important meaning.</p>
<p><strong>Solution:</strong><br />Modern models use <strong>subword tokenization</strong>, which breaks unknown words into smaller, known parts, ensuring no word is truly “unknown.”</p>
<hr />
<h3 id="heading-c-handling-misspelled-words-in-the-corpus">C. Handling Misspelled Words in the Corpus</h3>
<p>Text corpora are often scraped from the web, and naturally, they include <strong>spelling mistakes</strong>, typos, or non-standard usage. If these mistakes are treated as valid vocabulary entries, they unnecessarily increase the vocabulary size and reduce its quality.</p>
<p>For example, treating <code>"enjooyed"</code> (a misspelled form of “enjoyed”) as a unique token is unhelpful and inefficient.</p>
<p><strong>Solution:</strong><br />Preprocessing should include <strong>spell correction or normalization</strong> steps. Additionally, tokenizers that operate at subword or character level are better equipped to handle such noisy inputs.</p>
<hr />
<h3 id="heading-d-the-open-vocabulary-problem">D. The Open Vocabulary Problem</h3>
<p>Languages are constantly evolving, and new words are coined regularly—especially in <strong>agglutinative languages</strong> (like Turkish or Finnish), where new words can be formed by combining roots and suffixes.</p>
<p>This means that in theory, the number of possible words in a language is <strong>infinite</strong>. A fixed vocabulary can never fully cover this. This is known as the <strong>open vocabulary problem</strong>.</p>
<p><strong>Solution:</strong><br />Subword-based approaches can adapt to this scenario by <strong>constructing new words</strong> from known pieces. This makes models more flexible and better suited for tasks like <strong>machine translation</strong>, where word diversity is high.</p>
<hr />
<h3 id="heading-final-thought">Final Thought</h3>
<p>The challenges in vocabulary design underscore <strong>why tokenization is such a critical part</strong> of NLP pipelines. The better we handle these edge cases, the more robust and scalable our language models become.</p>
<p>Now, let’s explore subword tokenization — a powerful technique designed to address these vocabulary challenges effectively.</p>
<h1 id="heading-subword-tokenization-the-best-of-both-worlds">Subword Tokenization: The Best of Both Worlds</h1>
<p>Subword tokenization strikes a powerful balance between <strong>word-level</strong> and <strong>character-level</strong> tokenization. Instead of treating every word as a whole or breaking everything down into single characters, subword tokenizers split text into <strong>frequent and meaningful fragments</strong>, known as <em>subwords</em>.</p>
<p>The vocabulary in this method is <strong>moderately sized</strong>, carefully built by scanning the entire corpus for <strong>frequently occurring patterns</strong>. Common words are preserved as full tokens (e.g., <code>"know"</code>, <code>"movie"</code>), while <strong>rare or complex words</strong> are broken into smaller, reusable subunits (e.g., <code>"don’t"</code> becomes <code>"do"</code> and <code>"n’t"</code>). This allows the model to represent virtually <strong>any word</strong>, including those it has never seen before, without resorting to an <code>[UNK]</code> token.</p>
<p>For example, in the sentence:</p>
<blockquote>
<p><em>"Hmm.. I know, I don't know"</em></p>
</blockquote>
<p>The subword tokenizer may rely on a vocabulary like:<br /><code>V = {Hmm.., I, know, do, n't}</code><br />and tokenize the sentence as:<br /><code>[Hmm.., I, know, I, do, n't, know]</code></p>
<p>Even though <code>"don't"</code> wasn’t stored as a single token, the tokenizer successfully reconstructs it using known parts. This enables <strong>flexibility</strong>, <strong>compact vocabulary</strong>, and <strong>robust handling of unseen words</strong>—one reason why subword tokenization is the <strong>preferred method in modern LLMs</strong> like BERT, GPT, and LLaMA.</p>
<p>Subword tokenization is not a one-size-fits-all solution. While its core goal remains the same—<strong>balancing vocabulary size with coverage and flexibility</strong>—different algorithms approach this goal in different ways. Each method has its own strategy for identifying meaningful subword units based on <strong>frequency patterns, data statistics, or probabilistic modeling</strong>.</p>
<p>In modern NLP, three subword tokenization techniques stand out as the most widely used:</p>
<ul>
<li><p><strong>Byte Pair Encoding (BPE)</strong></p>
</li>
<li><p><strong>WordPiece Tokenization</strong></p>
</li>
<li><p><strong>Unigram Tokenization</strong> (often used through the SentencePiece library)</p>
</li>
</ul>
<p>In the sections that follow, we'll explore how each of these methods works and why they’re favored in popular models like GPT, BERT, and T5.</p>
<h2 id="heading-byte-pair-encoding-bpe"><strong>Byte Pair Encoding (BPE)</strong></h2>
<p><strong>Byte Pair Encoding (BPE)</strong> is one of the earliest and most popular subword tokenization algorithms used in modern NLP. Originally developed for data compression, BPE was adapted for tokenization to help models handle rare and unseen words more effectively without bloating the vocabulary.</p>
<h3 id="heading-how-bpe-works">How BPE Works</h3>
<ol>
<li><p><strong>Initialize the Vocabulary</strong>: Each word is split into characters, and each word ends with a special end-of-word symbol like <code>&lt;/w&gt;</code>.</p>
</li>
<li><p><strong>Count Pairs</strong>: Count all adjacent symbol pairs across the vocabulary.</p>
</li>
<li><p><strong>Merge Most Frequent Pair</strong>: Find the most frequent pair and merge it into a new symbol.</p>
</li>
<li><p><strong>Repeat</strong>: Continue merging for a fixed number of iterations or until no more merges are possible.</p>
</li>
</ol>
<h2 id="heading-byte-pair-encoding-bpe-step-by-step-algorithm">Byte Pair Encoding (BPE) — Step-by-Step Algorithm</h2>
<p>Here's how the BPE algorithm works in practice:</p>
<h3 id="heading-step-by-step-breakdown">Step-by-Step Breakdown:</h3>
<ol>
<li><p><strong>Initialize the Vocabulary</strong><br /> Start with a dictionary where each key is a word and the value is its frequency (how often it appears in the corpus).</p>
</li>
<li><p><strong>Mark Word Boundaries</strong><br /> Append a special token <code>&lt;/w&gt;</code> at the end of each word to indicate word boundaries. This lets the algorithm tell a subword that ends a word apart from the same characters inside a longer word (e.g., the <code>low</code> in <code>low</code> vs. the <code>low</code> in <code>lower</code>).</p>
</li>
<li><p><strong>Set Merge Count</strong><br /> Decide how many merges to perform — this number is a hyperparameter (<code>num_merges</code>) that controls the final vocabulary size.</p>
</li>
<li><p><strong>Build Symbol Pairs Table</strong><br /> Break each word into individual characters and track the frequency of every adjacent symbol pair (like <code>l o</code>, <code>o w</code>, etc.).</p>
</li>
<li><p><strong>Find Most Frequent Pair</strong><br /> Identify the symbol pair that appears most frequently across all words in the vocabulary.</p>
</li>
<li><p><strong>Merge the Pair</strong><br /> Replace every occurrence of that symbol pair with a new merged token (e.g., <code>l o</code> becomes <code>lo</code>).</p>
</li>
<li><p><strong>Repeat</strong><br /> Go back to Step 4 and repeat the process for the number of times specified in the merge count.</p>
</li>
</ol>
<h3 id="heading-python-implementation"><strong>Python Implementation</strong></h3>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> re, collections

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_stats</span>(<span class="hljs-params">vocab</span>):</span>
    pairs = collections.defaultdict(int)
    <span class="hljs-keyword">for</span> word, freq <span class="hljs-keyword">in</span> vocab.items():
        symbols = word.split()
        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(symbols) - <span class="hljs-number">1</span>):
            pairs[symbols[i], symbols[i + <span class="hljs-number">1</span>]] += freq
    <span class="hljs-keyword">return</span> pairs

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">merge_vocab</span>(<span class="hljs-params">pair, v_in</span>):</span>
    v_out = {}
    bigram = re.escape(<span class="hljs-string">' '</span>.join(pair))
    p = re.compile(<span class="hljs-string">r'(?&lt;!\S)'</span> + bigram + <span class="hljs-string">r'(?!\S)'</span>)
    <span class="hljs-keyword">for</span> word <span class="hljs-keyword">in</span> v_in:
        w_out = p.sub(<span class="hljs-string">''</span>.join(pair), word)
        v_out[w_out] = v_in[word]
    <span class="hljs-keyword">return</span> v_out

<span class="hljs-comment"># Initial vocabulary: words split into characters</span>
vocab = {
    <span class="hljs-string">'l o w &lt;/w&gt;'</span>: <span class="hljs-number">5</span>,
    <span class="hljs-string">'l o w e r &lt;/w&gt;'</span>: <span class="hljs-number">2</span>,
    <span class="hljs-string">'n e w e s t &lt;/w&gt;'</span>: <span class="hljs-number">6</span>,
    <span class="hljs-string">'w i d e s t &lt;/w&gt;'</span>: <span class="hljs-number">3</span>
}

num_merges = <span class="hljs-number">10</span>

<span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(num_merges):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    print(<span class="hljs-string">f"Step <span class="hljs-subst">{i+<span class="hljs-number">1</span>}</span>: Merging pair <span class="hljs-subst">{best}</span>"</span>)
    vocab = merge_vocab(best, vocab)
</code></pre>
<p><strong>Sample Output</strong></p>
<pre><code class="lang-python">Step <span class="hljs-number">1</span>: Merging pair (<span class="hljs-string">'e'</span>, <span class="hljs-string">'s'</span>)
Step <span class="hljs-number">2</span>: Merging pair (<span class="hljs-string">'es'</span>, <span class="hljs-string">'t'</span>)
Step <span class="hljs-number">3</span>: Merging pair (<span class="hljs-string">'est'</span>, <span class="hljs-string">'&lt;/w&gt;'</span>)
...
</code></pre>
<p>To understand BPE in action, let’s use the following sentence as our example:</p>
<blockquote>
<p><strong>"knowing the name of something is different from knowing something. knowing something about everything isn't bad"</strong></p>
</blockquote>
<h3 id="heading-step-1-count-word-frequencies">Step 1: Count Word Frequencies</h3>
<p>The very first step is to tokenize the sentence into words and count their occurrences. Each word is represented as a sequence of characters, with a special <code>&lt;/w&gt;</code> symbol added at the end to mark the word boundary.</p>
<h3 id="heading-word-frequency-table">Word Frequency Table</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Word</td><td>Frequency</td></tr>
</thead>
<tbody>
<tr>
<td><code>k n o w i n g &lt;/w&gt;</code></td><td>3</td></tr>
<tr>
<td><code>t h e &lt;/w&gt;</code></td><td>1</td></tr>
<tr>
<td><code>n a m e &lt;/w&gt;</code></td><td>1</td></tr>
<tr>
<td><code>o f &lt;/w&gt;</code></td><td>1</td></tr>
<tr>
<td><code>s o m e t h i n g &lt;/w&gt;</code></td><td>2</td></tr>
<tr>
<td><code>i s &lt;/w&gt;</code></td><td>1</td></tr>
<tr>
<td><code>d i f f e r e n t &lt;/w&gt;</code></td><td>1</td></tr>
<tr>
<td><code>f r o m &lt;/w&gt;</code></td><td>1</td></tr>
<tr>
<td><code>s o m e t h i n g . &lt;/w&gt;</code></td><td>1</td></tr>
<tr>
<td><code>a b o u t &lt;/w&gt;</code></td><td>1</td></tr>
<tr>
<td><code>e v e r y t h i n g &lt;/w&gt;</code></td><td>1</td></tr>
<tr>
<td><code>i s n ' t &lt;/w&gt;</code></td><td>1</td></tr>
<tr>
<td><code>b a d &lt;/w&gt;</code></td><td>1</td></tr>
</tbody>
</table>
</div><h2 id="heading-step-2-compute-initial-token-frequencies">Step 2: Compute Initial Token Frequencies</h2>
<p>After splitting each word into characters, we count how often each character (symbol) appears across all word occurrences. These single symbols form the initial vocabulary; Byte Pair Encoding then grows it by repeatedly merging the most common adjacent pair.</p>
<h3 id="heading-character-frequency-table">Character Frequency Table</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Character</td><td>Frequency</td><td>Explanation</td></tr>
</thead>
<tbody>
<tr>
<td><code>k</code></td><td>3</td><td>Appears in 3 instances of "knowing"</td></tr>
<tr>
<td><code>n</code></td><td>13</td><td>Appears in "knowing", "name", "something", "isn't", "everything", etc.</td></tr>
<tr>
<td><code>o</code></td><td>9</td><td>Found in "knowing", "of", "something", etc.</td></tr>
<tr>
<td><code>i</code></td><td>10</td><td>Common in "knowing", "is", "different", etc.</td></tr>
<tr>
<td><code>g</code></td><td>7</td><td>Seen in "knowing", "something", "everything"</td></tr>
<tr>
<td><code>&lt;/w&gt;</code></td><td>16</td><td>One per word occurrence: the 13 distinct entries above occur 16 times in total (3 + 2 + 11 × 1)</td></tr>
<tr>
<td>...</td><td>...</td><td>Other characters similarly counted</td></tr>
</tbody>
</table>
</div><p><strong>Initial Vocabulary Size:</strong><br />Total number of unique individual symbols (characters) = <strong>22</strong></p>
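<p>These counts can be reproduced with a short script. It assumes the same whitespace split used in the word table (so <code>"something."</code> keeps its full stop) and weights each symbol by how often its word occurs:</p>

```python
from collections import Counter

sentence = (
    "knowing the name of something is different from knowing something. "
    "knowing something about everything isn't bad"
)
END = "</w>"  # end-of-word marker

# Word frequency table: 13 distinct entries, 16 occurrences.
word_freq = Counter(sentence.split())

# Symbol frequencies, weighted by the frequency of each word.
symbol_freq = Counter()
for word, freq in word_freq.items():
    for symbol in list(word) + [END]:
        symbol_freq[symbol] += freq

print(symbol_freq["k"], symbol_freq["n"], symbol_freq["o"])   # 3 13 9
print(symbol_freq["i"], symbol_freq["g"], symbol_freq[END])   # 10 7 16
print(len(symbol_freq))                                       # 22
```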
<h2 id="heading-step-3-count-frequency-of-symbol-pairs-byte-pairs">Step 3: Count Frequency of Symbol Pairs (Byte-Pairs)</h2>
<p>After building the initial vocabulary of characters, the next step in the BPE algorithm is to count how often <strong>adjacent character pairs</strong> (or byte-pairs) occur across all words. These are candidates for merging.</p>
<h3 id="heading-byte-pair-frequency-table">Byte-Pair Frequency Table</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Symbol Pair</td><td>Frequency</td><td>Explanation</td></tr>
</thead>
<tbody>
<tr>
<td>(<code>k</code>, <code>n</code>)</td><td>3</td><td>From "knowing" ×3</td></tr>
<tr>
<td>(<code>n</code>, <code>o</code>)</td><td>3</td><td>From "knowing" ×3</td></tr>
<tr>
<td>(<code>o</code>, <code>w</code>)</td><td>3</td><td>From "knowing" ×3</td></tr>
<tr>
<td>(<code>w</code>, <code>i</code>)</td><td>3</td><td>From "knowing" ×3</td></tr>
<tr>
<td>(<code>i</code>, <code>n</code>)</td><td>7</td><td>From "knowing" ×3, "something" ×2, "something." and "everything"</td></tr>
<tr>
<td>(<code>n</code>, <code>g</code>)</td><td>7</td><td>Often ends “-ing”</td></tr>
<tr>
<td>(<code>g</code>, <code>&lt;/w&gt;</code>)</td><td>6</td><td>Word-ending in “knowing”, “something”, etc.</td></tr>
<tr>
<td>(<code>t</code>, <code>h</code>)</td><td>5</td><td>From “the”, “something”, “everything”</td></tr>
<tr>
<td>...</td><td>...</td><td>And so on...</td></tr>
</tbody>
</table>
</div><p>The <strong>most frequent byte-pair</strong> here is <strong>(<code>i</code>, <code>n</code>)</strong> with a count of 7, tied with (<code>n</code>, <code>g</code>). Here we simply take (<code>i</code>, <code>n</code>), the first of the two to appear, so <code>"in"</code> is merged into a single token in the next step.</p>
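<p>Pair counting follows the same pattern as character counting. The sketch below is illustrative only (the function name and the <code>WORD_FREQS</code> dictionary are assumptions, not part of any library) and weights each adjacent pair by the frequency of the word it comes from.</p>

```python
from collections import Counter

# Toy corpus from this walkthrough: word -> number of occurrences.
WORD_FREQS = {
    "knowing": 3, "the": 1, "name": 1, "of": 1, "something": 2,
    "is": 1, "different": 1, "from": 1, "something.": 1, "about": 1,
    "everything": 1, "isn't": 1, "bad": 1,
}


def pair_frequencies(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in word_freqs.items():
        symbols = tuple(word) + ("</w>",)
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


pairs = pair_frequencies(WORD_FREQS)
for pair, freq in pairs.most_common(4):
    print(pair, freq)
```

<p>Pairs are only counted inside a word, never across word boundaries, which is why the <code>&lt;/w&gt;</code> marker is appended before counting.</p>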
<h2 id="heading-4merge-the-most-frequent-byte-pair-i-n">Step 4: Merge the Most Frequent Byte-Pair <code>('i', 'n')</code></h2>
<p>The most frequent pair was <code>'i'</code> and <code>'n'</code> with a frequency of 7. So we <strong>merge</strong> all instances of <code>'i'</code> followed by <code>'n'</code> into a single token <code>'in'</code>.</p>
<h3 id="heading-result-after-first-merge">Result After First Merge</h3>
<h4 id="heading-updated-words">Updated Words</h4>
<p>All instances of <code>'i'</code> + <code>'n'</code> are replaced with <code>'in'</code>:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Word Before</td><td>Word After</td></tr>
</thead>
<tbody>
<tr>
<td><code>k n o w i n g &lt;/w&gt;</code></td><td><code>k n o w in g &lt;/w&gt;</code></td></tr>
<tr>
<td><code>s o m e t h i n g &lt;/w&gt;</code></td><td><code>s o m e t h in g &lt;/w&gt;</code></td></tr>
<tr>
<td><code>e v e r y t h i n g &lt;/w&gt;</code></td><td><code>e v e r y t h in g &lt;/w&gt;</code></td></tr>
<tr>
<td><code>i s n ' t &lt;/w&gt;</code></td><td><code>i s n ' t &lt;/w&gt;</code> (unchanged: <code>i</code> and <code>n</code> are not adjacent)</td></tr>
<tr>
<td>...</td><td>...</td></tr>
</tbody>
</table>
</div><h4 id="heading-vocabulary-update">Vocabulary Update</h4>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Token</td><td>Frequency</td></tr>
</thead>
<tbody>
<tr>
<td><code>in</code></td><td>7</td></tr>
<tr>
<td><code>i</code></td><td>10 - 7 = 3</td></tr>
<tr>
<td><code>n</code></td><td>13 - 7 = 6</td></tr>
<tr>
<td colspan="2"><em>(others remain unchanged)</em></td></tr>
</tbody>
</table>
</div><hr />
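<p>A single merge can be sketched as a left-to-right scan over a word's symbols. The function name <code>merge_pair</code> is an assumption made for this example. Note that a word such as <code>isn't</code>, where <code>i</code> and <code>n</code> are separated by <code>s</code>, is left untouched.</p>

```python
def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2  # skip both halves of the merged pair
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


knowing = tuple("knowing") + ("</w>",)
isnt = tuple("isn't") + ("</w>",)
print(merge_pair(knowing, ("i", "n")))
print(merge_pair(isnt, ("i", "n")))
```

<p>Each of the seven merged occurrences removes one <code>i</code> and one <code>n</code> from the symbol counts, which is where the <code>10 - 7</code> and <code>13 - 7</code> entries in the vocabulary update come from.</p>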
<h3 id="heading-table-for-blog-post-merge-state">Post-Merge State</h3>
<h4 id="heading-word-list-after-1st-merge">Word List After 1st Merge</h4>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Word</td><td>Frequency</td></tr>
</thead>
<tbody>
<tr>
<td><code>k n o w in g &lt;/w&gt;</code></td><td>3</td></tr>
<tr>
<td><code>t h e &lt;/w&gt;</code></td><td>1</td></tr>
<tr>
<td><code>n a m e &lt;/w&gt;</code></td><td>1</td></tr>
<tr>
<td><code>o f &lt;/w&gt;</code></td><td>1</td></tr>
<tr>
<td><code>s o m e t h in g &lt;/w&gt;</code></td><td>2</td></tr>
<tr>
<td><code>i s &lt;/w&gt;</code></td><td>1</td></tr>
<tr>
<td><code>d i f f e r e n t &lt;/w&gt;</code></td><td>1</td></tr>
<tr>
<td><code>f r o m &lt;/w&gt;</code></td><td>1</td></tr>
<tr>
<td><code>s o m e t h in g . &lt;/w&gt;</code></td><td>1</td></tr>
<tr>
<td><code>a b o u t &lt;/w&gt;</code></td><td>1</td></tr>
<tr>
<td><code>e v e r y t h in g &lt;/w&gt;</code></td><td>1</td></tr>
<tr>
<td><code>i s n ' t &lt;/w&gt;</code></td><td>1</td></tr>
<tr>
<td><code>b a d &lt;/w&gt;</code></td><td>1</td></tr>
</tbody>
</table>
</div><p>Repeat these steps until either the maximum number of merges is reached or there are no pairs left to merge.</p>
<p><strong>After 45 merges this is what the vocabulary looks like.</strong></p>
<p><strong>Tokens</strong><br /><code>'k'</code><br /><code>'n'</code><br /><code>'o'</code><br /><code>'i'</code><br /><code>'&lt;/w&gt;'</code></p>
<p>….</p>
<p>….</p>
<p>…<br /><code>'in'</code><br /><code>'ing'</code></p>
<p>……</p>
<p>……<br /><code>'knowing&lt;/w&gt;'</code><br /><code>'the&lt;/w&gt;'</code><br /><code>'name&lt;/w&gt;'</code><br /><code>'of&lt;/w&gt;'</code><br /><code>'something&lt;/w&gt;'</code><br /><code>'is&lt;/w&gt;'</code><br /><code>'different&lt;/w&gt;'</code><br /><code>'from&lt;/w&gt;'</code><br /><code>'something.&lt;/w&gt;'</code><br /><code>'about&lt;/w&gt;'</code><br /><code>'everyth'</code><br /><code>'isn't&lt;/w&gt;'</code><br /><code>'bad&lt;/w&gt;'</code></p>
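<p>Putting the steps together, the whole training loop (count pairs, merge the most frequent one, repeat) can be sketched as follows. This is a toy version under stated assumptions: the function names are made up for illustration, ties are broken by whichever pair was counted first, and the exact order of later merges may differ from the listing above.</p>

```python
from collections import Counter

WORD_FREQS = {
    "knowing": 3, "the": 1, "name": 1, "of": 1, "something": 2,
    "is": 1, "different": 1, "from": 1, "something.": 1, "about": 1,
    "everything": 1, "isn't": 1, "bad": 1,
}


def count_pairs(vocab):
    """Count adjacent symbol pairs over a {symbols: frequency} vocabulary."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges; return the merge list and final words."""
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:  # every word is already a single token
            break
        best = max(pairs, key=pairs.get)  # first-counted pair wins ties
        merges.append(best)
        vocab = {merge_pair(s, best): f for s, f in vocab.items()}
    return merges, vocab


merges, vocab = train_bpe(WORD_FREQS, num_merges=3)
print(merges)
```

<p>The learned merge list, in order, is the actual tokenizer: new text is tokenized by replaying these merges, which is why the merge order has to be stored along with the vocabulary.</p>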
<h2 id="heading-2wordpiece-tokenization-smarter-subwords-for-nlp">2. WordPiece Tokenization: Smarter Subwords for NLP</h2>
<p>After Byte Pair Encoding (BPE), another widely used subword tokenization algorithm is <strong>WordPiece</strong> — the one originally used by <strong>BERT</strong> and other Transformer-based models from Google.</p>
<p>WordPiece builds its vocabulary with the same kind of merge loop as BPE, but it differs in <strong>how it chooses the pair to merge</strong>: instead of raw pair frequency, it uses a score that also accounts for how common each part is on its own.</p>
<p><em>It breaks down words into the most likely combination of known subword units — even if the full word isn’t in the vocabulary.</em></p>
<h3 id="heading-how-wordpiece-tokenization-works">How WordPiece Tokenization works:</h3>
<p>In Byte Pair Encoding (BPE), we aim to merge the pair of tokens that occurs most frequently in the current vocabulary. This iterative process helps in building subword units that are common in the text corpus, making tokenization more efficient.</p>
<p>But what happens when <strong>multiple token pairs occur with the same frequency</strong>?</p>
<h4 id="heading-the-tie-breaker-problem">The Tie-Breaker Problem</h4>
<p>Let’s consider this example from the table:</p>
<ul>
<li><p>Pair <code>('i', 'n')</code> occurs <strong>7 times</strong></p>
</li>
<li><p>Pair <code>('n', 'g')</code> also occurs <strong>7 times</strong></p>
</li>
</ul>
<p>So, <strong>which pair should we merge first</strong>?</p>
<h4 id="heading-the-solution-frequency-aware-scoring">The Solution: Frequency-Aware Scoring</h4>
<p>WordPiece sidesteps the question by ranking pairs with a <strong>scoring function</strong> instead of raw frequency. The score takes into account not just the frequency of the pair, but also the individual frequencies of the tokens involved.</p>
<p>The score is calculated as:</p>
<p>$$\text{score} = \frac{\text{count}(\alpha, \beta)}{\text{count}(\alpha) \cdot \text{count}(\beta)}$$</p><p>Where:</p>
<p>$$\begin{aligned} \alpha, \beta &amp; \quad \text{are the tokens in the pair} \\ \text{count}(\alpha, \beta) &amp; \quad \text{is the frequency of the token pair} \\ \text{count}(\alpha),\ \text{count}(\beta) &amp; \quad \text{are the frequencies of the individual tokens} \end{aligned}$$</p><p><strong>Why This Works</strong></p>
<p>This method <strong>favors merging rare tokens</strong>, under the assumption that rare token pairs form more meaningful new units. If both tokens in a pair are common, the product <code>count(α) · count(β)</code> will be high, which <strong>lowers the score</strong>.</p>
<p>Hence, we select the pair with the <strong>highest score</strong>—which often means the most contextually relevant and least redundant merge.</p>
<h4 id="heading-example-table">Example Table</h4>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Token Pair</td><td>Frequency</td></tr>
</thead>
<tbody>
<tr>
<td>('i', 'n')</td><td>7</td></tr>
<tr>
<td>('n', 'g')</td><td>7</td></tr>
<tr>
<td>('t', 'h')</td><td>5</td></tr>
<tr>
<td>('k', 'n')</td><td>3</td></tr>
<tr>
<td>('n', 'o')</td><td>3</td></tr>
<tr>
<td>...</td><td>...</td></tr>
</tbody>
</table>
</div><p>Even though <code>('i', 'n')</code> and <code>('n', 'g')</code> have the same frequency, the one with the <strong>lower individual token frequencies</strong> will be prioritized.</p>
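<p>As a quick worked check, using the character counts from the BPE example (<code>i</code> = 10, <code>n</code> = 13, <code>g</code> = 7):</p>
<p>$$\text{score}(i, n) = \frac{7}{10 \cdot 13} \approx 0.054 \qquad \text{score}(n, g) = \frac{7}{13 \cdot 7} \approx 0.077$$</p>
<p>Under this score the tie goes to <code>('n', 'g')</code>, because <code>g</code> is rarer than <code>i</code>.</p>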
<h2 id="heading-example-table-from-corpus">Example Table from Corpus</h2>
<p>Here is a sample score table for one WordPiece scoring step, using the character and pair counts from the BPE example:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Token Pair</td><td>Pair Freq</td><td>Freq(α)</td><td>Freq(β)</td><td>Score</td></tr>
</thead>
<tbody>
<tr>
<td>('k', 'n')</td><td>3</td><td>3</td><td>13</td><td>0.077</td></tr>
<tr>
<td>('n', 'o')</td><td>3</td><td>13</td><td>9</td><td>0.026</td></tr>
<tr>
<td>('o', 'w')</td><td>3</td><td>9</td><td>3</td><td>0.111</td></tr>
<tr>
<td>('w', 'i')</td><td>3</td><td>3</td><td>10</td><td>0.100</td></tr>
<tr>
<td>('i', 'n')</td><td>7</td><td>10</td><td>13</td><td>0.054</td></tr>
<tr>
<td>('n', 'g')</td><td>7</td><td>13</td><td>7</td><td>0.077</td></tr>
<tr>
<td>('g', '.')</td><td>1</td><td>7</td><td>1</td><td>0.143</td></tr>
<tr>
<td>('t', 'h')</td><td>5</td><td>8</td><td>5</td><td>0.125</td></tr>
<tr>
<td>('h', 'e')</td><td>1</td><td>5</td><td>9</td><td>0.022</td></tr>
<tr>
<td>('e', '&lt;/w&gt;')</td><td>2</td><td>9</td><td>16</td><td>0.014</td></tr>
<tr>
<td>('a', 'd')</td><td>1</td><td>3</td><td>2</td><td><strong>0.167</strong></td></tr>
</tbody>
</table>
</div><blockquote>
<p><strong>Merge Decision</strong>: Among the pairs listed, <code>('a', 'd')</code> has the <strong>highest score (1/6 ≈ 0.167)</strong> and is chosen for the next merge.</p>
</blockquote>
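<p>The scores above can be recomputed directly from the toy corpus. The sketch below is an illustration under two assumptions: the function name is hypothetical, and it reuses the <code>&lt;/w&gt;</code> end-of-word marker from the BPE walkthrough so that the numbers line up with the table (BERT's WordPiece marks word-internal pieces with a <code>##</code> prefix instead).</p>

```python
from collections import Counter

WORD_FREQS = {
    "knowing": 3, "the": 1, "name": 1, "of": 1, "something": 2,
    "is": 1, "different": 1, "from": 1, "something.": 1, "about": 1,
    "everything": 1, "isn't": 1, "bad": 1,
}


def wordpiece_scores(word_freqs):
    """score(a, b) = count(a, b) / (count(a) * count(b)) for every adjacent pair."""
    symbol_counts, pair_counts = Counter(), Counter()
    for word, freq in word_freqs.items():
        symbols = tuple(word) + ("</w>",)
        for symbol in symbols:
            symbol_counts[symbol] += freq
        for pair in zip(symbols, symbols[1:]):
            pair_counts[pair] += freq
    return {
        (a, b): n / (symbol_counts[a] * symbol_counts[b])
        for (a, b), n in pair_counts.items()
    }


scores = wordpiece_scores(WORD_FREQS)
for pair in [("i", "n"), ("n", "g"), ("t", "h"), ("a", "d")]:
    print(pair, round(scores[pair], 3))
print(max(scores, key=scores.get))
```

<p>Run over every pair in the toy corpus rather than just the sample rows, the top score goes to <code>('r', 'y')</code> (1/3), because <code>y</code> occurs only once. This shows how strongly the score rewards rare symbols.</p>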
<h2 id="heading-3sentencepiece-tokenizer-a-modern-subword-tokenizer">3. SentencePiece Tokenizer: A Modern Subword Tokenizer</h2>
<p>Tokenization is a foundational step in natural language processing (NLP), where text is broken into smaller units—typically words or subwords. Traditional approaches rely on whitespace and language-specific rules, but they struggle with multilingual texts, typos, or unseen words.</p>
<p><strong>SentencePiece</strong> offers a flexible and language-independent solution that works directly on raw text, supporting both <strong>BPE</strong> and <strong>Unigram</strong> algorithms.</p>
<h2 id="heading-motivation-why-sentencepiece">Motivation: Why SentencePiece?</h2>
<p>Unlike standard tokenizers that split on spaces, <strong>SentencePiece treats input as a raw character sequence</strong>, keeping whitespace as an ordinary symbol (shown as <code>▁</code>). This lets it work for languages without spaces (e.g., Japanese, Chinese), and for tasks where whitespace is unreliable (e.g., OCR, code, user queries).</p>
<p>It also supports <strong>probabilistic tokenization</strong>, helping generate multiple tokenization paths—a huge advantage in training robust models.</p>
<hr />
<h2 id="heading-a-word-can-have-many-subword-segmentations">A Word Can Have Many Subword Segmentations</h2>
<p>Let’s explore how SentencePiece (or BPE) handles the word <code>"hello"</code> using a fixed vocabulary.</p>
<p>Assume our vocabulary is:</p>
<p>$$\mathcal{V} = \{ h, e, l, o, he, el, lo, ll, hell \}$$</p><p>There are multiple valid segmentations for the word <code>"hello"</code>:</p>
<p>$$\begin{aligned} \mathbf{x_1} &amp;= \text{he},\ \text{ll},\ \text{o} \\ \mathbf{x_2} &amp;= \text{h},\ \text{el},\ \text{lo} \\ \mathbf{x_3} &amp;= \text{he},\ \text{l},\ \text{lo} \\ \mathbf{x_4} &amp;= \text{hell},\ \text{o} \end{aligned}$$</p><p>All of these are valid (and they are not the only ones), but <strong>BPE is greedy and deterministic</strong>: it replays its learned merges in a fixed order, so it always returns exactly one segmentation. For instance, if <code>(h, e)</code> and <code>(l, o)</code> were learned before <code>(l, l)</code>, the result is:<br /><strong>Output</strong> (deterministic):</p>
<pre><code class="lang-plaintext">he, l, lo
</code></pre>
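<p>To make the contrast concrete, the sketch below enumerates every valid segmentation of <code>"hello"</code> for the vocabulary above, then replays a merge list the way BPE would. The function names and the merge order <code>[("h", "e"), ("l", "o")]</code> are assumptions chosen for illustration.</p>

```python
VOCAB = {"h", "e", "l", "o", "he", "el", "lo", "ll", "hell"}


def segmentations(word, vocab):
    """Return every way to split `word` into pieces that are all in `vocab`."""
    if not word:
        return [()]
    results = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in vocab:
            for rest in segmentations(word[end:], vocab):
                results.append((piece,) + rest)
    return results


def apply_merges(word, merges):
    """Replay BPE merges in their learned order; the result is deterministic."""
    symbols = list(word)
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return tuple(symbols)


all_splits = segmentations("hello", VOCAB)
bpe_split = apply_merges("hello", [("h", "e"), ("l", "o")])
print(len(all_splits))
print(bpe_split)
```

<p>The enumeration finds nine valid splits in total, including the four listed above, while the merge replay always lands on the same one.</p>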
<hr />
<h2 id="heading-greedy-vs-probabilistic">Greedy vs. Probabilistic</h2>
<p>SentencePiece can go beyond greedy selection. In <strong>probabilistic mode</strong>, it samples among possible subword segmentations.</p>
<h3 id="heading-greedy">Greedy:</h3>
<ul>
<li><p>Always returns the same subword split.</p>
</li>
<li><p>Fast, reproducible.</p>
</li>
<li><p>Good for inference.</p>
</li>
</ul>
<h3 id="heading-probabilistic">Probabilistic:</h3>
<ul>
<li><p>Returns different valid segmentations each time.</p>
</li>
<li><p>Useful for <strong>data augmentation</strong> during training.</p>
</li>
<li><p>Enabled via <strong>BPE-Dropout</strong> or <strong>Unigram LM sampling</strong>.</p>
</li>
</ul>
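<p>The two modes can be compared on the <code>"hello"</code> example. In this sketch the probabilities in <code>PROBS</code> are invented purely for illustration, and sampling is done by brute-force enumeration, which is only practical for very short words. It is not how SentencePiece implements sampling internally.</p>

```python
import math
import random

# Hypothetical unigram probabilities, chosen only for illustration (they sum to 1).
PROBS = {"h": 0.1, "e": 0.1, "l": 0.15, "o": 0.1, "he": 0.15,
         "el": 0.05, "lo": 0.1, "ll": 0.1, "hell": 0.15}


def segmentations(word, vocab):
    """Return every way to split `word` into pieces that are all in `vocab`."""
    if not word:
        return [()]
    return [
        (word[:end],) + rest
        for end in range(1, len(word) + 1)
        if word[:end] in vocab
        for rest in segmentations(word[end:], vocab)
    ]


def seg_prob(seg, probs):
    """Unigram probability of a segmentation: product of its piece probabilities."""
    return math.prod(probs[piece] for piece in seg)


def best_segmentation(word, probs):
    """Deterministic mode: always return the most probable split."""
    return max(segmentations(word, probs), key=lambda seg: seg_prob(seg, probs))


def sample_segmentation(word, probs, rng):
    """Probabilistic mode: draw a split with probability proportional to P(x)."""
    candidates = segmentations(word, probs)
    weights = [seg_prob(seg, probs) for seg in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(0)
print(best_segmentation("hello", PROBS))
print([sample_segmentation("hello", PROBS, rng) for _ in range(3)])
```

<p>During training, feeding the model different splits of the same word acts as a regularizer; at inference time the deterministic split keeps results reproducible.</p>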
<hr />
<h2 id="heading-probabilistic-tokenization-objective">Probabilistic Tokenization Objective</h2>
<p>The goal is to find the <strong>best segmentation</strong> <code>x*</code> from all possible segmentations <code>{ x1, x2, ..., xk }</code> that maximizes the likelihood of generating the observed word <code>X</code>:</p>
<p>$$\mathbf{x^*} = \arg\max_{\mathbf{x}} \Pr(\mathbf{x} \mid X)$$</p><p><strong>Where:</strong></p>
<ul>
<li><p><code>X</code>: the word (i.e., a sequence of characters)</p>
</li>
<li><p><code>x</code>: a possible segmentation of <code>X</code> into subwords</p>
</li>
</ul>
<p>The <strong>tokenizer</strong> treats the segmentation as a hidden (latent) variable, much like the hidden state path in a <strong>Hidden Markov Model (HMM)</strong>: the observed word <code>X</code> can correspond to many possible hidden tokenization paths.</p>
<p><em>SentencePiece’s</em> <strong><em>Unigram Language Model</em></strong> <em>is particularly powerful in capturing this uncertainty during training.</em></p>
<h3 id="heading-explanation">Explanation:</h3>
<p>Let <code>x</code> denote a <strong>subword sequence</strong> of length <code>n</code>:</p>
<p>$$x = (x_1, x_2, \ldots, x_n)$$</p><p>The probability of this subword sequence under a <strong>Unigram Language Model</strong> is computed as the <strong>product of individual subword probabilities</strong>:</p>
<p>$$P(x) = \prod_{i=1}^{n} P(x_i)$$</p><p>Here each factor is the probability of a single subword, and the subword probabilities must sum to one over the vocabulary <code>V</code>:</p>
<p>$$\sum_{x \in V} P(x) = 1$$</p><h3 id="heading-objective-find-the-most-likely-segmentation">Objective: Find the Most Likely Segmentation</h3>
<p>Given an input sequence <code>X</code>, the objective is to find the <strong>best subword segmentation</strong> <code>x*</code> from all possible segmentations <code>S(X)</code> that <strong>maximizes the sequence probability</strong>:</p>
<p>$$x^* = \arg\max_{x \in S(X)} P(x)$$</p><p>This is a classic <strong>maximum likelihood problem</strong>. To solve it efficiently, <strong>SentencePiece uses Viterbi decoding</strong> to select the best possible path from the segmentation space.</p>
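<p>A minimal Viterbi sketch for this objective is shown below. It works in log space and fills a table of the best score for each prefix of the word. The probabilities in <code>PROBS</code> are hypothetical values for the <code>"hello"</code> vocabulary, and the function name is an assumption rather than SentencePiece's actual API.</p>

```python
import math

# Hypothetical unigram probabilities, chosen only for illustration.
PROBS = {"h": 0.1, "e": 0.1, "l": 0.15, "o": 0.1, "he": 0.15,
         "el": 0.05, "lo": 0.1, "ll": 0.1, "hell": 0.15}


def viterbi_segment(word, probs):
    """Return (best segmentation, its log-probability) under a unigram model."""
    n = len(word)
    best_logp = [-math.inf] * (n + 1)  # best log-probability of word[:i]
    best_start = [0] * (n + 1)         # start index of the last piece on that path
    best_logp[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs and best_logp[start] > -math.inf:
                logp = best_logp[start] + math.log(probs[piece])
                if logp > best_logp[end]:
                    best_logp[end] = logp
                    best_start[end] = start
    if best_logp[n] == -math.inf:
        raise ValueError(f"{word!r} cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the word to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best_start[end]
        pieces.append(word[start:end])
        end = start
    return tuple(reversed(pieces)), best_logp[n]


segmentation, logp = viterbi_segment("hello", PROBS)
print(segmentation, math.exp(logp))
```

<p>The table has one entry per prefix and each entry looks back at every earlier position, so the cost grows quadratically with word length instead of exponentially with the number of segmentations.</p>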
<hr />
<h3 id="heading-dataset-wide-likelihood-function">Dataset-Wide Likelihood Function</h3>
<p>To train the model, we want to maximize the <strong>log-likelihood</strong> over an entire dataset <code>D</code>. That is:</p>
<p>$$L = \sum_{s=1}^{|D|} \log P(X_s)$$</p><p>But since each <code>Xₛ</code> has many valid subword segmentations, we <strong>marginalize over all of them</strong>:</p>
<p>$$L = \sum_{s=1}^{|D|} \log \left( \sum_{x \in S(X_s)} P(x) \right)$$</p><p>This introduces <strong>hidden variables</strong>—the true subword paths that generated each observed word are unknown.</p>
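<p>The inner sum over all segmentations does not need explicit enumeration: a forward pass over prefixes accumulates it. The sketch below assumes hypothetical probabilities (<code>PROBS</code>) and illustrative function names, and weights each word by its frequency instead of iterating over every occurrence.</p>

```python
import math

# Hypothetical unigram probabilities, chosen only for illustration.
PROBS = {"h": 0.1, "e": 0.1, "l": 0.15, "o": 0.1, "he": 0.15,
         "el": 0.05, "lo": 0.1, "ll": 0.1, "hell": 0.15}


def marginal_probability(word, probs):
    """Sum P(x) over every segmentation x of `word` (forward algorithm)."""
    n = len(word)
    forward = [0.0] * (n + 1)  # forward[i] = total probability of all splits of word[:i]
    forward[0] = 1.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs:
                forward[end] += forward[start] * probs[piece]
    return forward[n]


def log_likelihood(word_freqs, probs):
    """Dataset log-likelihood, with each word weighted by its frequency."""
    return sum(
        freq * math.log(marginal_probability(word, probs))
        for word, freq in word_freqs.items()
    )


print(marginal_probability("hello", PROBS))
print(log_likelihood({"hello": 2}, PROBS))
```

<p>This is the same table as in Viterbi decoding, with the maximum replaced by a sum.</p>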
<hr />
<h3 id="heading-em-algorithm-to-the-rescue">EM Algorithm to the Rescue</h3>
<p>Since we’re working with hidden subword paths, <strong>SentencePiece applies the Expectation-Maximization (EM) algorithm</strong>:</p>
<ul>
<li><p><strong>E-Step</strong>: Estimate the contribution (responsibility) of each possible subword segmentation.</p>
</li>
<li><p><strong>M-Step</strong>: Update the model to better fit the expected probabilities.</p>
</li>
</ul>
<p>This iterative process continues until the model converges, leading to highly optimized subword vocabularies.</p>
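<p>One EM iteration can be sketched by brute force for a tiny vocabulary. Everything here is an assumption made for illustration: the toy corpus, the uniform starting probabilities, and the function names. Real implementations compute the expected counts with a forward-backward pass instead of enumerating segmentations.</p>

```python
import math
from collections import defaultdict


def segmentations(word, vocab):
    """Return every way to split `word` into pieces that are all in `vocab`."""
    if not word:
        return [()]
    return [
        (word[:end],) + rest
        for end in range(1, len(word) + 1)
        if word[:end] in vocab
        for rest in segmentations(word[end:], vocab)
    ]


def seg_prob(seg, probs):
    return math.prod(probs[piece] for piece in seg)


def em_step(word_freqs, probs):
    """One EM iteration for a unigram subword model."""
    expected = defaultdict(float)
    for word, freq in word_freqs.items():
        candidates = segmentations(word, probs)
        weights = [seg_prob(seg, probs) for seg in candidates]
        total = sum(weights)
        # E-step: each segmentation's responsibility is its posterior probability.
        for seg, weight in zip(candidates, weights):
            for piece in seg:
                expected[piece] += freq * weight / total
    # M-step: renormalize the expected counts into new probabilities.
    norm = sum(expected.values())
    return {piece: count / norm for piece, count in expected.items()}


def log_likelihood(word_freqs, probs):
    return sum(
        freq * math.log(sum(seg_prob(s, probs) for s in segmentations(word, probs)))
        for word, freq in word_freqs.items()
    )


pieces = ["h", "e", "l", "o", "he", "el", "lo", "ll", "hell"]
probs = {piece: 1 / len(pieces) for piece in pieces}
toy_corpus = {"hello": 5, "hell": 2, "lo": 3}  # hypothetical word counts
new_probs = em_step(toy_corpus, probs)
print(log_likelihood(toy_corpus, probs), log_likelihood(toy_corpus, new_probs))
```

<p>Each iteration can only keep the log-likelihood the same or raise it, which is what makes repeating the two steps a sensible training procedure.</p>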
<h3 id="heading-example-1">Example</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Word</td><td>Frequency</td></tr>
</thead>
<tbody>
<tr>
<td>'knowing'</td><td>3</td></tr>
<tr>
<td>'the'</td><td>1</td></tr>
<tr>
<td>'name'</td><td>1</td></tr>
<tr>
<td>'of'</td><td>1</td></tr>
<tr>
<td>'something'</td><td>2</td></tr>
<tr>
<td>'is'</td><td>1</td></tr>
<tr>
<td>'different'</td><td>1</td></tr>
<tr>
<td>'from'</td><td>1</td></tr>
<tr>
<td>'something.'</td><td>1</td></tr>
<tr>
<td>'about'</td><td>1</td></tr>
<tr>
<td>'everything'</td><td>1</td></tr>
<tr>
<td>"isn't"</td><td>1</td></tr>
<tr>
<td>'bad'</td><td>1</td></tr>
</tbody>
</table>
</div><p>Let’s say we want to tokenize the word:</p>
<p>$$X = \text{"knowing"}$$</p><p>Now, suppose the model provides the following segmentation candidates:</p>
<p>$$S(X) = \left\{ \{ \text{"k"}, \text{"now"}, \text{"ing"} \},\ \{ \text{"know"}, \text{"ing"} \},\ \{ \text{"knowing"} \} \right\}$$</p><p>These are three different ways the word "knowing" could be split into subwords.</p>
<h3 id="heading-segment-probabilities-from-vocabulary">Segment Probabilities from Vocabulary</h3>
<p>Suppose the model has learned a probability for each token in its vocabulary (the values below are illustrative). The probability of a full segmentation is the product of its subwords' probabilities:</p>
<h4 id="heading-option-1">Option 1:</h4>
<p>$$x_1 = \{ \text{"k"}, \text{"now"}, \text{"ing"} \}$$</p><p>$$P(x_1) = P(\text{"k"}) \times P(\text{"now"}) \times P(\text{"ing"}) = \frac{3}{16} \times \frac{3}{16} \times \frac{7}{16} = \frac{63}{4096}$$</p><p><strong>Option 2:</strong></p>
<p>$$x_2 = \{ \text{"know"}, \text{"ing"} \}$$</p><p>$$P(x_2) = P(\text{"know"}) \times P(\text{"ing"}) = \frac{3}{16} \times \frac{7}{16} = \frac{21}{256} = \frac{336}{4096}$$</p><h4 id="heading-option-3">Option 3:</h4>
<p>$$x_3 = \{ \text{"knowing"} \}$$</p><p>$$P(x_3) = P(\text{"knowing"}) = \frac{768}{4096}$$</p><h3 id="heading-best-segmentation-maximum-likelihood">Best Segmentation (Maximum Likelihood)</h3>
<p>Comparing the three probabilities:</p>
<ul>
<li><p><code>x₁</code>: 63 / 4096</p>
</li>
<li><p><code>x₂</code>: 336 / 4096</p>
</li>
<li><p><code>x₃</code>: <strong>768 / 4096</strong> ← <em>highest</em></p>
</li>
</ul>
<p>Thus, the model selects:</p>
<p>$$x^* = \arg\max_{x \in S(X)} P(x) = x_3$$</p><p>Which means SentencePiece will tokenize <code>"knowing"</code> as a single subword <code>"knowing"</code> based on the learned probabilities.</p>
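<p>The arithmetic can be checked exactly with Python's <code>fractions</code> module. The token probabilities below are the illustrative values from this example; <code>P("know") = 3/16</code> is inferred from the 336/4096 figure in the comparison list.</p>

```python
from fractions import Fraction
from math import prod

# Illustrative token probabilities taken from the worked example.
P = {
    "k": Fraction(3, 16),
    "now": Fraction(3, 16),
    "know": Fraction(3, 16),  # inferred from 336/4096 = P("know") * 7/16
    "ing": Fraction(7, 16),
    "knowing": Fraction(3, 16),
}

candidates = [("k", "now", "ing"), ("know", "ing"), ("knowing",)]

# Probability of a segmentation = product of its token probabilities.
seg_probs = {seg: prod(P[token] for token in seg) for seg in candidates}
best = max(seg_probs, key=seg_probs.get)

for seg, p in seg_probs.items():
    print(seg, p)
print("best:", best)
```

<p>Fractions are printed in lowest terms, so 336/4096 appears as 21/256 and 768/4096 as 3/16.</p>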
<h3 id="heading-algorithm">Algorithm</h3>
<p>Training uses <strong>Expectation-Maximization (EM)</strong> steps, combined with a vocabulary pruning step that iteratively removes the subwords contributing least to the model likelihood.</p>
<hr />
<h3 id="heading-step-by-step-algorithm">Step-by-Step Algorithm</h3>
<ol>
<li><p><strong>Initialize Vocabulary</strong><br /> Construct a reasonably large seed vocabulary using methods like <strong>Byte-Pair Encoding (BPE)</strong> or an <strong>Enhanced Suffix Array</strong>, and give each subword an initial probability based on its frequency. This gives you an overcomplete set of subwords to start with.</p>
</li>
<li><p><strong>E-Step (Expectation Step)</strong><br /> With the current subword probabilities fixed, segment the corpus and count how often each subword is used.<br /> The <strong>Viterbi algorithm</strong> gives the single most likely segmentation of each word, which is a common shortcut; the full E-step weights every possible segmentation by its probability.</p>
</li>
<li><p><strong>M-Step (Maximization Step)</strong><br /> Re-estimate the <strong>probability of each subword/token</strong> from those counts.<br /> This is the update that <strong>maximizes the (log) likelihood</strong> given the current segmentations.</p>
</li>
<li><p><strong>Compute Likelihoods</strong><br /> For each subword, compute its <strong>contribution to the total corpus likelihood</strong> based on how often it appears in the optimal segmentations.</p>
</li>
<li><p><strong>Prune Vocabulary</strong><br /> Identify and remove the <strong>x% of subwords</strong> with the <strong>smallest likelihood contribution</strong>, always keeping single characters so that every word can still be segmented.<br /> This reduces the vocabulary size gradually while preserving the most informative subwords.</p>
</li>
<li><p><strong>Repeat</strong><br /> Go back to <strong>Step 2</strong> and repeat the process until the vocabulary size reaches the desired target.</p>
</li>
</ol>
<h3 id="heading-pseudocode">Pseudocode</h3>
<pre><code class="lang-python"><span class="hljs-comment"># Pseudocode: the helper functions are placeholders, not a real library API.</span>
vocab = build_seed_vocab(corpus, method=<span class="hljs-string">"BPE"</span>, size=seed_vocab_size)
probabilities = initial_token_probabilities(vocab, corpus)
<span class="hljs-keyword">while</span> len(vocab) &gt; target_vocab_size:
    <span class="hljs-comment"># E-step: segment the corpus under the current probabilities</span>
    segmentations = viterbi_segment(corpus, probabilities)

    <span class="hljs-comment"># M-step: re-estimate token probabilities from the segmentations</span>
    probabilities = estimate_token_probabilities(vocab, segmentations)

    <span class="hljs-comment"># Likelihood contribution of each subword</span>
    likelihoods = compute_likelihoods(segmentations, probabilities)

    <span class="hljs-comment"># Prune the least useful subwords (single characters are kept)</span>
    vocab = prune_vocab(vocab, likelihoods, percent_to_remove)
    probabilities = renormalize(probabilities, vocab)
</code></pre>
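<p>For readers who want something that actually runs, here is a toy version of the loop. It is a sketch under several simplifying assumptions, not SentencePiece's implementation: the seed vocabulary is every substring up to four characters, the E-step uses Viterbi only (hard EM), a subword's usage count stands in for its likelihood contribution, and add-one smoothing keeps every probability non-zero. All names and default values are made up for the example.</p>

```python
import math
from collections import Counter

WORD_FREQS = {
    "knowing": 3, "the": 1, "name": 1, "of": 1, "something": 2,
    "is": 1, "different": 1, "from": 1, "something.": 1, "about": 1,
    "everything": 1, "isn't": 1, "bad": 1,
}


def viterbi_segment(word, probs):
    """Most probable split of `word`; assumes every character is in `probs`."""
    n = len(word)
    best = [(-math.inf, 0)] * (n + 1)  # (log-prob of word[:i], start of last piece)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]


def train_unigram(word_freqs, target_size, max_piece_len=4, prune_fraction=0.2):
    chars = {ch for word in word_freqs for ch in word}
    # Seed vocabulary: all substrings up to max_piece_len, weighted by word frequency.
    counts = Counter()
    for word, freq in word_freqs.items():
        for i in range(len(word)):
            for j in range(i + 1, min(len(word), i + max_piece_len) + 1):
                counts[word[i:j]] += freq
    while len(counts) > target_size:
        total = sum(counts.values())
        probs = {piece: c / total for piece, c in counts.items()}
        # Hard E-step: count how often each piece is used in the best segmentations.
        usage = Counter()
        for word, freq in word_freqs.items():
            for piece in viterbi_segment(word, probs):
                usage[piece] += freq
        # Prune the least-used multi-character pieces; single characters always stay.
        prunable = sorted((p for p in counts if p not in chars), key=lambda p: usage[p])
        if not prunable:
            break
        n_remove = min(max(1, int(len(prunable) * prune_fraction)),
                       len(counts) - target_size)
        for piece in prunable[:n_remove]:
            del counts[piece]
        # M-step: new counts from usage, with add-one smoothing.
        counts = Counter({piece: usage[piece] + 1 for piece in counts})
    return dict(counts)


vocab = train_unigram(WORD_FREQS, target_size=30)
print(sorted(vocab, key=len, reverse=True)[:5])
```

<p>Keeping single characters out of the pruning step is what guarantees that any word built from seen characters can still be tokenized after the vocabulary has shrunk.</p>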
<h3 id="heading-tokenization-techniques-comparison-table">Tokenization Techniques Comparison Table</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Tokenizer Type</td><td>Used In Models</td><td>Pros</td><td>Cons</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Word-level</strong></td><td>None in modern LLMs</td><td>- Simple and intuitive<br />- Easy to implement</td><td>- Cannot handle OOV words<br />- Large vocab size</td></tr>
<tr>
<td><strong>Character-level</strong></td><td>Some RNNs, early models</td><td>- Tiny vocab<br />- No OOV<br />- Handles misspellings</td><td>- Very long sequences<br />- Hard to learn meaning</td></tr>
<tr>
<td><strong>BPE</strong></td><td>GPT-2, GPT-3, RoBERTa, LLaMA</td><td>- Efficient<br />- Reuses frequent subwords<br />- No <code>[UNK]</code> token with byte-level variants</td><td>- Greedy merges<br />- Not probabilistic</td></tr>
<tr>
<td><strong>WordPiece</strong></td><td>BERT, DistilBERT</td><td>- Good balance of vocab vs coverage<br />- Handles rare words well</td><td>- Slightly slower due to scoring<br />- Needs pretraining</td></tr>
<tr>
<td><strong>SentencePiece</strong></td><td>T5, ALBERT, mT5</td><td>- No whitespace dependence<br />- Supports multilingual corpora<br />- Probabilistic (with the Unigram model)</td><td>- Slightly complex training<br />- May split too aggressively</td></tr>
</tbody>
</table>
</div><p><strong>Quick Notes</strong>:</p>
<ul>
<li><p>Subword tokenizers like <strong>BPE</strong>, <strong>WordPiece</strong>, and <strong>SentencePiece</strong> are preferred in <strong>modern Transformers</strong> due to their balance of efficiency and generalization.</p>
</li>
<li><p>Pure <strong>word-level</strong> and <strong>character-level</strong> tokenization are mostly outdated, though they still have niche applications.</p>
</li>
</ul>
]]></content:encoded></item></channel></rss>