Anand Sukumaran

Startup Founder, Software Engineer, Abstract thinker
Co-founder & CTO @ Engagespot (Techstars NYC '24)

Understanding How LLMs Count Letters - A Deep Dive into Tokenization and Its Implications

May 06, 2025

Large Language Models (LLMs) like GPT-4 and Gemini have amazed us with their ability to generate coherent text, answer questions, and even solve complex problems. However, when it comes to simple tasks like counting letters in a word, these models sometimes stumble. In this post, we’ll explore the phenomenon of letter counting in LLMs, share experimental observations, and explain the underlying mechanisms in both layman terms and technical detail.


Table of Contents

  1. Introduction
  2. Background: How LLMs Process Text
  3. The Letter Counting Challenge
  4. Experimental Observations
  5. Discussion: What Do These Results Tell Us?
  6. Conclusion

Introduction

Imagine asking a language model to count the number of letters in a word. For a child, this is trivial—they simply look at the word and count each character one by one. But for an LLM, the task is not as straightforward. Despite their high-level language capabilities, these models sometimes produce incorrect counts for tasks as simple as “count the r’s in strawberry.”

This blog post documents a series of experiments and discusses why these errors occur, drawing parallels with how humans use multiple sensory inputs to verify what they produce.


Background: How LLMs Process Text

Tokenization: The Building Blocks of LLMs

LLMs do not process text as a continuous stream of characters. Instead, they break text down into tokens—units that can represent whole words, parts of words (subwords), or even punctuation. Tokenization is essential because it keeps the vocabulary at a manageable size and lets the model handle rare or unseen words by composing them from familiar subword pieces.

For example, the word “strawberry” is often seen as a single token in many LLMs because it appears frequently. However, if you modify it to “strawcherryk,” the tokenizer may split it into subword tokens like “straw,” “cherry,” and “k.”
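
To make this concrete, here is a minimal sketch using OpenAI’s open-source tiktoken package (my choice of tool, not something the models expose; any BPE tokenizer would illustrate the same point). The exact splits depend on the encoding you load and may not match the examples above, but the pattern holds: familiar words map to few tokens, while unusual spellings shatter into more pieces.

```python
# Minimal sketch with tiktoken (pip install tiktoken).
# "cl100k_base" is the GPT-4-era encoding; other models split differently.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["strawberry", "strawcherryk"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{word!r} -> {len(token_ids)} token(s): {pieces}")
```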

Subword vs. Character-Level Processing

Because most LLMs use subword tokenization, they often lose fine-grained, character-level information—making tasks like counting letters challenging.
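
One quick way to see the gap: counting letters is trivial for any program with character-level access to the string, but the model never receives those characters, only integer token IDs. (Again a sketch using tiktoken; the specific IDs vary by encoding.)

```python
# What ordinary code sees vs. what an LLM sees.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
word = "strawberry"

print(word.count("r"))   # 3 -- trivial with character-level access
print(enc.encode(word))  # the model receives opaque integer IDs, not letters
```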


The Letter Counting Challenge

When you ask an LLM, “Count the r’s in strawberry,” the model may have seen the token “strawberry” frequently during training and, as a result, “knows” its internal properties (such as the number of r’s). However, if you alter the word so that it is split into multiple tokens (e.g., “strayberi”), the model has to reassemble its understanding from sub-components. The attention mechanism and learned token associations were never trained for this kind of character-level bookkeeping, so errors creep in.
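
To picture that reassembly step, suppose the tokenizer splits “strayberi” into a few subword pieces (the split below is hypothetical; the real one depends on the model’s tokenizer). A correct answer then requires counting inside each piece and summing across pieces, exactly the kind of character-level bookkeeping the model’s learned associations handle poorly.

```python
# Hypothetical split of "strayberi" -- the actual pieces depend on the tokenizer.
pieces = ["str", "ay", "beri"]

# A correct count needs per-piece counting plus aggregation across pieces.
per_piece = {p: p.count("r") for p in pieces}
print(per_piece)                # {'str': 1, 'ay': 0, 'beri': 1}
print(sum(per_piece.values()))  # 2
```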


Experimental Observations

Below is a summary of several experiments conducted to test how different models handle letter counting tasks.

Experiment 1: Counting Characters in “strawcherryk”

Prompt:
Give me the characters in strawcherryk as comma separated list. Just one word answer. No explanation. No google search.


| Model | Output (Text Input) | Visual Input (Screenshot) | Observations |
| --- | --- | --- | --- |
| Gemini | “Straw,Cherry,K” (worked correctly, showing S,t,r,a,w,c,h,e,r,r,y) | Correct | Gemini works correctly with the visual input. |
| GPT4o | Correct for “strawberry” but error when cherry has 3 r’s (returns 4 r’s) | Correct | GPT4o’s accuracy depends on the consistency of token splits. |

Experiment 2: Counting ‘b’s in “bodbysby”

Prompt:
How many b’s are there in bodbysby? Just give me one word answer. No explanations.


| Model | Output (Text Input) | Visual Input (Screenshot) | Observations |
| --- | --- | --- | --- |
| Gemini | Fails (answers 2) | Correct (answers 3) | Visual input corrects the error seen in text input. |
| GPT4o | Fails (answers 4) | Correct (answers 3) | Both models improve with image input, suggesting better extraction of characters. |

Experiment 3: Counting ‘r’s in “strayberi”

Prompt:
How many r in strayberi? Just give me one word answer. No explanation.


| Model | Output Without Instruction | Output With Additional Instruction (“Count the word after stray too”) | Observations |
| --- | --- | --- | --- |
| Gemini | Fails (zero) | Correct (two) | Indicates reliance on token splitting for a correct count. |
| GPT4o | Fails (one) | Correct (two) | Same behavior; the explicit instruction helps reallocate attention. |

Experiment 4: Counting ‘r’s in “s t r a w b e r i”

Prompt:
How many r’s in s t r a w b e r i? Just give me one word answer. No explanation.


| Model | Output (Direct) | Output (Rephrased: “How many of these letters are r’s? s t r a w b e r i”) | Observations |
| --- | --- | --- | --- |
| Gemini | Inconsistent (sometimes 1, sometimes correct) | Correct | Variability suggests sensitivity to how the prompt is phrased. |
| GPT4o | Fails (answers 3, likely due to overfitting the “strawberry” association) | Correct | Removing associations (e.g., specifying “not strawberry”) improves results. |
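
For anyone who wants to re-run the text-input prompts, the sketch below sends Experiment 2’s prompt through the official OpenAI Python SDK. It assumes the openai package (v1+) is installed, an OPENAI_API_KEY is set in the environment, and that “gpt-4o” is an available model name; the answers you get will of course vary over time.

```python
# Re-running Experiment 2's text-input prompt via the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = ("How many b's are there in bodbysby? "
          "Just give me one word answer. No explanations.")

response = client.chat.completions.create(
    model="gpt-4o",  # assumed model identifier; substitute whichever you have access to
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)  # the correct answer is 3; the model's may differ
```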

Discussion: What Do These Results Tell Us?

Token Associations and Learned Biases

The models appear to answer from properties memorized for frequent tokens rather than by inspecting characters. When a familiar spelling such as “strawberry” is altered, the learned association can still dominate the answer: GPT4o replying “3” for “s t r a w b e r i” is a case in point.

Visual Input vs. Text Input

Across all four experiments, a screenshot of the same word produced more accurate counts than raw text. Reading the word from an image effectively extracts it character by character, preserving the fine-grained information that subword tokenization discards.

Analogy: Human Feedback and Cross-Checking

Consider how humans write and proofread their work: we see the words on the page and hear them in our head, and these parallel feedback channels catch mistakes that a single channel would miss.

In essence, just as our brains benefit from multiple sensory inputs (e.g., hearing our own voice, seeing our written words), multimodal models that incorporate visual input can overcome some limitations inherent in text-only tokenization.


Conclusion

The experiments discussed in this post highlight a fundamental challenge in how LLMs process text: the reliance on subword tokenization. When a word is seen frequently (like “strawberry”), the model learns its properties as a single token. However, when the word is altered and split into subword tokens, the fine-grained character information is lost, leading to errors in tasks like letter counting.

Moreover, our observation that image inputs yield more accurate results reinforces the idea that the process of converting visual text (via OCR) preserves character-level details that are otherwise obscured by subword tokenization. This finding draws an interesting parallel with human cognition, where multiple sensory inputs (such as visual feedback) help us verify and correct our outputs.

Understanding these limitations not only helps us interpret LLM behavior better but also points toward potential improvements—such as incorporating multimodal inputs or designing tokenization strategies that better preserve character-level information.
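
As a toy illustration of that last point, a character-level (or byte-level) vocabulary preserves exactly the information subword tokenization throws away, at the cost of much longer sequences; this is only a sketch, not a description of any production tokenizer.

```python
# Toy character-level "tokenizer": every character is its own token,
# so counting letters reduces to counting tokens. The trade-off is
# far longer sequences than a subword vocabulary would produce.
def char_tokenize(text: str) -> list[str]:
    return list(text)

tokens = char_tokenize("strawberry")
print(len(tokens))                          # 10 tokens for a 10-character word
print(sum(1 for t in tokens if t == "r"))   # 3
```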



By understanding the interplay between tokenization, attention, and multimodal processing, we can better appreciate both the strengths and limitations of today’s LLMs. This knowledge not only demystifies seemingly simple tasks like counting letters but also paves the way for future innovations in AI language processing.
