Anand Sukumaran
Startup Founder, Software Engineer, Abstract thinker
Co-founder & CTO @ Engagespot (Techstars NYC '24)
Understanding How LLMs Count Letters - A Deep Dive into Tokenization and Its Implications
May 06, 2025

Large Language Models (LLMs) like GPT-4 and Gemini have amazed us with their ability to generate coherent text, answer questions, and even solve complex problems. However, when it comes to simple tasks like counting letters in a word, these models sometimes stumble. In this post, we’ll explore the phenomenon of letter counting in LLMs, share experimental observations, and explain the underlying mechanisms in both layman terms and technical detail.
Table of Contents
- Introduction
- Background: How LLMs Process Text
- The Letter Counting Challenge
- Experimental Observations
- Discussion: What Do These Results Tell Us?
- Conclusion
- References & Further Reading
Introduction
Imagine asking a language model to count the number of letters in a word. For a child, this is trivial—they simply look at the word and count each character one by one. But for an LLM, the task is not as straightforward. Despite their high-level language capabilities, these models sometimes produce incorrect counts for tasks like “count the r’s in strawberry”.
This blog post documents a series of experiments and discusses why these errors occur, drawing parallels with how humans use multiple sensory inputs to verify what they produce.
Background: How LLMs Process Text
Tokenization: The Building Blocks of LLMs
LLMs do not process text as a continuous stream of characters. Instead, they break text down into tokens—units that can represent whole words, parts of words (subwords), or even punctuation. The tokenization process is essential because:
- Efficiency: Tokens reduce the complexity of predictions by limiting the number of possible outputs.
- Training: Models are trained on these tokens, learning statistical associations based on their frequency in the training data.
For example, the word “strawberry” is often seen as a single token in many LLMs because it appears frequently. However, if you modify it to “strawcherryk,” the tokenizer may split it into subword tokens like “straw,” “cherry,” and “k.”
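To see this concretely, the short sketch below uses the open-source `tiktoken` library to print how a BPE tokenizer splits these words. This is an assumption on my part: `tiktoken` is a stand-in here, not necessarily the tokenizer behind any particular model mentioned in this post, and the exact pieces vary between tokenizers, so treat the output as illustrative.

```python
# Illustrative only: inspect how a BPE tokenizer splits words into pieces.
# Requires the open-source `tiktoken` package; exact splits vary by encoding and model.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a common BPE encoding, used as a stand-in

for word in ["strawberry", "strawcherryk", "strayberi"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{word!r} -> {pieces}")
```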
Subword vs. Character-Level Processing
- Subword Tokenization: Groups letters together based on common patterns. This is efficient but means the model “sees” a word as a whole token rather than as individual characters.
- Character-Level Processing: Would allow the model to see every single character (e.g., “s”, “t”, “r”, “a”, …), but it is rarely used because it would dramatically increase the sequence length and computational cost.
Because most LLMs use subword tokenization, they often lose fine-grained, character-level information—making tasks like counting letters challenging.
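For contrast, once text is available as an explicit character sequence, counting a letter is trivial. The snippet below is plain Python, not anything an LLM does internally; it only shows how easy the task becomes at character granularity.

```python
# Once the input is an explicit character sequence, counting letters is a one-liner.
word = "strawcherryk"
print(list(word))        # ['s', 't', 'r', 'a', 'w', 'c', 'h', 'e', 'r', 'r', 'y', 'k']
print(word.count("r"))   # 3
```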
The Letter Counting Challenge
When you ask an LLM, “Count the r’s in strawberry,” the model may have seen the token “strawberry” frequently during training and, as a result, “knows” its internal properties (such as the number of r’s). However, if you alter the word so that it’s split into multiple tokens (e.g., “strayberi”), the model has to reassemble its understanding from sub-components. Often, the attention mechanism and learned associations are not designed to handle these character-level tasks, leading to errors.
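A rough way to picture the reassembly problem: a correct count has to aggregate evidence from every token piece, an explicit step the model itself never runs. The sketch below (again using `tiktoken` as a stand-in tokenizer, an assumption of mine) makes that aggregation explicit.

```python
# Illustrative only: a correct count must combine evidence from every token piece,
# an explicit aggregation step that the model itself never performs.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
word, letter = "strayberi", "r"

pieces = [enc.decode([t]) for t in enc.encode(word)]
per_piece = [piece.count(letter) for piece in pieces]
print(pieces, per_piece, "total:", sum(per_piece))
```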
Experimental Observations
Below is a summary of several experiments conducted to test how different models handle letter counting tasks.
Experiment 1: Counting Characters in “strawcherryk”
Prompt:
Give me the characters in strawcherryk as comma separated list. Just one word answer. No explanation. No google search.
| Model | Output (Text Input) | Visual Input (Screenshot) | Observations |
|---|---|---|---|
| Gemini | “Straw,Cherry,K” (worked correctly, showing S, t, r, a, w, c, h, e, r, r, y) | Correct | Gemini works correctly with the visual input. |
| GPT4o | Correct for “strawberry”, but for “strawcherryk” (which has 3 r’s) it returns 4 r’s | Correct | GPT4o’s accuracy depends on the consistency of token splits. |
Experiment 2: Counting ‘b’s in “bodbysby”
Prompt:
How many b’s are there in bodbysby? Just give me one word answer. No explanations.
| Model | Output (Text Input) | Visual Input (Screenshot) | Observations |
|---|---|---|---|
| Gemini | Fails (Answers 2) | Correct (Answers 3) | Visual input corrects the error seen in text input. |
| GPT4o | Fails (Answers 4) | Correct (Answers 3) | Both models improve with image input, suggesting better extraction of characters. |
Experiment 3: Counting ‘r’s in “strayberi”
Prompt:
How many r in strayberi? Just give me one word answer. No explanation.
| Model | Output Without Instruction | Output With Additional Instruction (“Count the word after stray too”) | Observations |
|---|---|---|---|
| Gemini | Fails (Zero) | Correct (Two) | Indicates reliance on token splitting for a correct count. |
| GPT4o | Fails (One) | Correct (Two) | Same behavior; the explicit instruction helps reallocate attention. |
Experiment 4: Counting ‘r’s in “s t r a w b e r i”
Prompt:
How many r’s in s t r a w b e r i? Just give me one word answer. No explanation.
| Model | Output (Direct) | Output (Rephrased: “How many of these letters are r’s? s t r a w b e r i”) | Observations |
|---|---|---|---|
| Gemini | Inconsistent (sometimes 1, sometimes correct) | Correct | Variability suggests sensitivity to how the prompt is phrased. |
| GPT4o | Fails (Answers 3, likely due to overfitting the “strawberry” association) | Correct | Removing associations (e.g., specifying “not strawberry”) improves results. |
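If you want to rerun the text-only prompts yourself, here is a minimal sketch using the official `openai` Python client. The model name, the zero temperature, and the assumption that an `OPENAI_API_KEY` environment variable is available are mine; outputs will vary across runs and model versions, so your numbers may not match the tables above.

```python
# Minimal sketch for re-running a text-only prompt; assumes the `openai` package
# (v1+) is installed and OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

prompt = "How many b's are there in bodbysby? Just give me one word answer. No explanations."

response = client.chat.completions.create(
    model="gpt-4o",                 # model under test; swap in others to compare
    messages=[{"role": "user", "content": prompt}],
    temperature=0,                  # reduce run-to-run variability
)
print(response.choices[0].message.content)
```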
Discussion: What Do These Results Tell Us?
Token Associations and Learned Biases
Holistic Tokens vs. Subword Splits:
When a common word like “strawberry” is seen as a single token during training, the model learns its properties (e.g., letter counts) holistically. When a word is altered (e.g., “strawcherryk” or “strayberi”), it gets split into parts that the model has to reassemble. If those parts aren’t consistently represented in training data, errors occur.

Attention Allocation:
The experiments (especially Experiment 3) indicate that the attention mechanism may focus primarily on one subword (e.g., “straw”) rather than combining information from both tokens. Explicit prompts that force the model to “count the word after stray too” reorient its attention, leading to correct answers.
Visual Input vs. Text Input
Image-Based Prompts:
When you provide a screenshot of the text, the model’s vision module processes the image using OCR-like techniques. OCR systems typically extract text at a much finer granularity—often preserving character-level details that are lost in subword tokenization. This means the text extracted from an image is closer to a character-level representation, leading to better performance on letter counting.

Bypassing the Tokenization Bottleneck:
The fact that image input generally results in correct answers suggests that the tokenization bottleneck (i.e., the loss of character-level detail) can be overcome when the model processes a visual representation of text.
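As a rough stand-in for what a vision pipeline can recover, the sketch below runs Tesseract OCR over a screenshot of the Experiment 2 prompt and counts letters in the extracted string. The `pytesseract` library and the image filename are assumptions of mine, and a multimodal model’s vision encoder is of course not literally an OCR engine; the point is only that OCR-style extraction hands downstream code the characters explicitly.

```python
# Illustrative only: OCR returns an ordinary character string, so every letter is
# explicit to whatever reads it next. Assumes `pytesseract`, Pillow, a local
# Tesseract install, and a screenshot file with the hypothetical name below.
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(Image.open("bodbysby_screenshot.png"))
print(text)

# Count the b's in the extracted word, character by character.
for token in text.split():
    if "bodbysby" in token.lower():
        print(token, "->", token.lower().count("b"))
```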
Analogy: Human Feedback and Cross-Checking
Consider how humans write and proofread their work:
- Dual Input: When you write, you not only generate language but also see your text on a screen or paper. This visual feedback allows you to verify and correct mistakes in real time.
- Error Correction: Without this external check, you might make errors—similarly, LLMs that process text purely through subword tokenization can miss fine details like individual letter counts.
In essence, just as our brains benefit from multiple sensory inputs (e.g., hearing our own voice, seeing our written words), multimodal models that incorporate visual input can overcome some limitations inherent in text-only tokenization.
Conclusion
The experiments discussed in this post highlight a fundamental challenge in how LLMs process text: the reliance on subword tokenization. When a word is seen frequently (like “strawberry”), the model learns its properties as a single token. However, when the word is altered and split into subword tokens, the fine-grained character information is lost, leading to errors in tasks like letter counting.
Moreover, our observation that image inputs yield more accurate results reinforces the idea that the process of converting visual text (via OCR) preserves character-level details that are otherwise obscured by subword tokenization. This finding draws an interesting parallel with human cognition, where multiple sensory inputs (such as visual feedback) help us verify and correct our outputs.
Understanding these limitations not only helps us interpret LLM behavior better but also points toward potential improvements—such as incorporating multimodal inputs or designing tokenization strategies that better preserve character-level information.
References & Further Reading
- Simbian AI Blog Post: “Getting GPT‑4 to Count ‘R’ in Strawberry” (Simbian AI Insights)
- Research Paper: “Tokenization Falling Short: The Curse of Tokenization” (arXiv:2406.11687)
- Discussion on LLM Tokenization: Various community posts and practitioner blogs
- OCR Systems & Character-Level Recognition: Information on systems like Tesseract and their output granularity
By understanding the interplay between tokenization, attention, and multimodal processing, we can better appreciate both the strengths and limitations of today’s LLMs. This knowledge not only demystifies seemingly simple tasks like counting letters but also paves the way for future innovations in AI language processing.