Tokenization is the process of breaking text into smaller units, called tokens, that a language model can process. A token is roughly a word or a piece of a word: a common word such as “visibility” might be a single token, while a longer or rarer word is split into several.
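To make the word-versus-word-piece idea concrete, here is a minimal sketch of a greedy longest-match splitter. It is a toy, not how any particular model tokenizes: real tokenizers learn vocabularies of tens of thousands of entries from data, and the small `VOCAB` set and the function names below are hypothetical, chosen only for illustration.

```python
# Hypothetical toy vocabulary; real tokenizers learn theirs from large corpora.
VOCAB = {"visibility", "token", "ization", "un", "believ", "able"}


def tokenize_word(word, vocab):
    """Split one word by repeatedly taking the longest piece found in vocab.

    If no vocabulary entry matches at a position, fall back to a single character.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is in the vocabulary
        # or only one character is left.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        tokens.append(word[start:end])
        start = end
    return tokens


def tokenize(text, vocab=VOCAB):
    """Lowercase the text, split on whitespace, then split each word into pieces."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(tokenize_word(word, vocab))
    return tokens


print(tokenize("visibility"))    # ['visibility']
print(tokenize("tokenization"))  # ['token', 'ization']
print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
```

With this toy vocabulary, a word the vocabulary knows stays whole, while a word it does not know is assembled from smaller known pieces.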
Why it matters
Models don’t read characters or words the way people do — they read tokens. This has practical consequences:
- Limits are measured in tokens. A model’s context window — how much it can consider at once — is counted in tokens, not pages.
- Cost and speed scale with tokens. Longer inputs and outputs mean more tokens to process, which generally means higher cost and slower responses.
- It shapes comprehension. How text is tokenized affects how the model interprets it, which is one reason clear, well-structured content is easier for models to use.
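The first two points come down to simple arithmetic. The sketch below assumes that the context window covers both the prompt and the generated output, and that pricing is quoted per 1,000 tokens. The function names, window size, and prices are hypothetical and not taken from any specific provider.

```python
def fits_context(prompt_tokens, max_output_tokens, context_window):
    """Return True if the prompt plus the reserved output fits in the window."""
    return prompt_tokens + max_output_tokens <= context_window


def estimate_cost(input_tokens, output_tokens, input_price_per_1k, output_price_per_1k):
    """Estimate the cost of one request from token counts and per-1,000-token prices."""
    input_cost = input_tokens / 1000 * input_price_per_1k
    output_cost = output_tokens / 1000 * output_price_per_1k
    return input_cost + output_cost


# Hypothetical numbers, for illustration only.
CONTEXT_WINDOW = 8000

print(fits_context(6000, 1000, CONTEXT_WINDOW))  # True: 7000 <= 8000
print(fits_context(7500, 1000, CONTEXT_WINDOW))  # False: 8500 > 8000
print(estimate_cost(2000, 500, 0.5, 1.5))        # 1.75
```

Under these assumptions, trimming a prompt both frees room in the window and lowers the cost of every request that uses it.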
For AEO, you rarely interact with tokenization directly, but it underpins why concise, clearly written content is processed more reliably than dense or messy text. See how LLMs learn about brands.