Token

Simon Budziak, CTO
A Token is the fundamental unit of text that a language model processes. Rather than working with individual characters or whole words, LLMs break text into tokens through a process called tokenization.
Tokens can represent whole words, parts of words (subwords), or even individual characters, depending on the model's tokenizer. For example, "unbelievable" might be split into tokens like ["un", "believ", "able"], while a common word like "the" typically forms a single token. Tokenization matters for several reasons:
- Vocabulary Efficiency: Instead of memorizing millions of unique words, models work with a fixed vocabulary of 50,000-100,000 tokens that can represent any text.
- Handling Unknown Words: New or rare words can be broken into familiar subword tokens, allowing the model to process them meaningfully.
- Multilingual Support: The same tokenizer can handle multiple languages by breaking them into common character sequences.
- Pricing: API costs are calculated per token, not per word (roughly 1 token ≈ 0.75 words in English).
- Context Limits: Model context windows are measured in tokens (e.g., 200K tokens for Claude 4.5), not characters or words.
- Performance: Token count directly impacts processing speed and memory usage.
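The subword splitting described above can be sketched with a greedy longest-match tokenizer. This is a simplified illustration, not how production tokenizers work: real tokenizers such as BPE learn a vocabulary of tens of thousands of merges from data, and the tiny vocabulary below is purely hypothetical.

```python
# Minimal sketch of greedy longest-match subword tokenization.
# VOCAB is a hypothetical toy vocabulary; real models learn
# 50,000-100,000 tokens from large text corpora.
VOCAB = {"the", "un", "believ", "able", "token"}

def tokenize(word: str) -> list[str]:
    """Split a word into the longest matching vocabulary pieces,
    left to right; unknown characters become single-char tokens."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate piece first.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in VOCAB:
                tokens.append(piece)
                i = j
                break
        else:
            # No vocabulary piece matched: fall back to one character.
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
print(tokenize("the"))           # ['the']
```

Even this toy version shows why unknown words are not a problem: any string can always be expressed as a token sequence, in the worst case one character at a time.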