
Token

Simon Budziak, CTO
A Token is the fundamental unit of text that a language model processes. Rather than working with individual characters or whole words, LLMs break text into tokens through a process called tokenization.

Tokens can represent whole words, parts of words (subwords), or even individual characters, depending on the model's tokenizer. For example, "unbelievable" might be split into tokens like ["un", "believ", "able"], while common words like "the" typically form single tokens; the sketch after the list below shows how to inspect such a split. This approach offers several advantages:
  • Vocabulary Efficiency: Instead of needing an entry for every possible word, models work with a fixed vocabulary of roughly 50,000-100,000 tokens that can represent any text.
  • Handling Unknown Words: New or rare words can be broken into familiar subword tokens, allowing the model to process them meaningfully.
  • Multilingual Support: The same tokenizer can handle multiple languages by breaking them into common character sequences.
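To make the subword idea concrete, here is a minimal sketch using OpenAI's tiktoken library (discussed further below). The exact pieces depend on which encoding you load, so a real GPT-style tokenizer may split "unbelievable" differently than the illustrative ["un", "believ", "able"] above.

```python
import tiktoken  # pip install tiktoken

# Load a real encoding; cl100k_base is used by GPT-4-era models.
enc = tiktoken.get_encoding("cl100k_base")

# Encode a word into integer token ids.
ids = enc.encode("unbelievable")

# Map each token id back to the piece of text it represents.
pieces = [enc.decode_single_token_bytes(i).decode("utf-8") for i in ids]

print(ids)     # a short list of integer token ids
print(pieces)  # the subword pieces those ids stand for
```

Common words usually come back as a single id, while rarer or invented words are split into several familiar pieces.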
Understanding tokens is critical for working with LLMs because:
  • Pricing: API costs are calculated per token, not per word (roughly 1 token ≈ 0.75 words in English).
  • Context Limits: Model context windows are measured in tokens (e.g., 200K tokens for Claude 4.5), not characters or words.
  • Performance: Token count directly impacts processing speed and memory usage (a counting sketch follows this list).
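Because pricing, context limits, and performance are all measured in tokens, it is worth counting them before sending a request. Below is a rough sketch, again using tiktoken (an OpenAI tokenizer, so counts for other providers' models will differ); the price is an illustrative assumption, and the context window is the 200K figure cited above.

```python
import tiktoken  # pip install tiktoken

# Illustrative numbers only: real prices and limits vary by provider and model.
PRICE_PER_1K_INPUT_TOKENS = 0.003   # USD, assumed for this example
CONTEXT_WINDOW = 200_000            # tokens, the 200K figure cited above

enc = tiktoken.get_encoding("cl100k_base")

def estimate(prompt: str) -> None:
    n_tokens = len(enc.encode(prompt))
    n_words = len(prompt.split())
    cost = n_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
    share = n_tokens / CONTEXT_WINDOW
    print(f"{n_words} words -> {n_tokens} tokens, "
          f"~${cost:.4f} input cost, {share:.3%} of the context window")

estimate("A token is the fundamental unit of text that a language model processes.")
```

The word-to-token ratio printed here will land near the rough 0.75 words per token rule of thumb for English prose.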
Different models use different tokenization schemes (BPE, WordPiece, SentencePiece), which is why the same text may consume different token counts across providers. Common tokenizers include OpenAI's tiktoken (for GPT models) and SentencePiece (for many open-source models).
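To see how the choice of scheme changes counts, here is a small sketch comparing two tiktoken encodings (r50k_base, used by older GPT models, and cl100k_base). The same sentence yields different token counts, and a SentencePiece-based model would differ again.

```python
import tiktoken  # pip install tiktoken

text = "Tokenization schemes differ, so the same text can cost a different number of tokens."

# Compare two real tiktoken encodings; the counts will generally not match.
for name in ("r50k_base", "cl100k_base"):
    enc = tiktoken.get_encoding(name)
    print(f"{name}: {len(enc.encode(text))} tokens")
```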
