lubu labs

Prompt Caching Across Claude, GPT, and Gemini: Stop Paying for the Same Tokens Twice

Cut LLM API costs up to 90% by caching repeated context. How prompt caching works across Claude, GPT, and Gemini - and when it doesn't.

Simon Budziak
CTO

We built a document analysis pipeline for a legal-tech client. The workflow processed contracts and compliance documents — the agent had a 600-token system prompt defining its role and output format, plus 12,000–15,000 tokens of document context injected on every request. A typical session: ten questions per document, batched across twenty documents.

By request three, we were billing the same 15,000 tokens three times. By request ten, ten times. The document hadn't changed. Only the question had.

In the previous post on model.profile, we used LangChain's model capability metadata to adapt context management to the active model — triggering summarization based on the model's actual window rather than a hardcoded threshold. This post is the companion piece: once you're managing context intelligently, the next question is how to stop paying full price for the static parts you send on every request.

Prompt caching solves this directly. Most teams discover it when the invoice arrives.

The Problem: Static Context Billed as if It Were New

A typical document analysis request has three layers:

  • System prompt — role definition, output format, constraints. 500–800 tokens, identical on every call.
  • Document context — the file under analysis. 10,000–15,000 tokens, identical across all questions on that document.
  • User message — the actual question. 50–150 tokens, unique per turn.

In a ten-question session on a 15,000-token document, you're billing roughly 157,500 tokens ((15,600 static + ~150 question) × 10). The 150-token question is what changes. The rest is the same prefix, repriced every time.

Prompt caching solves this by keeping a KV cache representation of a prompt prefix server-side. When you send the same prefix again within the cache window, the provider skips reprocessing and charges a fraction of the normal input rate. The question still gets processed fresh — the static scaffolding doesn't.
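A back-of-the-envelope sketch makes the economics concrete. The multipliers below are assumptions modeled on Anthropic's documented ephemeral-cache pricing (roughly 1.25× the input rate for cache writes, 0.1× for cache reads) — verify current provider pricing before relying on them:

```python
# Token cost for the ten-question session described above, with and
# without prompt caching. Multipliers are assumptions — check current
# provider pricing before using these numbers for budgeting.
STATIC_PREFIX = 15_600   # system prompt + document context, identical each call
QUESTION = 150           # unique per turn
CALLS = 10

CACHE_WRITE_MULT = 1.25  # assumed surcharge for writing the prefix to cache
CACHE_READ_MULT = 0.10   # assumed discount for reading the cached prefix

uncached = (STATIC_PREFIX + QUESTION) * CALLS

cached = (
    STATIC_PREFIX * CACHE_WRITE_MULT                  # call 1: cache write
    + STATIC_PREFIX * CACHE_READ_MULT * (CALLS - 1)   # calls 2-10: cache reads
    + QUESTION * CALLS                                # every question billed fresh
)

print(f"Uncached: {uncached:,.0f} token-equivalents")
print(f"Cached:   {cached:,.0f} token-equivalents")
print(f"Savings:  {1 - cached / uncached:.0%}")
```

Under these assumptions the session drops from about 157,500 billed token-equivalents to about 35,000 — and the savings grow with every additional question on the same document.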

How Each Provider Handles Caching

The three providers take meaningfully different approaches:

| Provider | Mechanism in LangChain | What the docs explicitly say | Operational note |
| --- | --- | --- | --- |
| Claude | Explicit cache_control on content blocks | Prompt caching is beta, requires a beta header, and supports caching tools, messages, and blocks; cache lifetime is 5 minutes and refreshes on hit | Minimum cacheable length varies by model |
| OpenAI | Implicit by default; prompt_cache_key also exists as an explicit control | Newer models automatically cache large prompt prefixes; LangChain cites 1024 tokens as the current threshold | Cache details may appear in provider-specific metadata |
| Gemini | Implicit by default; explicit cached content can be referenced | Exact-prefix matches can receive cache savings; LangChain can pass a cachedContent reference if the cache was created outside LangChain | LangChain does not create the explicit cache for you |

OpenAI and Gemini both support implicit caching, so repeated prompt prefixes can benefit without adding provider-specific markers to every message. But "implicit" doesn't mean "nothing to know": OpenAI also exposes prompt_cache_key as an explicit control in LangChain, and Gemini can consume an explicit cached content reference created outside LangChain.

Anthropic is the most explicit of the three. You mark specific content blocks with cache_control: {"type": "ephemeral"} to designate cache breakpoints. That gives you precise control over what gets cached, but it also means a missed marker means no caching at all.

The practical consequence in a multi-provider codebase: OpenAI and Gemini can reward stable prefixes passively, while Anthropic requires deliberate request shaping and the provider-specific beta setup documented by LangChain.

Implementing Caching with LangChain

For the document analysis agent using Claude, the setup targets the two static layers - system prompt and document context - and leaves the user question uncached.

python
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage
 
model = ChatAnthropic(
    model="claude-sonnet-4-5-20250929",
    # Anthropic prompt caching is documented as beta in LangChain.
    default_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
)
 
SYSTEM_PROMPT = (
    "You are a senior legal analyst. Extract obligations, deadlines, "
    "and risk factors from contracts. Be precise and cite clause numbers."
)
 
 
def build_cached_prefix(document_text: str) -> list:
    """Build the static, cacheable part of the conversation."""
    system = SystemMessage(content=[{
        "type": "text",
        "text": SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},
    }])
 
    # Document context: large, static per session — mark for caching
    doc_context = HumanMessage(content=[{
        "type": "text",
        "text": f"Contract under review:\n\n{document_text}",
        "cache_control": {"type": "ephemeral"},
    }])
 
    return [system, doc_context]
 
 
async def analyze_document(document_text: str, questions: list[str]) -> list[str]:
    """Run multiple questions against the same document, paying for context once."""
    cached_prefix = build_cached_prefix(document_text)
    results = []
 
    for question in questions:
        # Only the question changes; the cached prefix can be reused on later calls.
        messages = cached_prefix + [HumanMessage(content=question)]
        response = await model.ainvoke(messages)
        results.append(response.content)
 
    return results

Two cache_control markers, two cache breakpoints. On the first call, Anthropic creates the cache entries for those blocks. On later calls within the cache window, the same prefix can be read from cache instead of being fully reprocessed.

The LangChain docs I checked document Anthropic's standard 5-minute cache lifetime and note that the window refreshes when the cache is hit. If your workflow has long gaps between requests, verify any extended-TTL options directly against the current provider docs before depending on them in production.

For OpenAI, the same prompt shape can benefit from implicit caching without adding cache_control. LangChain's current docs say newer OpenAI models automatically cache large prompt prefixes above roughly 1024 tokens.

LangChain also documents prompt_cache_key as an explicit OpenAI control when you want cache grouping to be deliberate rather than inferred from the raw prefix, such as when several workflows share similar scaffolding.

For Gemini, LangChain documents implicit context caching by default. If the start of the history exactly matches cached context, Gemini can reduce token cost for that request. LangChain also supports passing a cachedContent reference if you created an explicit Gemini cache outside LangChain.

Track hits in the response metadata to verify caching is working. Anthropic exposes raw token counts in response_metadata. LangChain's general token-usage model also includes input_token_details in usage_metadata when providers supply normalized data, but the OpenAI prompt-caching docs specifically note that cached token counts are not yet standardized there and may instead appear in response_metadata.

python
# Claude
usage = response.response_metadata.get("usage", {})
cache_created = usage.get("cache_creation_input_tokens", 0)
cache_read = usage.get("cache_read_input_tokens", 0)
print(f"Cache write: {cache_created} | Cache read: {cache_read}")
 
# Normalized usage metadata, when the provider exposes it
input_details = (response.usage_metadata or {}).get("input_token_details", {})
cache_read = input_details.get("cache_read", 0)
print(f"Normalized cache read count: {cache_read}")

If cache read counts are consistently zero, your prefix probably isn't matching. Check whether the supposedly static prefix is actually identical between calls, whether the request is large enough for the provider's caching threshold, and whether the request pattern stays inside the provider's cache window.
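The first of those checks — is the prefix actually identical? — is easy to automate. A sketch, where fingerprint_prefix is a hypothetical helper rather than a LangChain API:

```python
import hashlib
import json


def fingerprint_prefix(messages: list) -> str:
    """Hash the serialized message prefix so drift between calls is visible.

    Any variation — a timestamp, reordered keys, trailing whitespace —
    produces a different hash, and a guaranteed cache miss.
    """
    payload = json.dumps(
        [{"type": type(m).__name__, "content": m.content} for m in messages],
        sort_keys=True,
        default=str,
    )
    return hashlib.sha256(payload.encode()).hexdigest()
```

Log the fingerprint on every call; if two calls in the same session print different hashes, your "static" prefix isn't static, and that's where to look first.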

What to Cache — and What Not To

Not every part of a prompt is a good cache target.

Cache these:

  • System prompts — nearly always static
  • Document or knowledge base context injected per session
  • Tool definitions if registering many tools
  • Few-shot examples that don't vary per request

Don't cache these:

  • User messages — they change every turn
  • Conversation history — it grows, breaking prefix matching
  • Dynamic context (timestamps, user-specific data) — any variation invalidates the prefix on every request

Warning: Anthropic cache creation costs extra. If a cached block is never reused inside the cache window, you pay that higher initial cost without getting a discounted read. Monitor cache_creation_input_tokens and cache_read_input_tokens on every response. If hit rates are low, your workload may be too sparse to benefit.
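A minimal accumulator for exactly that monitoring, reading the same Anthropic usage fields named above — CacheStats is a hypothetical helper, not a LangChain API:

```python
class CacheStats:
    """Accumulate Anthropic cache write/read token counts across responses."""

    def __init__(self):
        self.created = 0  # tokens billed at the higher cache-write rate
        self.read = 0     # tokens billed at the discounted cache-read rate

    def record(self, response) -> None:
        usage = response.response_metadata.get("usage", {})
        self.created += usage.get("cache_creation_input_tokens", 0)
        self.read += usage.get("cache_read_input_tokens", 0)

    @property
    def hit_ratio(self) -> float:
        """Fraction of cache-related tokens that were served from cache."""
        total = self.created + self.read
        return self.read / total if total else 0.0
```

Call record() on every response and alert when hit_ratio stays low: it means you are paying the write surcharge without collecting the read discount.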

Takeaway

Prompt caching is one of the few cost optimizations that requires no model changes, no architectural overhaul, and no quality tradeoffs. For any workflow with a large repeated static prefix (document analysis, RAG pipelines, multi-turn agents with fixed system prompts), it can cut input token costs by 80–90% on subsequent calls.

Four things to internalize:

  • Implicit vs. explicit is a control tradeoff, not a quality difference. OpenAI and Gemini support implicit caching by default, while Anthropic requires explicit markup on cacheable blocks. Don't assume caching is happening — verify it in the metadata.
  • Static prefix is the only thing that caches. Any variation in the cacheable layers — timestamps, user IDs, dynamic context — breaks prefix matching and resets the cost.
  • Initial cache creation can cost more than a normal request. Track hit rates in production. Low-volume sessions that consistently miss the cache window can end up paying extra without seeing any benefit.
  • This pairs with model-aware context management. model.profile tells you when to summarize; prompt caching tells you what to stop repricing. Used together, they're the two main levers for cost-controlled, context-aware production pipelines.

Building a multi-model pipeline and want to get the cost structure right? Let's talk.

