Token Cost Attribution in Multi-Model LangChain Pipelines
UsageMetadataCallbackHandler and get_usage_metadata_callback() let you attribute LLM costs per user, workflow, or agent node without building your own accounting layer.
The month-end invoice arrived three weeks into production. Forecast: $1,400. Invoice: $4,200.
The stack was a mixed Claude + GPT-4o pipeline: a LangGraph agent routing documents through a GPT-4o-mini classifier, a Claude Haiku extractor, and a GPT-4o synthesizer. The OpenAI invoice showed total GPT-4o-mini tokens for the month. The Anthropic invoice showed total Claude Haiku tokens. Neither was wrong. Neither was useful. No per-user breakdown. No per-workflow breakdown. Three line items for a pipeline running eight distinct workflows across forty enterprise customers.
The team spent three days correlating LangSmith traces to session IDs in the application database, reconstructing approximately where the overrun came from. The culprit: a batch summarization job triggered by one enterprise customer's weekly data import, misconfigured to run hourly. It had consumed 60% of the monthly token budget.
Three engineer-days to answer a question that should have been a one-line query.
In the previous post on prompt versioning, we treated prompts as versioned artifacts with a rollback path. Before that, prompt caching stopped you paying repeatedly for static context, and model.profile adapted context management to each model's actual capabilities. Three optimizations, but optimizing without measuring is guesswork. This post adds the measurement layer: attributing token costs to the user, workflow, or agent node that generated them, using two LangChain callbacks that most teams haven't discovered yet.
The Problem: LLM Costs Don't Come With Labels
Provider invoices are model-level aggregates. OpenAI tells you total GPT-4o-mini input and output tokens for the billing period. Anthropic tells you total Claude Haiku tokens. Neither knows which user request or workflow triggered each call. That attribution layer is entirely your problem.
Multi-model pipelines compound this. A single user request might hit GPT-4o-mini for classification, Claude Haiku for extraction, and GPT-4o for synthesis. The cost of "that user's request" is fragmented across three line items on two provider invoices. There's no join key.
| What the invoice gives you | What you actually need |
|---|---|
| Total GPT-4o-mini input tokens this month | Tokens consumed per user session |
| Total Claude Haiku output tokens this month | Cost per workflow execution |
| Combined model cost | Cost breakdown per agent node |
Homegrown solutions (counting tokens before .invoke(), wrapping calls in custom middleware) are brittle. They miss tool call outputs, retry tokens, and streaming chunks that arrive after the initial response. Pre-call estimates are estimates; the callback gets the ground truth after each .invoke() completes.
What's missing is a layer that operates at the call graph, not the provider account. Two callbacks in langchain-core give you exactly that.
Two Tools, Two Use Cases
The distinction maps to two different attribution problems. UsageMetadataCallbackHandler is a persistent handler — it accumulates usage across however many .invoke() calls you run against it, keyed by model name. get_usage_metadata_callback() is a context manager — usage is scoped to the with block and reset after. One is for billing dashboards and session rollups; the other is for per-request cost guards.
UsageMetadataCallbackHandler — session and workflow aggregate
Instantiate once, pass in config={"callbacks": [...]} to every model call in a session or workflow:
from langchain.chat_models import init_chat_model
from langchain_core.callbacks import UsageMetadataCallbackHandler
callback = UsageMetadataCallbackHandler()
llm_1 = init_chat_model(model="openai:gpt-4o-mini")
llm_2 = init_chat_model(model="anthropic:claude-haiku-4-5-20251001")
llm_1.invoke("Classify this document", config={"callbacks": [callback]})
llm_2.invoke("Extract obligations", config={"callbacks": [callback]})
print(callback.usage_metadata)
# {
# 'gpt-4o-mini': {'input_tokens': 312, 'output_tokens': 18, 'total_tokens': 330},
# 'claude-haiku-4-5-20251001': {'input_tokens': 890, 'output_tokens': 134, 'total_tokens': 1024}
# }

The dict keys are model name strings — not provider names. To reset between sessions, instantiate a new handler. There's no .reset() method; the handler accumulates as long as it lives. This makes it natural for billing rollups: one handler per user session, .usage_metadata persisted to your database at session end.
get_usage_metadata_callback() — per-request guard
from langchain_core.callbacks import get_usage_metadata_callback
BUDGET_TOKENS = 5_000
with get_usage_metadata_callback() as cb:
    result = pipeline.invoke({"input": user_prompt})
    total = sum(m["total_tokens"] for m in cb.usage_metadata.values())
    if total > BUDGET_TOKENS:
        logger.warning("Budget exceeded", extra={"tokens": total, "budget": BUDGET_TOKENS})

The usage in cb is scoped to the with block only — calls outside the block are invisible to it. Use this when you need to enforce a per-request token ceiling or capture cost for a single chain execution before logging it.
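When the guard should fail closed rather than just log, the same rollup can gate an exception. A minimal sketch under stated assumptions: TokenBudgetExceeded is a hypothetical application exception, and the sample dict stands in for the cb.usage_metadata you'd read inside the with block.

```python
# Hypothetical hard-enforcement variant of the budget guard. It operates on
# the same usage_metadata dict shape the callback exposes: one entry per
# model name, each with input/output/total token counts.
class TokenBudgetExceeded(RuntimeError):
    pass

def enforce_budget(usage_metadata: dict, budget_tokens: int) -> int:
    """Sum total_tokens across all models; raise if the ceiling is crossed."""
    total = sum(m.get("total_tokens", 0) for m in usage_metadata.values())
    if total > budget_tokens:
        raise TokenBudgetExceeded(f"{total} tokens used, budget was {budget_tokens}")
    return total

# Stand-in for cb.usage_metadata after a real pipeline invocation:
sample = {
    "gpt-4o-mini": {"input_tokens": 312, "output_tokens": 18, "total_tokens": 330},
    "claude-haiku-4-5-20251001": {"input_tokens": 890, "output_tokens": 134, "total_tokens": 1024},
}
print(enforce_budget(sample, budget_tokens=5_000))  # 1354 — under budget
```

Whether to raise mid-request or log and bill is an application decision; the callback only supplies the numbers.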
If callback.usage_metadata returns an empty dict for a model you expect to track, verify your langchain-core version before anything else — both utilities were added in langchain-core 0.3.49.
The Streaming Gotcha: OpenAI Requires Opt-In
Anthropic includes usage metadata in all responses by default — streaming and non-streaming. No configuration needed.
OpenAI does not. For streaming calls, usage data requires stream_usage=True on the ChatOpenAI instance. Without it, streaming calls contribute nothing to the callback — the model never appears in its dict. The usage arrives as a synthetic final chunk after the stream ends; the callback handles this transparently, but the opt-in is mandatory:
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_core.callbacks import UsageMetadataCallbackHandler
# OpenAI: opt-in required for streaming usage
openai_llm = ChatOpenAI(model="gpt-4o-mini", stream_usage=True)
# Anthropic: included by default, no config needed
anthropic_llm = ChatAnthropic(model="claude-haiku-4-5-20251001")
callback = UsageMetadataCallbackHandler()
for chunk in openai_llm.stream("Classify this intent", config={"callbacks": [callback]}):
    pass  # process chunk normally

print(callback.usage_metadata)
# Populated because stream_usage=True was set.
# Without it, 'gpt-4o-mini' is absent from this dict entirely.

Set stream_usage=True on model initialization, not per call. There's no downside to enabling it, and the cost of forgetting is silently wrong data — your dashboard shows Anthropic costs but GPT-4o-mini appears free.
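Because the failure mode is a missing key rather than a zero, a cheap defensive check is to compare the callback's dict against the models you expected to see. A sketch — missing_models is a hypothetical helper, and maintaining the expected-model set is up to your application:

```python
# Hypothetical guard for the silent-zero failure mode: a model that never
# reported usage is the symptom of a forgotten stream_usage=True (or a
# langchain-core older than 0.3.49).
def missing_models(usage_metadata: dict, expected: set[str]) -> set[str]:
    """Return the expected models that are absent from the callback's dict."""
    return expected - set(usage_metadata)

# Only the Anthropic model reported usage on this request:
tracked = {
    "claude-haiku-4-5-20251001": {"input_tokens": 890, "output_tokens": 134, "total_tokens": 1024},
}
print(missing_models(tracked, {"gpt-4o-mini", "claude-haiku-4-5-20251001"}))
# {'gpt-4o-mini'} — the streaming OpenAI model silently dropped out
```

Run it after each request in staging, or as a CI assertion against a recorded trace, and the misconfiguration surfaces immediately instead of at invoice time.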
The second gotcha is reasoning models. If your pipeline uses o3 or Claude with extended thinking enabled, reasoning tokens appear in output_token_details['reasoning'] — not in output_tokens. Teams that sum only output_tokens consistently undercount costs on reasoning-heavy workflows, then wonder why their LangSmith cost estimates don't match the invoice.
Warning: Reasoning tokens appear in output_token_details['reasoning'], not in output_tokens. If you're tracking a pipeline that uses o3 or Claude extended thinking, read output_token_details explicitly — the top-level output_tokens count will look lower than your invoice.
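Reading the details out is a one-liner per model. A sketch — reasoning_breakdown is a hypothetical helper, the key names follow the usage-metadata shape shown earlier in this post, and the sample numbers are illustrative:

```python
# Hypothetical helper: surface reasoning tokens alongside output_tokens so
# reasoning-heavy workflows aren't undercounted. Assumes the nested
# output_token_details dict described in the text above.
def reasoning_breakdown(usage_metadata: dict) -> dict:
    report = {}
    for model, usage in usage_metadata.items():
        details = usage.get("output_token_details") or {}
        report[model] = {
            "output_tokens": usage.get("output_tokens", 0),
            "reasoning_tokens": details.get("reasoning", 0),
        }
    return report

# Illustrative sample for a reasoning model:
sample = {
    "o3": {
        "input_tokens": 1200,
        "output_tokens": 300,
        "total_tokens": 1500,
        "output_token_details": {"reasoning": 2100},
    }
}
print(reasoning_breakdown(sample))
```

Feeding both numbers into your billing rows keeps the dashboard reconcilable against the invoice instead of mysteriously low.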
Multi-Tenant Attribution: Tagging by User and Workflow
The callbacks give you token counts. Attribution requires linking those counts to application-level identities — user IDs, tenant IDs, workflow IDs. The pattern is straightforward: one fresh UsageMetadataCallbackHandler per request, persisted to your database at session end.
from langchain_core.callbacks import UsageMetadataCallbackHandler
def process_user_request(user_id: str, session_id: str, prompt: str) -> dict:
    callback = UsageMetadataCallbackHandler()  # fresh per request — critical
    config = {
        "callbacks": [callback],
        "metadata": {  # surfaced in LangSmith trace
            "user_id": user_id,
            "session_id": session_id,
        },
    }
    result = pipeline.invoke({"input": prompt}, config=config)

    # Persist attribution after the call completes
    persist_usage_record(
        user_id=user_id,
        session_id=session_id,
        model_breakdown=callback.usage_metadata,
        # {'gpt-4o-mini': {...}, 'claude-haiku-4-5-20251001': {...}}
    )
    return result

Two layers are at work here. LangChain gives you the numbers; persist_usage_record is your billing layer — not LangChain's responsibility. The config["metadata"] dict surfaces user_id and session_id as LangSmith run metadata, enabling ad-hoc cost queries in the LangSmith UI grouped by user or workflow. These layers complement each other: LangSmith for investigation and team visibility, your database for billing logic and user-facing cost reports. The same LangSmith metadata model applies here as in the prompt versioning post — runs tagged with context are infinitely easier to investigate than anonymous traces.
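What the billing layer looks like is deliberately left open. For concreteness, here is one minimal sketch of a persist_usage_record using sqlite3 from the standard library — the schema, the extra conn parameter, and the flattening of the per-model dict are all assumptions, not a LangChain API:

```python
# Hypothetical billing layer: one row per (user, session, model), written
# after the request completes. Any database works; sqlite3 keeps the sketch
# self-contained.
import sqlite3

def persist_usage_record(user_id: str, session_id: str,
                         model_breakdown: dict,
                         conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS llm_usage ("
        " user_id TEXT, session_id TEXT, model TEXT,"
        " input_tokens INTEGER, output_tokens INTEGER, total_tokens INTEGER)"
    )
    for model, usage in model_breakdown.items():
        conn.execute(
            "INSERT INTO llm_usage VALUES (?, ?, ?, ?, ?, ?)",
            (user_id, session_id, model,
             usage.get("input_tokens", 0),
             usage.get("output_tokens", 0),
             usage.get("total_tokens", 0)),
        )
    conn.commit()

conn = sqlite3.connect(":memory:")
persist_usage_record(
    "user-42", "sess-1",
    {"gpt-4o-mini": {"input_tokens": 312, "output_tokens": 18, "total_tokens": 330}},
    conn,
)
print(conn.execute("SELECT SUM(total_tokens) FROM llm_usage WHERE user_id = 'user-42'").fetchone())
```

With rows shaped like this, the three-engineer-day question from the opening anecdote ("which customer burned the budget?") becomes a single GROUP BY.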
Key Takeaway: One handler instance per request, not one per application. Sharing a handler across concurrent requests means usage data from different users accumulates into the same dict — attribution becomes meaningless. Each request gets its own handler.
LangGraph Integration: Cost Per Agent Node
A session-level handler gives you totals across the graph. Per-node attribution requires scoping a handler to each node's execution — get_usage_metadata_callback() fits here because its scope is a single code block.
Emit node-level cost data to state using Annotated[dict, operator.or_], which merges the per-node dicts as nodes complete without requiring a side channel:
from langgraph.graph import StateGraph
from langchain_core.callbacks import get_usage_metadata_callback
from langchain_core.runnables import RunnableConfig
from typing import TypedDict, Annotated
import operator
class AgentState(TypedDict):
    messages: list
    node_costs: Annotated[dict, operator.or_]  # merges per-node dicts across graph

def research_node(state: AgentState, config: RunnableConfig) -> dict:
    with get_usage_metadata_callback() as cb:
        result = research_llm.invoke(state["messages"], config=config)
    return {
        "messages": [result],
        "node_costs": {"research": cb.usage_metadata},
    }

def writer_node(state: AgentState, config: RunnableConfig) -> dict:
    with get_usage_metadata_callback() as cb:
        result = writer_llm.invoke(state["messages"], config=config)
    return {
        "messages": [result],
        "node_costs": {"writer": cb.usage_metadata},
    }

# At graph completion, state["node_costs"] contains:
# {
#     "research": {"gpt-4o-mini": {"input_tokens": 2600, "output_tokens": 240, ...}},
#     "writer": {"claude-haiku-4-5-20251001": {"input_tokens": 890, "output_tokens": 134, ...}},
# }

The practical value: once you have per-node cost data, you know where to focus optimization. In a researcher-writer-reviewer graph, the researcher node running against a large-context model is typically responsible for 75–80% of total cost. That's the node where prompt caching for static system context and RAG results has the highest return. Without node-level attribution, you're tuning the wrong lever. You can't reduce costs you can't locate.
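Once the graph finishes, a small rollup turns node_costs into a ranking. A sketch — tokens_per_node is a hypothetical helper, and the sample dict mirrors the state shape shown above with total_tokens filled in:

```python
# Hypothetical rollup: collapse state["node_costs"] (node -> model -> usage)
# into total tokens per node, so the dominant node stands out at a glance.
def tokens_per_node(node_costs: dict) -> dict:
    return {
        node: sum(m.get("total_tokens", 0) for m in models.values())
        for node, models in node_costs.items()
    }

# Mirrors the shape of state["node_costs"] at graph completion:
node_costs = {
    "research": {"gpt-4o-mini": {"input_tokens": 2600, "output_tokens": 240, "total_tokens": 2840}},
    "writer": {"claude-haiku-4-5-20251001": {"input_tokens": 890, "output_tokens": 134, "total_tokens": 1024}},
}
print(tokens_per_node(node_costs))  # {'research': 2840, 'writer': 1024}
```

Sort the result descending and the optimization target is the first line.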
Takeaway
Provider invoices tell you what you spent on each model this month. They don't tell you which user, workflow, or agent node spent it. Bridging that gap doesn't require custom middleware — two callbacks in langchain-core give you the data, and your application layer owns what you do with it.
Three things that carry forward:
- Match scope to purpose. get_usage_metadata_callback() for per-request guards and debugging; UsageMetadataCallbackHandler for session rollups and billing aggregation.
- OpenAI streaming requires stream_usage=True. Set it on model initialization, not per call. The cost of forgetting is silently wrong data — your mixed-provider cost dashboard will show Anthropic costs correctly and OpenAI costs as zero.
- One handler per request, not per application. Shared handlers across concurrent requests corrupt attribution. Instantiate fresh; persist at session end.
Dealing with unexplained token costs in a multi-model pipeline? Let's talk.