Model-Aware Context Management with LangChain's model.profile
Stop hardcoding context limits. Use LangChain's model.profile to drive dynamic summarization and capability gating based on the actual model.
We had a summarization middleware that triggered at 8,000 tokens. The logic was simple: once the conversation grew beyond that threshold, compress the older messages before sending the next request. It worked fine for months.
Then we switched one workflow to Claude Sonnet 4.5 — 200k standard context, with 1M available in beta for eligible orgs — and it started summarizing conversations that were barely getting started. We'd hit 8,000 tokens on turn four of a ten-turn research session, collapse half the context, and wonder why the agent kept forgetting what the user had said two minutes earlier.
The fix wasn't the number. The problem was that the number was there at all. We had encoded a model-specific assumption — GPT-4o's practical context ceiling — into code that was supposed to be model-agnostic. Every time we swapped models, those assumptions broke silently.
LangChain's .profile attribute is built for exactly this problem. It lets your runtime read model capabilities directly, so context logic can adapt to the active model instead of assuming fixed limits.
The Problem: Blind Context Management
The hardcoded-threshold pattern is everywhere. You'll see it in production LangGraph graphs, in LangChain middleware, in RAG pipelines:
```python
MAX_CONTEXT_MESSAGES = 20
SUMMARIZE_AFTER_TOKENS = 8_000

if len(messages) > MAX_CONTEXT_MESSAGES:
    messages = summarize_and_trim(messages)
```

The intent is reasonable — prevent context overflow, keep latency predictable. But those constants are secretly tied to whatever model you were using when you wrote them. When you move across the model landscape, they become a liability.
Consider what this looks like in practice across three common models:
- GPT-4o-mini caps at 128k input tokens. An 8k trigger threshold uses ~6% of capacity.
- Claude Sonnet 4.5 has a 200k standard context window (~4% at an 8k trigger), and can run at 1M in beta (~0.8% at the same trigger).
- Gemini 2.5 Flash supports 1,048,576 input tokens (~1M). That same 8k trigger uses only ~0.8% of available context.
The "obvious fix" is to define per-model constants and branch on model name. But that scales poorly: add five models, add five branches. Add a new provider, remember to update every piece of middleware that makes model assumptions. Miss one and you get silent degradation — wrong context limits, disabled capabilities, wasted tokens — with no error to trace.
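For contrast, here is roughly what the per-model-constants approach ends up looking like — a sketch with illustrative names and limits, not code from any real deployment:

```python
# The branching anti-pattern: a hand-maintained table of model limits.
MODEL_LIMITS = {
    "gpt-4o-mini": 128_000,
    "claude-sonnet-4-5": 200_000,
    "gemini-2.5-flash": 1_048_576,
}

def limit_for(model_name: str) -> int:
    # Every new model needs a new entry here. A missing entry doesn't
    # raise; it silently falls back to a conservative default.
    return MODEL_LIMITS.get(model_name, 8_000)
```

Note the failure mode: an unlisted model doesn't error, it just quietly gets the 8k default — exactly the silent degradation described above.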
What you actually want is to ask the model what it can do, then write logic against that answer.
model.profile: What It Is
LangChain's .profile attribute is a dict of capability metadata that chat models can expose (requires langchain>=1.1). The data comes from models.dev, an open-source registry covering hundreds of models across major providers.
The profile exposes the fields most relevant to runtime decisions:
```python
from langchain.chat_models import init_chat_model

model = init_chat_model("claude-sonnet-4-5-20250929")
print(model.profile)
# {
#     "max_input_tokens": 1000000,
#     "tool_calling": True,
#     "structured_output": True,
#     "image_inputs": True,
#     "reasoning_output": False,
#     ...
# }
```

Important: For Claude 4.5, 1M context is a beta mode on Anthropic and requires the `context-1m-2025-08-07` beta header.
The key fields and what they unlock:
| Field | Type | Use |
|---|---|---|
| `max_input_tokens` | int | Drive dynamic summarization thresholds |
| `tool_calling` | bool | Gate tool registration and invocation |
| `structured_output` | bool | Gate `.with_structured_output()` calls |
| `image_inputs` | bool | Gate multimodal message construction |
| `reasoning_output` | bool | Detect models that can return reasoning content |
LangChain merges models.dev data with per-provider augmentations it maintains in each provider package. You access it through the same interface regardless of provider.
Dynamic Summarization Middleware
The core use case: trigger summarization based on model.profile["max_input_tokens"] rather than a constant. Pick a threshold fraction — 80% is a reasonable default — and let the model's actual window determine when to compress.
```python
from typing import TypedDict

from langchain_core.messages import BaseMessage, HumanMessage, SystemMessage
from langchain_core.messages.utils import count_tokens_approximately


class AgentState(TypedDict):
    messages: list[BaseMessage]


def should_summarize(messages: list[BaseMessage], model, threshold: float = 0.8) -> bool:
    """Trigger summarization at a fraction of the active model's context window."""
    profile = getattr(model, "profile", None) or {}
    max_tokens = profile.get("max_input_tokens")
    if not max_tokens:
        # Fallback when profile data is unavailable.
        return len(messages) > 20
    estimated_tokens = count_tokens_approximately(messages)
    return estimated_tokens > int(max_tokens * threshold)


async def summarize_messages(messages: list[BaseMessage], model) -> str:
    """Compress older messages into a summary."""
    summary_prompt = [
        SystemMessage(
            content="Summarize the following conversation concisely, preserving key facts and decisions."
        ),
        HumanMessage(
            content="\n".join(f"{m.type}: {m.content}" for m in messages)
        ),
    ]
    response = await model.ainvoke(summary_prompt)
    return response.content if isinstance(response.content, str) else str(response.content)


async def context_management_node(state: AgentState, model) -> dict:
    messages = state["messages"]
    if not should_summarize(messages, model):
        return {}

    # Keep the first system message (if present) and the last 8 messages (~4 turns).
    head = [messages[0]] if messages and isinstance(messages[0], SystemMessage) else []
    keep_last_n = 8
    split_idx = max(len(head), len(messages) - keep_last_n)
    to_summarize = messages[len(head):split_idx]
    tail = messages[split_idx:]
    if not to_summarize:
        return {}

    summary_text = await summarize_messages(to_summarize, model)
    summary_msg = SystemMessage(content=f"[Conversation summary: {summary_text}]")
    return {"messages": head + [summary_msg] + tail}
```

The critical property here: the same node handles GPT-4o-mini (128k window, triggers at ~102k tokens) and Gemini 2.5 Flash (~1M window, triggers at ~839k tokens at an 80% threshold) without model-specific branching. Swap the model at the top of your graph, and the middleware recalibrates.
The 80% threshold is deliberate. You want headroom for the model's response tokens, tool call outputs, and any system prompt expansion. Running at 100% invites truncation errors; running too conservatively wastes capacity. 75–85% is the practical sweet spot.
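That headroom reasoning can be made explicit instead of folded into the threshold. A minimal sketch — the `reserved_output` figure is an illustrative assumption, not a LangChain default:

```python
def summarize_trigger(max_input_tokens: int, threshold: float = 0.8,
                      reserved_output: int = 4_096) -> int:
    """Token count at which summarization should fire, after first
    reserving room for the model's response."""
    usable = max_input_tokens - reserved_output
    return int(usable * threshold)

# 128k-window model: fires around 99k tokens
# 1M-window model (1,048,576): fires around 835k tokens
```

Separating the response reservation from the threshold fraction makes each knob independently tunable — useful when a workflow's outputs are unusually long.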
Beyond Context: Capability Gating
The same pattern extends to any capability check. Two cases that come up frequently in multi-model deployments:
Structured output fallback. Not all models support .with_structured_output(). For those that don't, you need a prompt-based extraction path:
```python
import json

from pydantic import BaseModel


class ExtractionResult(BaseModel):
    entities: list[str]
    sentiment: str


def extract_structured(text: str, model) -> ExtractionResult:
    profile = getattr(model, "profile", None) or {}
    if profile.get("structured_output"):
        structured_model = model.with_structured_output(ExtractionResult)
        return structured_model.invoke(text)
    # Prompt-based fallback for models without native structured output.
    response = model.invoke(
        f"Extract entities and sentiment from: {text}\n"
        'Reply in JSON: {"entities": [...], "sentiment": "..."}'
    )
    data = json.loads(response.content)
    return ExtractionResult(**data)
```

Modality gating. Passing image URLs to a model that doesn't support vision causes a runtime error — often one with an opaque error message. Gate it at construction time:
```python
from typing import Any


def build_message_content(text: str, image_url: str | None, model) -> list[dict[str, Any]]:
    content: list[dict[str, Any]] = [{"type": "text", "text": text}]
    profile = getattr(model, "profile", None) or {}
    if image_url and profile.get("image_inputs"):
        # Cross-provider standard content block.
        content.append({"type": "image", "source_type": "url", "url": image_url})
    elif image_url:
        # Degrade gracefully: mention the image in text instead of sending it.
        content[0]["text"] += f"\n[Image provided but not supported by this model: {image_url}]"
    return content
```

Both patterns follow the same structure: read capability from the profile, branch on the result, handle the unsupported case explicitly. No try/catch around model calls, no provider-specific conditionals, no magic strings.
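If that read-check-branch shape starts repeating across your codebase, it can be factored into a small helper. A sketch under stated assumptions — the `gated` helper and `FakeModel` stub are hypothetical names for illustration, not LangChain API:

```python
from typing import Any, Callable

def gated(model: Any, capability: str,
          supported: Callable[[], Any], fallback: Callable[[], Any]) -> Any:
    """Run `supported` if the model's profile advertises `capability`,
    otherwise run `fallback`. A missing profile counts as unsupported."""
    profile = getattr(model, "profile", None) or {}
    return supported() if profile.get(capability) else fallback()

# Stub standing in for a real chat model, to show the branching.
class FakeModel:
    profile = {"structured_output": True}

result = gated(FakeModel(), "structured_output",
               lambda: "native path", lambda: "prompt fallback")
# result == "native path"; an object with no profile takes the fallback
```

Treating "no profile" as "unsupported" is the conservative default: it degrades to the fallback path rather than invoking a capability the model may not have.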
When Profile Data Is Wrong
models.dev is community-maintained. For new model releases, niche fine-tunes, or models from smaller providers, the profile may be missing, stale, or simply wrong. This is a real limitation worth knowing before you ship.
Two override strategies, depending on how much control you need:
Quick fix at instantiation. Pass profile= directly to init_chat_model. This overrides the registry lookup entirely:
```python
model = init_chat_model("some-new-model", profile={
    "max_input_tokens": 100_000,
    "tool_calling": True,
    "structured_output": True,
    "image_inputs": False,
})
```

Non-mutating update for a specific invocation. If you need different profile values for a scoped context without mutating shared model state, use model_copy:
```python
# Override max_input_tokens without touching the original model object.
conservative_profile = (model.profile or {}) | {"max_input_tokens": 50_000}
conservative_model = model.model_copy(update={"profile": conservative_profile})
```

This is particularly useful in multi-tenant applications where different users or workflows need different effective limits on the same underlying model.
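The non-mutating property comes from two pieces: Python's dict merge operator builds a new dict, and Pydantic's `model_copy` produces a new model object. The dict half in isolation, with illustrative values:

```python
base_profile = {"max_input_tokens": 200_000, "tool_calling": True}

# `|` produces a new dict; the right-hand side wins on key conflicts.
tenant_profile = base_profile | {"max_input_tokens": 50_000}

# base_profile is unchanged, so other tenants sharing the model
# object keep the full window.
```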
One honest caveat: model profiles are in beta. The field names and structure may change as LangChain finalizes the API. Pin your LangChain version in production, and check the changelog when upgrading. That said, the direction is clear — this surface area is expanding, not contracting.
Takeaway
Model profiles let you write context management that adapts to the model in use rather than encoding assumptions about a specific one. The core principle: query capabilities at runtime, branch on results, and handle unsupported cases gracefully.
Four things worth internalizing from this pattern:
- Use fractions, not constants. `max_input_tokens * 0.8` ages better than `8_000` when models change.
- Gate before you invoke. Checking `profile["image_inputs"]` before constructing a multimodal message is cheaper than catching a runtime error.
- Degrade explicitly. When a capability isn't present, do something intentional — fall back, skip, or inform — rather than silently passing wrong input.
- Override when needed, contribute when possible. The registry is useful only if it's accurate; model-specific workarounds you keep local are debt.
The broader shift is about where model assumptions live. In most codebases today, they're scattered across middleware constants, provider-specific branches, and environment variable configs. model.profile gives you one place to centralize them — and one place to remove them when the model changes.
Need help designing context management that scales across your model roster? Let's talk.