Agent Evaluation · March 2026 · 12 min read

How LangChain Builds Evals for Deep Agents

TL;DR

The best agent evals directly measure agent behavior that matters. Here is how the LangChain team sources data, creates metrics, and runs targeted experiments to make agents more accurate and reliable.

Evals shape agent behavior

The LangChain team has been curating evaluations to measure and improve Deep Agents. Deep Agents is an open source, model agnostic agent harness that powers products like Fleet and Open SWE. Evals define and shape agent behavior, which is why it is so important to design them thoughtfully.

Every eval is a vector that shifts the behavior of your agentic system. For example, if an eval for efficient file reading fails, you will likely tweak the system prompt or the read_file tool description to nudge behavior until it passes. Every eval you keep applies pressure on the overall system over time.

It is crucial to be thoughtful when adding evals. It can be tempting to blindly add hundreds (or thousands) of tests. This leads to an illusion of "improving your agent" by scoring well on an eval suite that may not accurately reflect behaviors you care about in production.

More evals do not equal better agents. Instead, build targeted evals that reflect desired behaviors in production.

When building Deep Agents, LangChain catalogs the behaviors that matter in production, such as retrieving content across multiple files in the filesystem or accurately composing 5+ tool calls in sequence. Rather than using benchmark tasks in aggregate, the team takes the following approach to eval curation:

  • Decide which behaviors the agent should follow. Then research and curate targeted evals that measure those behaviors in a verifiable way.
  • For each eval, add a docstring that explains how it measures an agent capability. This ensures each eval is self-documenting. Each eval is also tagged with categories like tool_use to enable grouped runs; see the sketch after this list.
  • Review output traces to understand failure modes and update eval coverage.
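
In practice, an eval in this style can be a pytest test whose docstring states the capability and whose marker carries the category. The marker name and the deep_agent fixture below are illustrative, not the repo's actual names; the pattern is what matters:

    import pytest

    @pytest.mark.eval_category("tool_use")
    def test_parallel_file_reads(deep_agent):  # `deep_agent` is an assumed fixture
        """Measures whether the agent batches independent read_file calls into
        a single parallel step instead of issuing them one at a time."""
        result = deep_agent.invoke(
            {"messages": [{"role": "user", "content": "Summarize src/a.py and src/b.py"}]}
        )
        # Largest batch of tool calls emitted in a single assistant message.
        batch_sizes = [
            len(msg.tool_calls)
            for msg in result["messages"]
            if getattr(msg, "tool_calls", None)
        ]
        assert max(batch_sizes, default=0) >= 2, "expected parallel read_file calls"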

Because every eval run is traced to a shared LangSmith project, anyone on the team can jump in to analyze issues, make fixes, and reassess the value of a given eval. This creates shared responsibility for adding and maintaining good evals. Running many models across many evals can also get expensive, so targeted evals save money while improving the agent.

In this post

  • How LangChain curates data
  • How LangChain defines metrics
  • How LangChain runs evals

How LangChain curates data

There are a few ways the team sources evals:

  • Using feedback from dogfooding their agents
  • Pulling selected evals from external benchmarks (like Terminal Bench 2.0 or BFCL) and often adapting them for a particular agent
  • Writing artisanal evals and unit tests by hand for behaviors the team thinks are important

The team dogfoods their agents every day. Every error becomes an opportunity to write an eval and update the agent definition and context engineering practices.

Note: LangChain separates SDK unit and integration tests (system prompt passthrough, interrupt config, subagent routing) from model capability evals. Any model passes those tests, so including them in scoring adds no signal. You should absolutely write unit and integration tests, but this post focuses solely on model capability evals.

Dogfooding agents and reading traces are great sources of evals

Dogfooding makes finding mistakes possible, and traces give the team the data to understand agent behavior. Because traces are often large, a built-in agent like Polly or Insights is used to analyze them at scale. You can do the same with other agents (like Claude Code or the Deep Agents CLI) plus a way to pull down traces, like the LangSmith CLI. The goal is to understand each failure mode, propose a fix, rerun the agent, and track progress and regressions over time.
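
If you prefer a script to a CLI, the LangSmith Python SDK exposes the same trace data. A minimal sketch using Client.list_runs, where the project name is a placeholder:

    from langsmith import Client

    client = Client()  # reads LANGSMITH_API_KEY from the environment

    # Pull failed root runs from the shared project so an agent (or a human)
    # can dig into the failure modes. The project name is a placeholder.
    for run in client.list_runs(
        project_name="deep-agents-evals",
        is_root=True,
        error=True,
    ):
        print(run.name, run.error)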

For example, a large fraction of bug-fix PRs are now driven through Open SWE, the open source background coding agent. Teams using it touch many different codebases with different contexts, conventions, and goals, which naturally leads to mistakes. Every Open SWE interaction is traced, so failures can easily become evals that make sure the same mistake does not happen again.

LangChain groups evals by what they test

It is helpful to have a taxonomy of evals that gives a mid-level view of how agents perform, between a single aggregate number and individual runs.

Tip: Create that taxonomy by looking at what evals test, not where they come from. For example, tasks from FRAMES and BFCL could be tagged "external benchmarks," but that would not show how they measure retrieval and tool use, respectively.

Today, all evals are end-to-end runs of an agent on a task. The team intentionally encourages diversity in eval structure. Some tasks finish in a single step from an input prompt, while others take 10+ turns with another model simulating a user.

How LangChain defines metrics

When choosing a model for an agent, the team starts with correctness. If a model cannot reliably complete the tasks that matter, nothing else matters. Multiple models are run across the evals, and the harness is refined over time to address issues that surface.

Measuring correctness depends on what is being tested. Most internal evals use custom assertions such as "did the agent parallelize tool calls?". External benchmarks like BFCL use exact matching against ground truth answers from the dataset. For evals where correctness is semantic, like whether the agent persisted the correct thing in memory, LLM-as-a-judge is used.
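
Exact matching and custom assertions are plain Python, so the semantic case is the interesting one. A minimal LLM-as-a-judge sketch, assuming LangChain's init_chat_model; the model choice and grading prompt are illustrative, not the team's actual grader:

    from langchain.chat_models import init_chat_model

    # Illustrative judge model; any sufficiently capable model works here.
    judge = init_chat_model("openai:gpt-4o-mini")

    def grade_memory_write(task: str, stored: str, expected: str) -> bool:
        """Return True if the stored note semantically matches the expectation."""
        verdict = judge.invoke(
            "You are grading an agent's memory write.\n"
            f"Task: {task}\nStored note: {stored}\nExpected meaning: {expected}\n"
            "Reply PASS if the stored note captures the expected meaning, else FAIL."
        )
        return "PASS" in verdict.content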

Once several models clear that bar, the team moves to efficiency. Two models that solve the same task can behave very differently in practice. One might take extra turns, make unnecessary tool calls, or move through the task more slowly because of model size. In production, those differences show up as higher latency, higher cost, and a worse overall user experience.

Ideal trajectories as a reference point

To make model comparisons actionable, the team examines how models succeed and fail. That requires a concrete reference point for what "good" execution looks like beyond accuracy. One primitive used is an ideal trajectory: a sequence of steps that produces a correct outcome with no unnecessary actions.

For simple, well-scoped tasks, the variables are defined tightly enough that the optimal path is usually obvious. For more open-ended tasks, a trajectory is approximated using the best-performing model seen so far, then the baseline is revisited as models and harnesses improve.

Consider a simple request: "What is the current time and weather where I live?" An ideal trajectory makes the fewest necessary tool calls (resolve user, resolve location, fetch time and weather), parallelizes independent tool calls where possible, and produces the final answer without unnecessary intermediate turns. An inefficient but correct trajectory might take 6 steps and 5 tool calls versus 4 steps and 4 tool calls, increasing latency and cost while creating more opportunities for failure.
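
One simple way to turn that comparison into numbers is to score each run as overhead relative to the ideal trajectory. The metric definitions below are illustrative, not necessarily the exact formulas the team uses:

    from dataclasses import dataclass

    @dataclass
    class Trajectory:
        steps: int       # model turns taken
        tool_calls: int  # total tool calls issued

    def efficiency(actual: Trajectory, ideal: Trajectory) -> dict:
        """Overhead relative to the ideal trajectory; 1.0 means no wasted work."""
        return {
            "step_overhead": actual.steps / ideal.steps,
            "tool_call_overhead": actual.tool_calls / ideal.tool_calls,
        }

    # The weather example above: a correct-but-wasteful run vs. the ideal path.
    print(efficiency(Trajectory(steps=6, tool_calls=5), Trajectory(steps=4, tool_calls=4)))
    # -> {'step_overhead': 1.5, 'tool_call_overhead': 1.25}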

How LangChain runs evals

The team uses pytest with GitHub Actions to run evals in CI so changes run in a clean, reproducible environment. Each eval creates a Deep Agent instance with a given model, feeds it a task, and computes correctness and efficiency metrics.
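
A single eval in that setup might look like the sketch below, assuming the create_deep_agent entry point from the deepagents package; the task, file contents, and model IDs are placeholders:

    import pytest
    from deepagents import create_deep_agent

    MODELS = ["openai:gpt-4o", "anthropic:claude-sonnet-4-5"]  # placeholder IDs

    @pytest.mark.parametrize("model", MODELS)
    def test_multi_file_retrieval(model):
        """The agent should answer a question whose evidence is split across
        two files in its virtual filesystem."""
        agent = create_deep_agent(
            tools=[],
            instructions="You are a careful code researcher.",
            model=model,
        )
        result = agent.invoke({
            "messages": [{"role": "user", "content": "Which file defines parse_config?"}],
            "files": {
                "src/config.py": "def parse_config(path):\n    return {}\n",
                "src/main.py": "from config import parse_config\n",
            },
        })
        answer = result["messages"][-1].content
        assert "config.py" in answer  # correctness; efficiency checks would go here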

A subset of evals can also be run using tags to save costs and measure targeted experiments. For example, when building an agent that requires a lot of local file processing and synthesis, the focus might be on the file_operations and tool_use tagged subsets.

    export LANGSMITH_API_KEY="lsv2_..."

    uv run pytest tests/evals --eval-category file_operations --eval-category tool_use --model baseten:nvidia/zai-org/GLM-5
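
A flag like --eval-category is straightforward to implement as a pytest plugin. A minimal conftest.py sketch that matches the eval_category marker used earlier; the repo's real implementation may differ:

    import pytest

    def pytest_addoption(parser):
        parser.addoption("--eval-category", action="append", default=[],
                         help="only run evals tagged with these categories")

    def pytest_collection_modifyitems(config, items):
        wanted = set(config.getoption("--eval-category"))
        if not wanted:
            return  # no filter requested; run everything
        skip = pytest.mark.skip(reason="not in the selected eval categories")
        for item in items:
            tags = {m.args[0] for m in item.iter_markers(name="eval_category")}
            if not tags & wanted:
                item.add_marker(skip)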

The eval architecture and implementation are open sourced in the Deep Agents repository.

What's next

The LangChain team is expanding the eval suite and doing more work around open source LLMs. Some things coming soon:

  • How open models stack up against closed frontier models across eval categories
  • Evals as a mechanism to auto-improve agents for tasks in real time
  • How to maintain, reduce, and expand evals per agent over time