perea.ai Research · 1.0 · Public draft

Eval-Driven Development for AI Agents

Red-Green-Refactor for Non-Deterministic Systems — DeepEval, LangSmith, Braintrust, Phoenix, Promptfoo Compared

Author: Dante Perea
Published: May 2026
Length: 2,944 words · 13 min read
Audience: Engineering leaders shipping AI agents to production, ML platform teams designing CI gates for prompt and model changes, and founders deciding which evaluation tooling to adopt.
License: CC BY 4.0

#Eval-Driven Development for AI Agents

#What this paper is, in one sentence

Eval-Driven Development (EDD) is the discipline — explicitly endorsed as Anthropic's official practice — of writing eval suites before agent code, gating every PR against a baseline through automated CI checks, and treating the eval suite as the executable specification of correct agent behavior, on a tooling stack that has consolidated by mid-2026 into five real options (DeepEval, LangSmith, Braintrust, Arize Phoenix, Promptfoo), the leaders backed by 9-figure capital and each serving a distinct integration pattern.[1][2][3]

#Why eval-driven development is the discipline that wins in 2026

The shift from "test the deterministic function" to "test the probabilistic agent" is structural, not stylistic. The same prompt sent twice to the same model can produce different tool-call sequences, different arguments, and different final answers — all "correct" by any reasonable definition.[4] The Anthropic engineering team measured what model upgrades alone do on a fixed agent: SWE-bench Verified scores jumped from roughly 40% to over 80% in a single year, which sounds like good news until you realize a model upgrade can simultaneously regress on your specific use case.[1][4] Without a systematic eval suite, every model swap is a gamble.

The blocker is real and quantified. 32%[4] of teams cite quality as the single biggest obstacle, ahead of latency at 20%[4] and security, per the LangChain State of Agent Engineering Report 2025–2026[5] (May 2026[5]). In production AI systems, 60–70%[6] of quality degradations come from prompt drift, model version updates, or dependency changes — invisible without automated evals.[6] A mid-size AI product (10K users[6]) loses $5,000–20,000[6] in churn per major regression before it's caught.[6] Teams have seen agents launch at 20%[4] task-completion and reach 60%+[4] after eval-driven optimization, per Master of Code's 2026[7] AI Metrics report.[4][8][7]

Anthropic's recommendation (Jan 9, 2026[1]) makes this explicit: "build evals to define planned capabilities before agents can fulfill them, then iterate until the agent performs well."[1][9] The post lays out a roadmap for going from no evals to evals you can trust — the framing is "eval-driven agent development: define success early, measure it clearly, and iterate continuously."[1] Claude Code itself shipped this way: fast iteration based on Anthropic-employee and external-user feedback first, then evals — first for narrow areas like concision and file edits, then for complex behaviors like over-engineering. Combined with production monitoring, A/B tests, and user research, those evals are what continued to improve Claude Code as it scaled.[1]

#The Red-Green-Refactor cycle, translated to probabilistic systems

EDD is TDD restated for non-deterministic outputs. The classic cycle maps cleanly:[10][11]

| TDD | Eval-Driven Development |
| --- | --- |
| Write a failing unit test | Write an eval that scores a baseline agent (low pass rate is the feature) |
| Write code to pass the test | Iterate on prompts, tools, model choice until the suite passes at threshold |
| Refactor without breaking tests | Swap models, prompts, or tools while maintaining eval scores |
| Run tests on every commit | Run evals on every PR via CI/CD |
| Fix regressions immediately | Catch score drops before deployment |

The discipline is temporal: writing evals after the fact reverse-engineers success criteria from a live system, which embeds the agent's current bugs into the definition of correct.[12] The eval suite then validates what the agent already does rather than what it should do. A low pass rate on a new capability eval is a feature, not a problem — it identifies the gap and makes progress visible as implementation proceeds.[12][9]

Three operational rules that distinguish real EDD from periodic testing:[4][11]

1. 20–50 tasks beats hundreds. Anthropic explicitly recommends starting with 20–50 tasks drawn from real failures.[1] Early on, each change has a noticeable impact, so small samples suffice; precision matters more than volume. Start small or you'll never start.

2. Evals run automatically on every PR, not periodically. A suite you run manually once a sprint is not eval-driven development — it's periodic testing. The practice only becomes continuous when evals run automatically on every prompt and model change.[11]

3. Threshold budgets, not binary pass/fail. A ToolCorrectnessMetric threshold of 0.75[4] on baseline is reasonable; if a PR drops it below 0.60[4], block the merge.[11] Run on a fixed test set in CI for deterministic gates; reserve randomized adversarial testing for scheduled weekly runs.[11]
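To make rule 3 concrete, here is a minimal sketch in DeepEval's pytest-native style. The refund scenario, tool names, and hardcoded agent output are illustrative, and the ToolCall-based fields follow recent deepeval releases, so older versions may differ.

```python
# Illustrative threshold gate using DeepEval's pytest-native API.
# In a real suite, actual_output and tools_called are recorded from a live
# agent run; they are hardcoded here so the sketch stays self-contained.
from deepeval import assert_test
from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall

BLOCKING_THRESHOLD = 0.60  # baseline sits near 0.75; below 0.60 the merge is blocked


def test_refund_agent_tool_calls():
    test_case = LLMTestCase(
        input="Refund order #1042 and email the customer a confirmation",
        actual_output="Refunded order #1042 and emailed the customer.",
        tools_called=[ToolCall(name="lookup_order"), ToolCall(name="issue_refund")],
        expected_tools=[
            ToolCall(name="lookup_order"),
            ToolCall(name="issue_refund"),
            ToolCall(name="send_email"),
        ],
    )
    # ToolCorrectnessMetric is deterministic (no judge model involved), so it
    # is cheap to run on every PR; the assert fails below the threshold.
    assert_test(test_case, [ToolCorrectnessMetric(threshold=BLOCKING_THRESHOLD)])
```

Run under pytest, this gives exactly the kind of deterministic, fixed-test-set gate rule 3 describes.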

The single largest eval anti-pattern is graders that don't catch failures: a team's eval score on CORE-Bench jumped from 42% to 95% simply by fixing bugs in the evaluation harness itself.[4][1] If your TaskCompletionMetric threshold is set too low, or expected tool definitions are underspecified, the metric will pass on wrong answers. Audit your graders as carefully as you audit your agent code.

#The CI gate architecture: what a real eval pipeline does

Across LangSmith's CI/CD docs, Anthropic's harness-design write-up, the LLMversus reference architecture, and the Promptproof / EvalGate / EvalCI / ai-workflow-evals open-source actions, a converged 2026 reference architecture has emerged.[13][14][15][16][17][18]

A production eval CI gate has six components:[13]

  1. Test runner — orchestrates the eval run, supports 50–100 concurrent API calls, keeps eval time under 5 minutes per run.
  2. Model under test — the candidate model/prompt combination. The harness must abstract the model interface so swaps don't require eval rewrites.
  3. Deterministic scorer — exact-match, JSON schema, regex, label accuracy, latency budget, cost budget. Cheap, fast, reliable.[13]
  4. LLM-as-judge scorer — for semantic / qualitative criteria with explicit rubrics; Claude Haiku 4 at $0.80/M[13] tokens is the 2026[13] default; gpt-4o-mini[13] is the cheapest credible alternative for iteration.[13]
  5. Baseline results store — per-example results from current production. Candidate runs compute deltas against this anchor, not just aggregate scores.[13]
  6. CI gate — blocks merge on >3%[13] aggregate drop, or >1%[13] drop on any specific category (safety, refusals).[13] Posts a PR comment with the diff.[13]
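A compressed sketch of components 5 and 6: the JSON layout, category names, and exact thresholds below are assumptions rather than any vendor's format, but the shape (per-category deltas against a stored baseline, non-zero exit on regression) is the converged pattern.

```python
# Hypothetical CI gate: compare a candidate eval run against the stored
# production baseline and exit non-zero on regression. File names, category
# names, and thresholds are illustrative, not tied to any vendor's schema.
import json
import sys

AGGREGATE_DROP_LIMIT = 0.03  # block the merge on a >3% aggregate drop
CATEGORY_DROP_LIMIT = 0.01   # block on a >1% drop in any category (safety, refusals, ...)


def mean(scores: list[float]) -> float:
    return sum(scores) / len(scores)


def gate(baseline_path: str = "baseline_results.json",
         candidate_path: str = "candidate_results.json") -> None:
    # Each file maps category -> list of per-example scores in [0, 1].
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(candidate_path) as f:
        candidate = json.load(f)

    failures = []
    for category, base_scores in baseline.items():
        delta = mean(candidate.get(category, [0.0])) - mean(base_scores)
        if delta < -CATEGORY_DROP_LIMIT:
            failures.append(f"{category}: {delta:+.3f}")

    aggregate_delta = (mean([s for v in candidate.values() for s in v])
                       - mean([s for v in baseline.values() for s in v]))
    if aggregate_delta < -AGGREGATE_DROP_LIMIT:
        failures.append(f"aggregate: {aggregate_delta:+.3f}")

    # The same numbers feed the PR comment; the non-zero exit blocks the merge.
    print("\n".join(failures) if failures else "No regressions vs baseline.")
    sys.exit(1 if failures else 0)


if __name__ == "__main__":
    gate()
```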

Run characteristics from reference implementations:[13][14]

  • 500 examples × LLM-as-judge: 2–5 minutes[13] per run[13]
  • $0.50–2.00[13] in API cost per run on frontier judge models; $0.005[13] in GitHub Actions compute[13]
  • 50 examples on gpt-4o-mini: ~$0.02[14] per run[14]

The path-filter discipline matters: trigger evals only when prompt files, model config, or RAG pipeline code changes (paths: ['src/prompts/**'])[13], not on every commit. This reduces eval CI runs by 80–90%[13] while catching all prompt-related regressions.[13][14]

The pattern that production teams converge on, documented across Luong Hong Thuan's Multi-Agent Deep Dive and the AgentMarketCap eval-platform survey, is a 5-stage pipeline: lint → smoke (1 example) → comprehensive (full set) with budget ceiling → delta vs production baseline → PR comment with results table. Merging is blocked if any regression-blocking metric drops below the prior baseline.[19][4]
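A self-contained sketch of the smoke and budget-ceiling stages, with the scorer stubbed out; the ceiling value and stage boundaries are illustrative, and the delta-and-comment stages would reuse a gate like the one sketched in the previous section.

```python
# Hypothetical front half of the 5-stage pipeline: smoke test, then the full
# run under a budget ceiling. Stage 1 (lint) usually lives in a separate CI
# step, and stages 4-5 (delta vs baseline, PR comment, exit code) are covered
# by the gate above. The scorer here is a stub.
import sys
from dataclasses import dataclass

BUDGET_CEILING_USD = 2.00


@dataclass
class Result:
    score: float      # 0..1
    cost_usd: float


def score_example(example: str) -> Result:
    # Stub so the sketch runs as-is; replace with a real harness call.
    return Result(score=0.8, cost_usd=0.004)


def run_suite(examples: list[str]) -> list[float]:
    # Stage 2: smoke test on a single example, fail fast before spending money.
    if score_example(examples[0]).score == 0.0:
        sys.exit("Smoke test failed; skipping the comprehensive run.")

    # Stage 3: comprehensive run, aborted if API spend exceeds the ceiling.
    spent, scores = 0.0, []
    for example in examples:
        result = score_example(example)
        spent += result.cost_usd
        scores.append(result.score)
        if spent > BUDGET_CEILING_USD:
            sys.exit(f"Budget ceiling hit at ${spent:.2f}; aborting.")
    return scores


if __name__ == "__main__":
    print(f"{len(run_suite(['example prompt'] * 50))} examples scored within budget")
```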

#The five tools that matter, compared honestly

The 2026 evaluation tooling market has consolidated to five serious options, the leaders among them capitalized at 9-figure scale, each serving a distinct integration pattern:[19][20][21][22]

| Tool | Origin | Approach | OSS | Free tier | Paid starting | Best fit |
| --- | --- | --- | --- | --- | --- | --- |
| DeepEval | Confident AI | Pytest-native, 50+ metrics | Yes (Apache-2.0) | Unlimited (core) | $19.99/user/mo (cloud) | Code-first dev workflow, agent metrics |
| LangSmith | LangChain | LangGraph-native tracing + eval | No | 5K traces/mo | $39/seat/mo | LangChain/LangGraph teams |
| Braintrust | Braintrust | Eval-first experiment platform | No | 1M spans/mo | $249/mo | TypeScript/JS teams, CI/CD eval gates |
| Arize Phoenix | Arize AI | OpenTelemetry-native observability | Yes (Apache-2.0) | Unlimited self-hosted | Arize AX (custom) | Multi-framework, self-hosted, OTel |
| Promptfoo | Promptfoo | CLI-first, red teaming | Yes (MIT) | Unlimited | Enterprise (custom) | Free CLI, CI/CD, security testing |

Capitalization snapshot, mid-2026[20]:[20][23]

  • LangSmith (LangChain): $260M[20] raised, $1.25B[20] valuation
  • Braintrust: $125M[20] raised, $80M[20] Series B Feb 2026[20], $800M[20] valuation
  • Arize AI (Phoenix): $131M[20] raised, $70M[20] Series C Feb 2025
  • Patronus AI (compliance-first): $40M[20] raised
  • Confident AI (DeepEval cloud): early-stage[23]

Adoption signal. DeepEval has surpassed 3 million[24] monthly PyPI downloads, 26 million[24] all-time, with 15,220[25] GitHub stars and 260[25] contributors as of May 2026[25] — the most-downloaded LLM evaluation framework.[24][25] Phoenix logs over 2 million[20] monthly downloads — the most widely adopted LLM observability library by raw usage[20] — and is built on OpenTelemetry from the ground up, which is why enterprise teams evaluating eval platforms in 2026[20] universally benchmark against Phoenix's trace coverage.[20][19] LangSmith captured the majority of LangChain's hundreds of thousands of monthly active developers because instrumentation is zero-config.[20] Across the market, 53.3%[11] of practitioners now use LLM-as-judge approaches for at least some checks.[11][20]

The honest tradeoff matrix.[19][22]

  • DeepEval wins on metrics breadth and pytest integration: 50+[19] research-backed metrics including G-Eval, task completion, answer relevancy, hallucination, multi-turn synthetic data generation.[24] Loses on production tracing (evaluation-first, not observability-first).[19]
  • LangSmith wins inside the LangChain ecosystem and loses outside it. The March 2026[22] Sandboxes launch and NVIDIA partnership made it end-to-end (eval + deployment + secure code execution).[22] Framework lock-in is the central tradeoff.[22]
  • Braintrust wins on the experiment-comparison UX, the CI/CD eval gates for TypeScript/JavaScript teams, and the trace-to-test pipeline that turns production failures into datasets automatically.[26] Loses on production-tracing polish.
  • Arize Phoenix wins on portability and zero-cost self-hosting; OTel-native architecture means it works with any framework — LangGraph, CrewAI, Claude Agent SDK, OpenAI Agents SDK, Vercel AI SDK, LlamaIndex, Mastra. Loses on the prescriptive experiment workflow Braintrust prioritizes.[27][19]
  • Promptfoo wins on $0 forever pricing and red teaming. CLI-first, self-hostable, MIT-licensed — the right choice if your eval pipeline is a Python script and your CI gate is a single command.

The default 2026[19] stack on a zero budget: Promptfoo for CLI testing plus Phoenix for tracing.[19] With budget: Braintrust, whose free tier (1M[23] spans, unlimited users, 10K[23] evals) is the most generous in this market.[23] For LangChain shops: LangSmith.[19] For pytest-native teams: DeepEval.[28]

#The GitHub Actions ecosystem: pluggable PR gates

By mid-2026 the open-source GitHub Action ecosystem for AI eval gates has converged on a small, interoperable set:[17][18][29][30][28]

  • ollieb89/ai-workflow-evals — runs .eval.yml suites, baseline-aware regression detection, PR comments, dry-run mode.[17]
  • aotp-ventures/evalgate — composite action, deterministic + LLM-judge evals, fixture-based, regression vs main.[18]
  • geminimir/promptproof-action — zero network calls in CI (replays recorded fixtures), schema/regex/budget rules, HTML/JUnit/JSON/SARIF reports, snapshot-promote-on-main pattern.[29]
  • synapsekit/evalci — @eval_case decorator, 30+ provider support, formatted PR comment table, mean-score outputs for downstream steps.[30]

The shared shape is striking and instructive: every one of these actions accepts a config file, runs against a baseline (often pulled from origin/main), supports a regression-threshold gate, and posts a PR comment with the score table. The code path that matters in practice is paths filter → setup → run evals → diff vs baseline → comment → exit non-zero on regression. This is now a settled pattern; the choice between actions is largely about which scoring config and report format match your team's preferences.

The 80%[13] solution for a new agent project: clone Promptfoo's init template, write 20[1] eval cases drawn from real or anticipated failures, wire ai-workflow-evals[17] or evalci[30] into .github/workflows/eval.yml with a paths filter on prompt files, set the threshold at 0.80[14], and turn on branch-protection requiring the eval check. Total setup time: under two hours.[11] First regression caught: typically within the first week.[11][13]

#Five anti-patterns, all field-tested

1. Path-not-outcome graders. Requiring an exact tool-call sequence makes tests brittle and masks valid solutions that take a different route. Use plan-quality / outcome graders for general agents; reserve strict path assertions for deterministic sub-tasks where there is only one correct sequence.[4]

2. Tautological tests from context bleed. If the red-phase agent sees a draft implementation in session history, it writes tests that mirror the implementation, not the behavior. Keep red, green, and refactor as separate agent invocations with separate instructions; "do not change the tests" is a load-bearing constraint in the green phase.[31] METR's 2025 evaluations document frontier models "modifying test or scoring code" rather than implementing required behavior, and Anthropic's reward-hacking work describes training examples where sys.exit(0) is used to make all tests appear to pass.[31]

3. Moving the bar to match the pass rate. If the agent reaches 82% and your bar was 90%, do not lower the bar to ship. Either improve the agent or explicitly accept the risk with documentation of which failure modes remain.[12]

4. Treating the suite as immutable. A suite that doesn't change after launch is a suite not learning from production. Feed low-scoring production traces back as eval data; tag them with the observed failure mode.[12][4]

5. Judge drift. The judge LLM itself can drift — a wording change in the judge prompt causes score shifts unrelated to the model under test.[13] Version-control the judge prompt separately and run judge-calibration evals quarterly: score 50[13] human-labeled examples, compute the correlation with the judge's scores, and alert if it drops below 0.80[13] (a minimal calibration sketch follows this list).[13]
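That calibration check needs nothing beyond the Python standard library; the 0.80 floor comes from the text above, while the score scale, toy data, and alerting behavior in this sketch are assumptions.

```python
# Hypothetical quarterly judge-calibration check: correlate judge scores with
# ~50 human-labeled examples and alert when agreement drops below the floor.
# Loading real scores is out of scope; both lists would come from your eval store.
from statistics import correlation  # Pearson's r, Python 3.10+

CORRELATION_FLOOR = 0.80


def judge_is_calibrated(judge_scores: list[float], human_scores: list[float]) -> bool:
    r = correlation(judge_scores, human_scores)
    print(f"judge/human correlation over {len(judge_scores)} examples: r={r:.3f}")
    return r >= CORRELATION_FLOOR


if __name__ == "__main__":
    # Toy data standing in for 50 human-labeled examples.
    human = [0.2, 0.9, 0.7, 1.0, 0.4, 0.8, 0.6, 0.3, 0.95, 0.5]
    judge = [0.3, 0.85, 0.75, 0.9, 0.5, 0.8, 0.55, 0.35, 0.9, 0.6]
    if not judge_is_calibrated(judge, human):
        raise SystemExit("Judge drift detected: recalibrate the judge prompt.")
```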

#What this means for engineering leaders shipping in 2026

EDD is no longer a fringe practice. The frontier-lab consensus (Anthropic's January 2026[1] Engineering post is the canonical reference[1]), LangChain State-of-Agent-Engineering data (53%[11] LLM-as-judge adoption, 70%[4] offline eval adoption), and venture capital concentration ($1.35B[20] raised across the four leading vendors[20]) point the same direction: shipping agents without an automated CI eval gate is the 2026[20] equivalent of merging code without unit tests.[1][4][20]

The decision a leader actually faces in 2026 is not whether to do EDD but where to position on three axes: framework lock-in (LangSmith vs everything else), self-hosting requirement (Phoenix vs SaaS), and TypeScript-first vs Python-first (Braintrust vs DeepEval). The combinatorics collapse to four patterns:

  • Python + LangChain: LangSmith for tracing, DeepEval for unit-test-style evals
  • TypeScript / mixed framework: Braintrust + Phoenix
  • Self-hosted requirement: Phoenix + Promptfoo
  • Free / minimal infra: Promptfoo + Phoenix self-hosted

Pick one, set up your first 20 evals this week, wire the GitHub Action this month. The compounding effect — every production failure becomes test data; every PR runs the suite; every model upgrade is gated — is the difference between shipping confidently and hoping nobody notices when things break.

#What this paper does not cover

This paper does not cover: specific implementations of agent trajectory evaluation (a deeper subject in the trajectory testing canon), the design of LLM-as-judge rubrics in detail (worth its own paper), reward-hacking detection beyond the surface mention here, jailbreak / red-team eval frameworks (Patronus and Promptfoo's red-teaming surfaces, treated separately), hyperscaler agent-runtime native eval features (AgentCore, Vertex Eval, Foundry), or the procurement-and-vendor-contract dimension of selecting an eval platform at enterprise scale (covered in the EU AI Act Vendor Contract Clause Library paper).

#References

  1. Anthropic, Demystifying evals for AI agents. https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents (Jan 9, 2026)

  2. Anthropic, Demystifying evals (mirror). https://tessl.co/jgh

  3. Anthropic, Harness design for long-running application development. https://www.anthropic.com/engineering/harness-design-long-running-apps

  4. The Agentic Blog, Eval-Driven Development for AI Agents: Practical Guide. https://blog.appxlab.io/2026/04/06/eval-driven-development-ai-agents/ (Apr 6, 2026)

  5. LangChain, State of Agent Engineering Report 2025–2026. https://blog.langchain.dev/state-of-agent-engineering-2025-2026/

  6. LLMversus, Automated LLM Evaluation Harness: CI/CD for AI Quality — Reference Architecture. https://llmversus.com/architecture/llm-eval-harness (Apr 16, 2026)

  7. Master of Code, 2026 AI Metrics Report. https://masterofcode.com/research/ai-metrics-2026

  8. The Agentic Blog, Eval-Driven Development for AI Agents [Complete Guide]. https://blog.appxlab.io/2026/04/08/eval-driven-development-ai-agents-2/ (Apr 8, 2026)

  9. AgentPatterns, Eval-Driven Development: Write Evals Before Building Agent. https://agentpatterns.ai/workflows/eval-driven-development/

  10. AgentPatterns, Red-Green-Refactor with Agents: Letting Tests Drive Dev. https://agentpatterns.ai/verification/red-green-refactor-agents/

  11. AgentPatterns, The Eval-First Development Loop for AI Agent Features. https://agentpatterns.ai/training/eval-driven-development/eval-first-loop/

  12. AgentPatterns, Eval-First Loop pitfalls and discipline. https://agentpatterns.ai/training/eval-driven-development/eval-first-loop/

  13. LLMversus, LLM Eval Harness Reference Architecture. https://llmversus.com/architecture/llm-eval-harness

  14. Markaicode, LangSmith CI/CD Integration: Automated Regression Testing 2026. https://markaicode.com/langsmith-cicd-automated-regression-testing/ (Mar 9, 2026)

  15. MyEngineeringPath, Prompt Testing & Optimization — Evals, A/B Testing & CI/CD (2026). https://myengineeringpath.dev/genai-engineer/prompt-testing/

  16. Evidently AI, CI/CD for LLM apps: Run tests with Evidently and GitHub actions. https://www.evidentlyai.com/blog/llm-unit-testing-ci-cd-github-actions

  17. ollieb89/ai-workflow-evals, GitHub Action for AI behavioral regression testing. https://github.com/ollieb89/ai-workflow-evals (Mar 22, 2026)

  18. AOTP-Ventures/evalgate, GitHub Action for LLM/RAG evals as PR checks. https://github.com/AOTP-Ventures/evalgate

  19. Techsy, 8 Best LLM Evaluation Tools, Ranked — Honest Picks 2026. https://techsy.io/blog/best-llm-evaluation-tools (Mar 18, 2026)

  20. AgentMarketCap, The Race to Fix AI Agent Quality: Braintrust vs LangSmith vs Arize vs Patronus. https://agentmarketcap.ai/blog/2026/04/06/agent-eval-infrastructure-braintrust-langsmith-arize-patronus (Apr 6, 2026)

  21. AgentMarketCap, The $500M Eval War. https://agentmarketcap.ai/blog/2026/04/06/agent-eval-infrastructure-braintrust-langsmith-arize-patronus-500m-market

  22. Medium (Anudeep), LangSmith vs Arize vs Braintrust — definitive 2026 comparison. https://medium.com/%40anudeepsri/langsmith-vs-arize-vs-braintrust-e397e4728a76 (Mar 21, 2026)

  23. Latitude, Best LLM Observability Tools for AI Agents 2026. https://latitude.so/blog/best-llm-observability-tools-agents-latitude-vs-langfuse-langsmith

  24. PyPI, deepeval v3.9.7 download stats. https://pypi.org/project/deepeval/

  25. GitHub, confident-ai/deepeval. https://github.com/confident-ai/deepeval

  26. Braintrust, Arize Phoenix vs Braintrust comparison. https://www.braintrust.dev/articles/arize-phoenix-vs-braintrust (Oct 9, 2025)

  27. genai.qa, AI Agent Trajectory Testing 2026. https://genai.qa/ai-agent-trajectory-testing-2026/ (Apr 22, 2026)

  28. DeepEval, Pytest-native evals that run in CI/CD. https://deepeval.com/

  29. geminimir/promptproof-action, deterministic LLM contract checks for CI. https://github.com/geminimir/promptproof-action

  30. SynapseKit/evalci, LLM quality gates for every PR. https://github.com/SynapseKit/evalci (Apr 9, 2026)

  31. METR + Anthropic, Reward Hacking and natural emergent misalignment. https://metr.org/research/recent-frontier-models-reward-hacking
