Agentic workflows
Use eqho-eval as a tool inside AI coding agents like Claude Code, Cursor, Windsurf, or Aider. Describe what to test in natural language and let the agent iterate on your evaluators.
Why agentic eval development works
Traditional eval writing is tedious — you manually write YAML test cases, run them, read results, adjust, repeat. With an AI coding agent:
- Describe what you want to test in plain English
- The agent reads your config, writes test cases (see the sketch below), and runs evals
- Review results together, ask the agent to fix failures
- Iterate until your pass rate is where you want it
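For a sense of what the agent produces, here is the kind of test case it might add to promptfooconfig.yaml. This is a minimal sketch using standard promptfoo test syntax; the input variable name and the expected value are illustrative, assuming the prompt template exposes an input variable and an agent named Sophia (as in the session below).
tests:
  - description: "identity: agent states its name when asked"
    vars:
      input: "Who am I speaking with?"
    assert:
      # Plain substring check on the model's response
      - type: contains
        value: "Sophia"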
Since eqho-eval handles all API routing through the proxy, the coding agent doesn't need access to any API keys.
Claude Code
Claude Code works especially well because it can read files, run commands, and iterate based on results — all in one session.
Getting started
# Scaffold your eval project first
eqho-eval init --campaign <id>
cd my-eval
# Start a Claude Code session in the project directory
claude
Example session
You: Read the promptfooconfig.yaml and the system prompt in prompts/.
Then write 10 test cases covering identity, tool usage, and security.
Run the eval and analyze the results.
Claude: I'll start by reading the config and prompts...
[reads promptfooconfig.yaml, prompts/agent.json, tools/agent.json]
I can see this is an appointment-setting agent named Sophia.
Let me write test cases...
[edits promptfooconfig.yaml with 10 new test cases]
Running the eval now...
[runs: eqho-eval eval --no-cache]
Results: 7/10 passed. 3 failures:
- "prompt injection" failed on GPT-4.1-mini (revealed system prompt)
- "wrong number" failed on o4-mini (didn't handle gracefully)
- "appointment scheduling" failed on all (missing tool call assertion)
Let me fix the failing tests and adjust the assertions...
[edits config, re-runs eval]
After adjustments: 9/10 passing. The remaining failure is a
genuine agent issue — the system prompt should include...
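The missing tool call assertion mentioned in the session could be added like this. A sketch only: is-valid-openai-tools-call is a standard promptfoo assertion type, and the test wording is illustrative.
  - description: "appointment scheduling triggers a tool call"
    vars:
      input: "Can you book me for Tuesday at 2pm?"
    assert:
      # Fails unless the model emits a well-formed tool call
      - type: is-valid-openai-tools-call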
Useful commands for Claude Code
# Let Claude read the assembled prompt
eqho-eval render
# Run evals without cache (always fresh)
eqho-eval eval --no-cache
# Show detailed per-assertion results
eqho-eval results --verbose
# View results in browser
eqho-eval view
Tips
- Ask Claude to read prompts/*.json and tools/*.json to understand the agent before writing tests
- Use eqho-eval render output to let Claude analyze coverage gaps
- Paste failure reasons from eqho-eval results --verbose and ask Claude to diagnose
- Ask Claude to generate edge-case tests: non-English callers, emotional tones, rapid topic changes (see the sketch below)
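One of those edge-case tests might look like this. A sketch: llm-rubric is promptfoo's model-graded assertion type, and the rubric wording is illustrative.
  - description: "non-English caller"
    vars:
      input: "Hola, quisiera cambiar mi cita del jueves."
    assert:
      # Model-graded check instead of a brittle substring match
      - type: llm-rubric
        value: Responds helpfully in the caller's language and stays in character as Sophia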
Cursor
In Cursor's terminal or agent mode, the workflow is similar.
Setup
# Scaffold from Cursor's terminal
eqho-eval init --campaign <id>
# Open the project in Cursor
# The generated files are ready to edit
Agent mode tips
- Open promptfooconfig.yaml and ask the agent to add tests for specific scenarios
- After running evals, ask the agent to analyze output/eval-results.json and suggest improvements
- Use eqho-eval postcall-eval and eqho-eval action-eval to generate specialized configs, then ask the agent to refine them (see the sketch below)
- The agent can run eqho-eval doctor to diagnose any environment issues
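A typical sequence for those specialized configs might look like this. A sketch: both subcommands are shown without flags, which assumes their defaults suffice; check eqho-eval --help for the exact options.
# Generate a specialized post-call eval config
eqho-eval postcall-eval
# Let the agent refine the generated config, then run it
eqho-eval eval --no-cache
# If anything misbehaves, have the agent diagnose the environment
eqho-eval doctor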
Other AI coding tools
The same workflow applies to Windsurf, Aider, Cline, or any tool with terminal access:
- Scaffold — eqho-eval init --campaign <id>
- Edit — modify promptfooconfig.yaml tests (the assistant can help)
- Run — eqho-eval eval executes through the proxy
- Analyze — results in output/eval-results.json
- Iterate — refine tests based on results
No API key configuration needed on the developer's machine.
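Concretely, one full iteration uses only the commands already shown above (the campaign id is a placeholder):
# Scaffold, run, and inspect; every call goes through the proxy
eqho-eval init --campaign <id>
# (the assistant edits promptfooconfig.yaml between steps)
eqho-eval eval
eqho-eval results --verbose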
Multi-model agentic workflow
A powerful pattern: ask your coding agent to test across many models and analyze failure patterns.
You: Run the eval across GPT-4.1, Claude Sonnet, Gemini 2.5, Grok 3,
DeepSeek V3, Llama 4, and Mistral Large.
Analyze which models fail on which test categories.
Identify patterns and suggest prompt improvements.
Claude: Adding all 7 providers to the config...
[edits promptfooconfig.yaml providers section]
Running eval across all models (168 total tests)...
[runs: eqho-eval eval --wide --no-cache]
Results by category:
- Identity tests: 100% pass on all models
- Tool calls: GPT-4.1 and Claude pass, others miss arguments
- Security: DeepSeek failed 2 prompt injection tests
- Edge cases: Llama struggled with non-English inputs
Recommended prompt changes:
1. Add explicit tool argument examples to reduce errors
2. Strengthen injection resistance with...
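The providers edit in a session like this might look as follows. A sketch: the entries follow promptfoo's provider:model convention, but the exact identifiers available depend on how the eqho-eval proxy exposes each vendor.
providers:
  # One entry per model; the IDs here are illustrative placeholders
  - openai:gpt-4.1
  - anthropic:claude-sonnet-4
  - google:gemini-2.5-pro
  # ...plus entries for Grok 3, DeepSeek V3, Llama 4, and Mistral Large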