Agentic workflows

Use eqho-eval as a tool inside AI coding agents like Claude Code, Cursor, Windsurf, or Aider. Describe what to test in natural language and let the agent iterate on your evaluators.

Why agentic eval development works

Traditional eval writing is tedious — you manually write YAML test cases, run them, read results, adjust, repeat. With an AI coding agent:

  1. Describe what you want to test in plain English
  2. The agent reads your config, writes test cases, runs evals
  3. Review results together, ask the agent to fix failures
  4. Iterate until your pass rate is where you want it

Since eqho-eval handles all API routing through the proxy, the coding agent doesn't need access to any API keys.
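The test cases the agent writes in step 2 are ordinary promptfoo-style entries in promptfooconfig.yaml. A minimal sketch of one identity test — the variable name `user_message` and the exact assertion value are illustrative assumptions, not part of the scaffolded config:

```yaml
# Hypothetical test-case excerpt; adapt variable names to your prompt template
tests:
  - description: "Identity: agent states its name when asked"
    vars:
      user_message: "Who am I speaking with?"
    assert:
      - type: contains
        value: "Sophia"
```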


Claude Code

Claude Code works especially well because it can read files, run commands, and iterate based on results — all in one session.

Getting started

# Scaffold your eval project first
eqho-eval init --campaign <id>
cd my-eval

# Start a Claude Code session in the project directory
claude

Example session

You: Read the promptfooconfig.yaml and the system prompt in prompts/.
     Then write 10 test cases covering identity, tool usage, and security.
     Run the eval and analyze the results.

Claude: I'll start by reading the config and prompts...
        [reads promptfooconfig.yaml, prompts/agent.json, tools/agent.json]

        I can see this is an appointment-setting agent named Sophia.
        Let me write test cases...
        [edits promptfooconfig.yaml with 10 new test cases]

        Running the eval now...
        [runs: eqho-eval eval --no-cache]

        Results: 7/10 passed. 3 failures:
        - "prompt injection" failed on GPT-4.1-mini (revealed system prompt)
        - "wrong number" failed on o4-mini (didn't handle gracefully)
        - "appointment scheduling" failed on all (missing tool call assertion)

        Let me fix the failing tests and adjust the assertions...
        [edits config, re-runs eval]

        After adjustments: 9/10 passing. The remaining failure is a
        genuine agent issue — the system prompt should include...

Useful commands for Claude Code

# Let Claude read the assembled prompt
eqho-eval render

# Run evals without cache (always fresh)
eqho-eval eval --no-cache

# Show detailed per-assertion results
eqho-eval results --verbose

# View results in browser
eqho-eval view

Tips

  • Ask Claude to read prompts/*.json and tools/*.json to understand the agent before writing tests
  • Use eqho-eval render output to let Claude analyze coverage gaps
  • Paste failure reasons from eqho-eval results --verbose and ask Claude to diagnose
  • Ask Claude to generate edge-case tests: non-English callers, emotional tones, rapid topic changes
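An edge-case test of the kind suggested in the last tip could be sketched as follows; the rubric wording and variable name are assumptions to adapt to your agent:

```yaml
# Hypothetical edge-case entry — a non-English caller
- description: "Edge case: Spanish-speaking caller rescheduling"
  vars:
    user_message: "Hola, quisiera cambiar mi cita, por favor."
  assert:
    - type: llm-rubric
      value: "Responds helpfully, either in Spanish or by politely continuing in English"
```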

Cursor

In Cursor's terminal or agent mode, the workflow is similar.

Setup

# Scaffold from Cursor's terminal
eqho-eval init --campaign <id>

# Open the project in Cursor
# The generated files are ready to edit

Agent mode tips

  • Open promptfooconfig.yaml and ask the agent to add tests for specific scenarios
  • After running evals, ask the agent to analyze output/eval-results.json and suggest improvements
  • Use eqho-eval postcall-eval and eqho-eval action-eval to generate specialized configs, then ask the agent to refine them
  • The agent can run eqho-eval doctor to diagnose any environment issues
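When asking the agent to analyze output/eval-results.json, a couple of jq one-liners are a quick starting point. This sketch assumes a promptfoo-style results schema; the field names may differ in your version, so verify against an actual results file first:

```shell
# Overall pass/fail counts (assumes promptfoo-style stats fields)
jq '.results.stats' output/eval-results.json

# Descriptions of failing tests only (field names are an assumption)
jq '.results.results[] | select(.success == false) | .testCase.description' \
  output/eval-results.json
```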

Other AI coding tools

The same workflow applies to Windsurf, Aider, Cline, or any tool with terminal access:

  1. Scaffold — eqho-eval init --campaign <id>
  2. Edit — modify promptfooconfig.yaml tests (the assistant can help)
  3. Run — eqho-eval eval executes through the proxy
  4. Analyze — results in output/eval-results.json
  5. Iterate — refine tests based on results

No API key configuration needed on the developer's machine.
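Condensed into a terminal session, one pass through that loop looks like this (all commands appear elsewhere in this guide; `my-eval` is the scaffolded directory name from the Claude Code example):

```shell
eqho-eval init --campaign <id>   # 1. scaffold
cd my-eval
# 2. edit promptfooconfig.yaml (or let the assistant do it)
eqho-eval eval --no-cache        # 3. run through the proxy
eqho-eval results --verbose      # 4. analyze per-assertion results
# 5. iterate: adjust tests and re-run
```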


Multi-model agentic workflow

A powerful pattern: ask your coding agent to test across many models and analyze failure patterns.

You: Run the eval across GPT-4.1, Claude Sonnet, Gemini 2.5, Grok 3,
     DeepSeek V3, Llama 4, and Mistral Large.
     Analyze which models fail on which test categories.
     Identify patterns and suggest prompt improvements.

Claude: Adding all 7 providers to the config...
        [edits promptfooconfig.yaml providers section]

        Running eval across all models (168 total tests)...
        [runs: eqho-eval eval --wide --no-cache]

        Results by category:
        - Identity tests: 100% pass on all models
        - Tool calls: GPT-4.1 and Claude pass, others miss arguments
        - Security: DeepSeek failed 2 prompt injection tests
        - Edge cases: Llama struggled with non-English inputs

        Recommended prompt changes:
        1. Add explicit tool argument examples to reduce errors
        2. Strengthen injection resistance with...
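The providers section the agent edits in this workflow might look like the sketch below. The provider ID format is an assumption here — since all routing goes through the eqho-eval proxy, check which identifiers your proxy actually accepts before copying these:

```yaml
# Hypothetical provider IDs — verify the exact names your proxy expects
providers:
  - openai:gpt-4.1
  - anthropic:claude-sonnet
  - google:gemini-2.5
  - xai:grok-3
  - deepseek:deepseek-v3
  - meta:llama-4
  - mistral:mistral-large
```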