Multi-model comparison
Test your agent's behavior across multiple LLMs in a single evaluation run. Since all models route through the proxy, switching providers is as simple as changing a string.
Setup
The default generated config already includes multiple providers. Add or remove as needed:
providers:
  - id: openai:chat:gpt-4.1-mini
    label: GPT-4.1-mini
    config:
      temperature: 0.7
      tools: file://tools/agent.json
  - id: openai:chat:gpt-4.1
    label: GPT-4.1
  - id: openai:chat:anthropic/claude-sonnet-4-20250514
    label: Claude Sonnet
  - id: openai:chat:google/gemini-2.5-pro
    label: Gemini 2.5
  - id: openai:chat:xai/grok-3
    label: Grok 3
  - id: openai:chat:deepseek/deepseek-v3
    label: DeepSeek V3
  - id: openai:chat:meta/llama-4-maverick
    label: Llama 4
All providers inherit apiBaseUrl and apiKey from .env. You only need to specify them once in the first provider; the rest inherit them. Run eqho-eval providers list to discover available models.
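If the credentials ever need to be written out explicitly, the sketch below shows one plausible placement. It assumes eqho-eval passes provider settings through to promptfoo's OpenAI-compatible provider, where apiBaseUrl and apiKey sit under config. The URL and key are placeholders, not real values.

```yaml
providers:
  - id: openai:chat:gpt-4.1-mini
    label: GPT-4.1-mini
    config:
      # Assumed placement; placeholder values, replace with your proxy settings.
      apiBaseUrl: https://proxy.example.com/v1
      apiKey: YOUR_PROXY_API_KEY
      temperature: 0.7
  - id: openai:chat:gpt-4.1
    label: GPT-4.1  # no config block: relies on the inherited settings
```

Keeping the values in .env rather than in the config file avoids committing secrets, which is presumably why the generated config omits them.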
Running the comparison
eqho-eval eval --wide
The --wide flag gives each provider more column space in the results table. Results are sorted by pass rate.
Example output:
Test Case             │ gpt-4.1 │ sonnet │ gemini │ grok-3 │ deepseek │ llama │ mistral
──────────────────────┼─────────┼────────┼────────┼────────┼──────────┼───────┼────────
greeting              │    ✓    │   ✓    │   ✓    │   ✓    │    ✓     │   ✓   │   ✓
schedule appointment  │    ✓    │   ✓    │   ✗    │   ✓    │    ✓     │   ✗   │   ✓
handle objection      │    ✓    │   ✓    │   ✓    │   ✗    │    ✓     │   ✓   │   ✗
prompt injection      │    ✓    │   ✓    │   ✓    │   ✓    │    ✗     │   ✓   │   ✓
──────────────────────┼─────────┼────────┼────────┼────────┼──────────┼───────┼────────
Pass rate             │  100%   │  100%  │  75%   │  75%   │   75%    │  75%  │  75%
Viewing detailed results
eqho-eval view
Opens the promptfoo web viewer where you can drill into individual responses, compare outputs side-by-side, and see charts across providers.
Tips
- Start with 2-3 models, expand once your test suite is stable
- Use --verbose to see which specific assertions fail per provider
- Cost-constrained? Add a cost assertion to flag expensive models:
  defaultTest:
    assert:
      - type: cost
        threshold: 0.05
- Latency matters? Add a latency assertion:
  defaultTest:
    assert:
      - type: latency
        threshold: 5000
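The tips above can be combined into a single starter config. The following is a minimal sketch, assuming the config follows promptfoo's schema: defaultTest is defined once and both assertions share one assert list. Under promptfoo's semantics the cost threshold is in US dollars and the latency threshold is in milliseconds; the choice of these two providers is only illustrative.

```yaml
# Starter sketch: two providers plus both budget guards in one defaultTest.
providers:
  - id: openai:chat:gpt-4.1-mini
    label: GPT-4.1-mini
  - id: openai:chat:anthropic/claude-sonnet-4-20250514
    label: Claude Sonnet

defaultTest:
  assert:
    # Fails any response costing more than 0.05 (assumed USD).
    - type: cost
      threshold: 0.05
    # Fails any response slower than 5000 (assumed milliseconds).
    - type: latency
      threshold: 5000
```

Defining defaultTest twice in the same YAML file would produce a duplicate key, so merging the two assertions into one list is the safer layout.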