Multi-model comparison

Test your agent's behavior across multiple LLMs in a single evaluation run. Since all models route through the proxy, switching providers is as simple as changing a string.
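
For example, re-pointing a provider at a different model is a one-line edit to its id (both ids below appear in the setup config that follows):

# before
- id: openai:chat:gpt-4.1-mini
# after: same proxy, different vendor
- id: openai:chat:anthropic/claude-sonnet-4-20250514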

Setup

The default generated config already includes multiple providers. Add or remove as needed:

providers:
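  # label controls how each provider is displayed in results;
  # per-provider config (temperature, tools) applies only to that entry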
  - id: openai:chat:gpt-4.1-mini
    label: GPT-4.1-mini
    config:
      temperature: 0.7
      tools: file://tools/agent.json

  - id: openai:chat:gpt-4.1
    label: GPT-4.1

  - id: openai:chat:anthropic/claude-sonnet-4-20250514
    label: Claude Sonnet

  - id: openai:chat:google/gemini-2.5-pro
    label: Gemini 2.5

  - id: openai:chat:xai/grok-3
    label: Grok 3

  - id: openai:chat:deepseek/deepseek-v3
    label: DeepSeek V3

  - id: openai:chat:meta/llama-4-maverick
    label: Llama 4

All providers inherit apiBaseUrl and apiKey from .env, so there is no need to repeat them in each provider entry. Run eqho-eval providers list to discover available models.
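
A minimal .env sketch — the variable names here are assumptions, so check your generated file for the exact keys:

# .env (illustrative names; shared by every provider entry)
OPENAI_API_BASE_URL=https://your-proxy.example.com/v1
OPENAI_API_KEY=sk-...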

Running the comparison

eqho-eval eval --wide

The --wide flag gives each provider more column space in the results table. Provider columns are sorted by pass rate, best first.

Example output:

  Test Case             │ gpt-4.1 │ sonnet │ gemini │ grok-3 │ deepseek │ llama │ 4.1-mini
  ──────────────────────┼─────────┼────────┼────────┼────────┼──────────┼───────┼─────────
  greeting              │   ✓     │   ✓    │   ✓    │   ✓    │    ✓     │  ✓    │    ✓
  schedule appointment  │   ✓     │   ✓    │   ✗    │   ✓    │    ✓     │  ✗    │    ✓
  handle objection      │   ✓     │   ✓    │   ✓    │   ✗    │    ✓     │  ✓    │    ✗
  prompt injection      │   ✓     │   ✓    │   ✓    │   ✓    │    ✗     │  ✓    │    ✓
  ──────────────────────┼─────────┼────────┼────────┼────────┼──────────┼───────┼─────────
  Pass rate             │  100%   │  100%  │  75%   │  75%   │   75%    │  75%  │   75%

Viewing detailed results

eqho-eval view

Opens the promptfoo web viewer, where you can drill into individual responses, compare outputs side by side, and see charts across providers.

Tips

  • Start with 2-3 models, then expand once your test suite is stable
  • Use --verbose to see which specific assertions fail per provider
  • Cost-constrained? Add a cost assertion to flag expensive models:

    defaultTest:
      assert:
        - type: cost
          threshold: 0.05

  • Latency matters? Add a latency assertion:

    defaultTest:
      assert:
        - type: latency
          threshold: 5000
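
Note that defaultTest is a single YAML map, so to apply both checks, merge them into one assert list rather than repeating the key — a sketch, assuming promptfoo's usual units of dollars for cost and milliseconds for latency:

defaultTest:
  assert:
    - type: cost
      threshold: 0.05     # fail any response costing more than $0.05
    - type: latency
      threshold: 5000     # fail any response slower than 5000 ms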