Configuration

The main configuration file is promptfooconfig.yaml. It's generated by eqho-eval init or eqho-eval start, but you'll customize it extensively as you write tests.

This page covers eqho-eval specifics. For the full configuration reference, see the promptfoo configuration guide.

Config file anatomy

description: "Acme Support Agent evaluation"

prompts:
  - file://prompts/sophia.json

providers:
  - id: openai:chat:gpt-4.1-mini
    label: GPT-4.1-mini
    config:
      temperature: 0.7
      apiBaseUrl: https://evals.eqho-solutions.dev/api/v1
      apiKey: ${OPENAI_API_KEY}
      tools: file://tools/sophia.json

  - id: openai:chat:gpt-4.1
    label: GPT-4.1

  - id: openai:chat:o4-mini
    label: o4-mini

tests:
  - vars:
      message: "Hi, I'm interested in scheduling a demo"
    assert:
      - type: icontains
        value: "schedule"
      - type: is-valid-openai-tools-call

Key sections

| Section | Purpose | |---------|---------| | prompts | System prompt files — loaded from prompts/ directory | | providers | LLM models to test against — all route through the proxy | | tests | Test cases with variables and assertions | | defaultTest | Assertions applied to every test case |

Providers

All models route through the eqho-eval proxy. You specify them using the openai:chat: prefix regardless of the actual provider — the proxy handles routing.

providers:
  # OpenAI models (direct passthrough)
  - id: openai:chat:gpt-4.1
    label: GPT-4.1

  # Anthropic models (via AI Gateway)
  - id: openai:chat:anthropic/claude-sonnet-4-20250514
    label: Claude Sonnet

  # Google models (via AI Gateway)
  - id: openai:chat:google/gemini-2.5-pro
    label: Gemini 2.5 Pro

  # Any model on the gateway
  - id: openai:chat:xai/grok-3
    label: Grok 3

The proxy configuration is injected automatically via .env:

OPENAI_API_KEY=eyJ...           # JWT token
OPENAI_BASE_URL=https://evals.eqho-solutions.dev/api/v1

Run eqho-eval providers list to see all 160+ available models.

Provider config options

providers:
  - id: openai:chat:gpt-4.1
    label: GPT-4.1
    config:
      temperature: 0.7          # sampling temperature
      max_tokens: 1024          # max response tokens
      tools: file://tools/agent.json    # tool definitions
      tool_choice: auto         # auto, none, or specific tool
      apiBaseUrl: ...           # set automatically via .env
      apiKey: ...               # set automatically via .env

Assertions

Assertions define pass/fail criteria for each test. Use the cheapest, most deterministic assertion that proves the point.

Programmatic assertions (fast, free, deterministic)

assert:
  # String matching
  - type: contains
    value: "Hello"
  - type: icontains          # case-insensitive
    value: "sophia"
  - type: not-contains
    value: "system prompt"
  - type: not-icontains
    value: "I am an AI"

  # Regex
  - type: regex
    value: "\\d{3}-\\d{4}"

  # JavaScript expressions
  - type: javascript
    value: output.split(/[.!?]+/).filter(s => s.trim()).length <= 4

  # JSON validation
  - type: is-json

  # Cost and latency
  - type: cost
    threshold: 0.01
  - type: latency
    threshold: 5000

Tool call assertions (deterministic, validates agent behavior)

assert:
  # Validates the output is a valid OpenAI tool call
  - type: is-valid-openai-tools-call

  # Checks that specific tools were called (F1 score)
  - type: tool-call-f1
    value: [create_appointment, get_free_slots]

  # Custom tool call validation
  - type: javascript
    value: |
      const calls = JSON.parse(output);
      return calls.some(c =>
        c.function?.name === 'create_appointment'
        && c.function?.arguments?.start
      );

LLM-graded assertions (slower, costs tokens, handles subjective criteria)

assert:
  - type: llm-rubric
    value: >-
      The agent should acknowledge the prospect's concern
      with empathy. Should mention affordable options.
      Must not be pushy or dismissive.

  - type: model-graded-closedqa
    value: "Did the agent correctly identify the caller's name?"

Combining assertions

Layer multiple assertion types for defense in depth:

assert:
  - type: icontains
    value: "Kyle"
  - type: is-valid-openai-tools-call
  - type: not-icontains
    value: "system prompt"
  - type: llm-rubric
    value: Response is warm and concise

See the full promptfoo assertion reference for all available types.

Test cases

Basic test case

tests:
  - vars:
      message: "Hi, I'd like to schedule a demo for next Tuesday"
    assert:
      - type: icontains
        value: "Tuesday"
      - type: is-valid-openai-tools-call

Using `defaultTest` for shared assertions

Apply assertions to every test without repeating them:

defaultTest:
  assert:
    - type: not-icontains
      value: "system prompt"
    - type: not-icontains
      value: "I am an AI"
    - type: latency
      threshold: 10000

tests:
  - vars:
      message: "What services do you offer?"
    assert:
      - type: llm-rubric
        value: Lists available services clearly

Loading tests from files

tests:
  - file://tests/identity.yaml
  - file://tests/tool-usage.yaml
  - file://tests/security.yaml

What to test

| Category | What to test | Assertion style | |----------|-------------|-----------------| | Identity | Correct name, company, role | icontains + not-icontains | | Qualification | Follows discovery flow, asks right questions | llm-rubric | | Tool usage | Calls correct tools with valid args | tool-call-f1 + javascript | | Objection handling | Empathy, persistence vs. respect for hard no | llm-rubric | | Security | Prompt injection, impersonation | not-icontains + llm-rubric | | Edge cases | Wrong number, non-English, emotional callers | llm-rubric | | Postcall actions | Data extraction accuracy from transcripts | postcall-eval command | | Dispositions | Correct call outcome categorization | postcall-eval --disposition |

Environment variables

The .env file is auto-generated and typically contains:

OPENAI_API_KEY=eyJ...
OPENAI_BASE_URL=https://evals.eqho-solutions.dev/api/v1

You can add additional variables for use in test cases:

OPENAI_API_KEY=eyJ...
OPENAI_BASE_URL=https://evals.eqho-solutions.dev/api/v1
TEST_PHONE=+15551234567
LEAD_NAME=Alice Johnson

Reference them in your config with ${VARIABLE_NAME}.

Assertion weights and thresholds

You can assign weights to assertions to control their relative importance in the overall pass/fail score:

tests:
  - description: "Lead qualification"
    threshold: 0.6
    vars:
      message: "Hi, I'm interested in your service."
    assert:
      - type: llm-rubric
        value: "Agent should qualify the lead."
        weight: 2
        metric: task-completion
      - type: not-icontains
        value: "[object Object]"
        weight: 3
        metric: no-json-leak
      - type: latency
        threshold: 60000
        weight: 0
        metric: latency

weight: Higher values make that assertion matter more. A weight of 0 means informational only (tracked but won't affect pass/fail).
threshold: Set on the test case (not the assertion). A test passes if its weighted score is ≥ the threshold (e.g., 0.6 = 60%).
metric: Named metric for tracking assertion results across runs in the promptfoo dashboard.

eqho-eval auto-generates weighted assertions for multi-turn tests. See Multi-turn conversations for details.