Configuration
The main configuration file is promptfooconfig.yaml. It's generated by eqho-eval init or eqho-eval start, but you'll customize it extensively as you write tests.
This page covers eqho-eval specifics. For the full configuration reference, see the promptfoo configuration guide.
Config file anatomy
description: "Acme Support Agent evaluation"
prompts:
- file://prompts/sophia.json
providers:
- id: openai:chat:gpt-4.1-mini
label: GPT-4.1-mini
config:
temperature: 0.7
apiBaseUrl: https://evals.eqho-solutions.dev/api/v1
apiKey: ${OPENAI_API_KEY}
tools: file://tools/sophia.json
- id: openai:chat:gpt-4.1
label: GPT-4.1
- id: openai:chat:o4-mini
label: o4-mini
tests:
- vars:
message: "Hi, I'm interested in scheduling a demo"
assert:
- type: icontains
value: "schedule"
- type: is-valid-openai-tools-call
Key sections
| Section | Purpose |
|---------|---------|
| prompts | System prompt files — loaded from prompts/ directory |
| providers | LLM models to test against — all route through the proxy |
| tests | Test cases with variables and assertions |
| defaultTest | Assertions applied to every test case |
Providers
All models route through the eqho-eval proxy. You specify them using the openai:chat: prefix regardless of the actual provider — the proxy handles routing.
providers:
# OpenAI models (direct passthrough)
- id: openai:chat:gpt-4.1
label: GPT-4.1
# Anthropic models (via AI Gateway)
- id: openai:chat:anthropic/claude-sonnet-4-20250514
label: Claude Sonnet
# Google models (via AI Gateway)
- id: openai:chat:google/gemini-2.5-pro
label: Gemini 2.5 Pro
# Any model on the gateway
- id: openai:chat:xai/grok-3
label: Grok 3
The proxy configuration is injected automatically via .env:
OPENAI_API_KEY=eyJ... # JWT token
OPENAI_BASE_URL=https://evals.eqho-solutions.dev/api/v1
Run
eqho-eval providers listto see all 160+ available models.
Provider config options
providers:
- id: openai:chat:gpt-4.1
label: GPT-4.1
config:
temperature: 0.7 # sampling temperature
max_tokens: 1024 # max response tokens
tools: file://tools/agent.json # tool definitions
tool_choice: auto # auto, none, or specific tool
apiBaseUrl: ... # set automatically via .env
apiKey: ... # set automatically via .env
Assertions
Assertions define pass/fail criteria for each test. Use the cheapest, most deterministic assertion that proves the point.
Programmatic assertions (fast, free, deterministic)
assert:
# String matching
- type: contains
value: "Hello"
- type: icontains # case-insensitive
value: "sophia"
- type: not-contains
value: "system prompt"
- type: not-icontains
value: "I am an AI"
# Regex
- type: regex
value: "\\d{3}-\\d{4}"
# JavaScript expressions
- type: javascript
value: output.split(/[.!?]+/).filter(s => s.trim()).length <= 4
# JSON validation
- type: is-json
# Cost and latency
- type: cost
threshold: 0.01
- type: latency
threshold: 5000
Tool call assertions (deterministic, validates agent behavior)
assert:
# Validates the output is a valid OpenAI tool call
- type: is-valid-openai-tools-call
# Checks that specific tools were called (F1 score)
- type: tool-call-f1
value: [create_appointment, get_free_slots]
# Custom tool call validation
- type: javascript
value: |
const calls = JSON.parse(output);
return calls.some(c =>
c.function?.name === 'create_appointment'
&& c.function?.arguments?.start
);
LLM-graded assertions (slower, costs tokens, handles subjective criteria)
assert:
- type: llm-rubric
value: >-
The agent should acknowledge the prospect's concern
with empathy. Should mention affordable options.
Must not be pushy or dismissive.
- type: model-graded-closedqa
value: "Did the agent correctly identify the caller's name?"
Combining assertions
Layer multiple assertion types for defense in depth:
assert:
- type: icontains
value: "Kyle"
- type: is-valid-openai-tools-call
- type: not-icontains
value: "system prompt"
- type: llm-rubric
value: Response is warm and concise
See the full promptfoo assertion reference for all available types.
Test cases
Basic test case
tests:
- vars:
message: "Hi, I'd like to schedule a demo for next Tuesday"
assert:
- type: icontains
value: "Tuesday"
- type: is-valid-openai-tools-call
Using defaultTest for shared assertions
Apply assertions to every test without repeating them:
defaultTest:
assert:
- type: not-icontains
value: "system prompt"
- type: not-icontains
value: "I am an AI"
- type: latency
threshold: 10000
tests:
- vars:
message: "What services do you offer?"
assert:
- type: llm-rubric
value: Lists available services clearly
Loading tests from files
tests:
- file://tests/identity.yaml
- file://tests/tool-usage.yaml
- file://tests/security.yaml
What to test
| Category | What to test | Assertion style |
|----------|-------------|-----------------|
| Identity | Correct name, company, role | icontains + not-icontains |
| Qualification | Follows discovery flow, asks right questions | llm-rubric |
| Tool usage | Calls correct tools with valid args | tool-call-f1 + javascript |
| Objection handling | Empathy, persistence vs. respect for hard no | llm-rubric |
| Security | Prompt injection, impersonation | not-icontains + llm-rubric |
| Edge cases | Wrong number, non-English, emotional callers | llm-rubric |
| Postcall actions | Data extraction accuracy from transcripts | postcall-eval command |
| Dispositions | Correct call outcome categorization | postcall-eval --disposition |
Environment variables
The .env file is auto-generated and typically contains:
OPENAI_API_KEY=eyJ...
OPENAI_BASE_URL=https://evals.eqho-solutions.dev/api/v1
You can add additional variables for use in test cases:
OPENAI_API_KEY=eyJ...
OPENAI_BASE_URL=https://evals.eqho-solutions.dev/api/v1
TEST_PHONE=+15551234567
LEAD_NAME=Alice Johnson
Reference them in your config with ${VARIABLE_NAME}.
Assertion weights and thresholds
You can assign weights to assertions to control their relative importance in the overall pass/fail score:
tests:
- description: "Lead qualification"
threshold: 0.6
vars:
message: "Hi, I'm interested in your service."
assert:
- type: llm-rubric
value: "Agent should qualify the lead."
weight: 2
metric: task-completion
- type: not-icontains
value: "[object Object]"
weight: 3
metric: no-json-leak
- type: latency
threshold: 60000
weight: 0
metric: latency
weight: Higher values make that assertion matter more. A weight of0means informational only (tracked but won't affect pass/fail).threshold: Set on the test case (not the assertion). A test passes if its weighted score is ≥ the threshold (e.g.,0.6= 60%).metric: Named metric for tracking assertion results across runs in the promptfoo dashboard.
eqho-eval auto-generates weighted assertions for multi-turn tests. See Multi-turn conversations for details.