# Multi-turn conversations
Test your Eqho agents in realistic, back-and-forth conversations using promptfoo's simulated-user provider.
## How it works
The `promptfoo:simulated-user` provider creates a loop between two models:
- A simulated user (controlled by promptfoo) follows persona instructions you define
- Your agent (the model being tested) responds naturally
They go back and forth until `maxTurns` is reached or the agent emits `###STOP###`.
```
Simulated User ──▶ Agent
      ◀── responds ──┘
  ──▶ follows up ──▶ Agent
      ◀── responds ──┘
  ... until maxTurns or ###STOP###
```
## Quick start: generate from real calls
The fastest way to create multi-turn tests is from real Eqho conversations:
```bash
eqho-eval conversations --multi-turn
```
This pulls recent calls, analyzes each caller's behavior, and generates `promptfoo:simulated-user` test cases automatically.
### Options
| Flag | Description |
|------|-------------|
| `-m, --multi-turn` | Generate multi-turn tests (required) |
| `-n, --last <count>` | Number of calls to pull (default: 25) |
| `--max-turns <n>` | Override the auto-calculated max turns per test |
| `--seed-turns <n>` | Include the first N exchanges as `initialMessages` |
### Examples
```bash
# Pull the last 25 calls, auto-detect turn count
eqho-eval conversations --multi-turn

# Pull 10 calls, cap at 6 turns each
eqho-eval conversations -m -n 10 --max-turns 6

# Seed with the first 3 exchanges for mid-conversation testing
eqho-eval conversations -m --seed-turns 3
```
The generated `tests/conversations.yaml` contains one test per call, each with:

- `instructions` describing the caller's persona and behavior
- `message` set to the caller's first message
- `assert` with an `llm-rubric` based on the call's actual disposition
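For example, one generated entry might look like this (a sketch only; the persona text and rubric wording are derived from each analyzed call, so your output will differ):

```yaml
- vars:
    instructions: |
      You are a caller who wants to reschedule an appointment.
      You are reactive — only respond to what the agent asks.
      After your main question is resolved, begin wrapping up.
    message: "Hi, I need to move my appointment to next week."
  assert:
    - type: llm-rubric
      value: "The agent should reschedule the appointment, ideally within 6 turns."
```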
## Quick start: generate from scratch
When initializing a new eval project, use the `--multi-turn` flag:
```bash
eqho-eval init --campaign <id> --multi-turn
```
This generates starter scenarios (interested lead, skeptical lead, firm refusal, mid-conversation resume) with `promptfoo:simulated-user` wired into `defaultTest`.
## Writing custom scenarios
### Basic pattern
Set the simulated-user provider on `defaultTest` and define each scenario in `tests`:
```yaml
defaultTest:
  provider:
    id: 'promptfoo:simulated-user'
    config:
      maxTurns: 8

tests:
  - vars:
      instructions: |
        You are Alex Thompson. You called because you want to schedule
        a demo. Be cooperative but ask about pricing before committing.
        You are reactive — only respond to what the agent asks.
      message: "Hi, I'd like to learn more about your service."
    assert:
      - type: llm-rubric
        value: "The agent should qualify the lead and successfully schedule a demo."
```
### Mid-conversation resume with `initialMessages`
Skip the early turns and test how your agent handles a specific point in the conversation:
```yaml
tests:
  - vars:
      instructions: |
        You want to pay with your credit card ending in 4242.
        Provide details when asked.
      message: "I'd like to pay with my credit card."
      initialMessages:
        - role: user
          content: "Hi, I'd like to learn about your service."
        - role: assistant
          content: "I'd be happy to help! Can I get your name?"
        - role: user
          content: "It's Alex Thompson."
        - role: assistant
          content: "Great, Alex! Let me walk you through our options."
    assert:
      - type: llm-rubric
        value: "The agent should handle payment smoothly."
```
`maxTurns` only counts **new** turns after `initialMessages` — the seed messages don't count.
### Loading `initialMessages` from files
For longer conversation histories, reference a JSON file:
```yaml
tests:
  - vars:
      initialMessages: file://tests/transcripts/call-abc123.json
      instructions: "Continue the conversation. You want to reschedule."
```
## Tool call testing in multi-turn
### The problem with default tool callbacks
promptfoo's built-in `functionToolCallbacks` have a known limitation: when a model makes a tool call, the stub's return value is surfaced directly as the assistant's response instead of being sent back to the model for a natural-language wrap-up. The result is raw JSON in the transcript, or a conversation that ends abruptly.
### The fix: eqho custom provider
`eqho-eval init --multi-turn` generates a custom provider (`providers/eqho-agent.js`) that wraps the proxy and handles the full tool-call loop:
1. The model responds with a tool call
2. The provider executes the stub locally
3. The tool result is appended to the message history
4. The model is called again to produce a natural response
5. Steps 1-4 repeat for up to 5 rounds if the model chains tool calls
```yaml
providers:
  - id: file://providers/eqho-agent.js:eqho:gpt-4.1
    label: GPT-4.1
    config:
      apiBaseUrl: https://your-proxy.vercel.app/api/v1
      apiKey: your-token
      tools: file://tools/agent-tools.json
      temperature: 0.7
```
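For reference, here is a simplified sketch of that loop. The helper names (`callModel`, the stub module path) are assumptions for illustration, not the actual generated code:

```javascript
// Simplified sketch of the tool-call loop in an OpenAI-style chat format.
// The real generated provider also implements promptfoo's provider interface.
const stubs = require("./tool-stubs"); // hypothetical stub module (see below)

async function runToolLoop(callModel, messages) {
  // callModel(messages) is an assumed helper that calls the proxy's
  // chat-completions endpoint and returns the assistant message.
  for (let round = 0; round < 5; round++) {
    const reply = await callModel(messages);
    const toolCalls = reply.tool_calls || [];
    if (toolCalls.length === 0) {
      return reply.content; // natural-language answer: the loop is done
    }
    messages.push(reply); // keep the assistant's tool-call turn in history
    for (const call of toolCalls) {
      const stub = stubs[call.function.name];
      messages.push({
        role: "tool",
        tool_call_id: call.id,
        content: stub ? stub(call.function.arguments) : "No stub defined.",
      });
    }
  }
  return "Stopped after 5 tool rounds."; // safety valve for chained calls
}

module.exports = { runToolLoop };
```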
### Writing tool stubs
Stubs should return natural language strings, not JSON objects. This ensures the tool result reads naturally if the model echoes it:
```javascript
module.exports = {
  schedule_appointment: (args) => {
    // Arguments may arrive as a JSON string or an already-parsed object
    let parsed = args;
    if (typeof args === "string") {
      try { parsed = JSON.parse(args); } catch { parsed = {}; }
    }
    // Return a natural-language string, not a JSON object
    return `Appointment confirmed for ${parsed.date || "the requested time"}. Calendar invite sent.`;
  },
};
```
The `eqho-eval init --multi-turn` command generates these stubs automatically from your agent's tool definitions.
## Tips
**Sizing `maxTurns`:** There are two levels of turn control that work together:

- `defaultTest.provider.config.maxTurns` — the hard ceiling set in your `promptfooconfig.yaml`. This is the maximum number of exchanges the simulated user will send before stopping. Set it to a generous upper bound (e.g., 10-15).
- Per-test assertion turn count — the auto-generated `llm-rubric` assertion that says "ideally within N turns." This is a soft quality signal, not a hard cutoff: the grader will penalize conversations that run too long without resolution, but won't fail them if they're still productive.
When using `eqho-eval conversations --multi-turn`, the per-test turn target is calculated as 2x the original call's user turn count (capped at 15). You can override the global ceiling with `--max-turns`.
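Concretely, the two levels might be wired like this (the numbers are illustrative):

```yaml
defaultTest:
  provider:
    id: 'promptfoo:simulated-user'
    config:
      maxTurns: 12   # hard ceiling: the simulated user stops here regardless

tests:
  - vars:
      instructions: "..."
    assert:
      - type: llm-rubric
        value: "Resolves the caller's request, ideally within 6 turns."  # soft target
```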
**Writing good instructions:** Be specific about the persona's goals, constraints, and communication style. Always include both:

- "You are reactive — only respond to what the agent asks" to prevent the simulated user from leading
- "After your main question is resolved, begin wrapping up" to prevent endless tangents

The auto-generated tests include both directives automatically.
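For example, a persona that includes both directives (the persona details here are invented for illustration):

```yaml
instructions: |
  You are Jordan, a skeptical lead comparing vendors. Ask pointed
  questions about pricing before agreeing to anything.
  You are reactive — only respond to what the agent asks.
  After your main question is resolved, begin wrapping up.
```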
**Assertions:** `eqho-eval` generates weighted assertions per test. Higher weight means more impact on the overall pass/fail score:
| Assertion | Weight | Metric | Purpose |
|-----------|--------|--------|---------|
| `not-icontains: [object Object]` | 3.0 | `no-json-leak` | Catches serialization bugs in tool-call handling |
| Tone rubric | 2.0 | `tone` | Warm, professional, not robotic |
| Role adherence rubric | 1.5 | `role-adherence` | Agent stays in character, doesn't roleplay as human |
| Disposition rubric | 2.0 | `disposition-outcome` | Outcome matches expected behavior (schedule, decline, etc.) |
| Knowledge retention | 1.5 | `knowledge-retention` | Agent remembers the user's name, issue, and details |
| Closing attempt rubric | 0.5 | `closing-attempt` | Agent tries to wrap up naturally |
| Turn count rubric | 0.5 | `turn-efficiency` | Conversation reaches a conclusion in reasonable turns |
| Latency | 0 (info-only) | `latency` | Response time under 60s |
**Test threshold:** Each auto-generated test has `threshold: 0.6`, meaning a test passes if its weighted assertion score is ≥ 60%. This prevents a single low-priority failure from failing the entire test.
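As a sketch, the weighting and threshold could appear in a generated test like this (the exact generated output may differ):

```yaml
tests:
  - threshold: 0.6          # pass if the weighted score is >= 60%
    vars:
      instructions: "..."
    assert:
      - type: not-icontains
        value: "[object Object]"
        weight: 3.0
        metric: no-json-leak
      - type: llm-rubric
        value: "The agent's tone is warm and professional, not robotic."
        weight: 2.0
        metric: tone
```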
**Debugging:** Set `LOG_LEVEL=debug` to see every message exchanged:

```bash
LOG_LEVEL=debug eqho-eval eval
```
**Combining with single-turn:** You can mix multi-turn and single-turn tests in the same config by setting the provider on individual tests rather than `defaultTest`, as sketched below.
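For example (a sketch following the pattern above; the single-turn test runs once against the top-level `providers` list):

```yaml
tests:
  # Single-turn: no per-test provider, so the top-level providers respond once
  - vars:
      message: "What are your business hours?"
    assert:
      - type: llm-rubric
        value: "States the business hours clearly."

  # Multi-turn: this test overrides the provider with the simulated user
  - provider:
      id: 'promptfoo:simulated-user'
      config:
        maxTurns: 8
    vars:
      instructions: "You want to schedule a demo. Be cooperative."
      message: "Hi, I'd like to book a demo."
    assert:
      - type: llm-rubric
        value: "The agent schedules the demo."
```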