Multi-turn conversations

Test your Eqho agents in realistic, back-and-forth conversations using promptfoo's simulated-user provider.


How it works

The promptfoo:simulated-user provider creates a loop between two models:

  1. A simulated user (controlled by promptfoo) follows persona instructions you define
  2. Your agent (the model being tested) responds naturally

They go back and forth until maxTurns is reached or the agent emits ###STOP###.

Simulated User  ──▶  Agent
     ◀──  responds  ──┘
     ──▶  follows up ──▶  Agent
     ◀──  responds  ──┘
     ...until maxTurns or ###STOP###

Quick start: generate from real calls

The fastest way to create multi-turn tests is from real Eqho conversations:

eqho-eval conversations --multi-turn

This pulls recent calls, analyzes each caller's behavior, and generates promptfoo:simulated-user test cases automatically.

Options

| Flag | Description |
|------|-------------|
| -m, --multi-turn | Generate multi-turn tests (required) |
| -n, --last <count> | Number of calls to pull (default: 25) |
| --max-turns <n> | Override auto-calculated max turns per test |
| --seed-turns <n> | Include first N exchanges as initialMessages |

Examples

# Pull last 25 calls, auto-detect turn count
eqho-eval conversations --multi-turn

# Pull 10 calls, cap at 6 turns each
eqho-eval conversations -m -n 10 --max-turns 6

# Seed with first 3 exchanges for mid-conversation testing
eqho-eval conversations -m --seed-turns 3

The generated tests/conversations.yaml contains one test per call, each with:

  • instructions describing the caller's persona and behavior
  • message set to the caller's first message
  • assert with an llm-rubric based on the call's actual disposition
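A generated entry looks roughly like this (the persona, first message, and rubric are illustrative and trimmed to a single rubric for brevity; real values are derived from the call):

- vars:
    instructions: |
      You are a returning caller who wants to cancel a Friday appointment.
      You are polite but in a hurry.
      You are reactive: only respond to what the agent asks.
      After your main question is resolved, begin wrapping up.
    message: "Hi, I need to cancel my appointment for Friday."
  assert:
    - type: llm-rubric
      value: "The agent should confirm the cancellation politely, matching the original call's disposition."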

Quick start: generate from scratch

When initializing a new eval project, use the --multi-turn flag:

eqho-eval init --campaign <id> --multi-turn

This generates starter scenarios (interested lead, skeptical lead, firm refusal, mid-conversation resume) with promptfoo:simulated-user wired into defaultTest.


Writing custom scenarios

Basic pattern

Set the simulated-user provider on defaultTest and define each scenario in tests:

defaultTest:
  provider:
    id: 'promptfoo:simulated-user'
    config:
      maxTurns: 8

tests:
  - vars:
      instructions: |
        You are Alex Thompson. You called because you want to schedule
        a demo. Be cooperative but ask about pricing before committing.
        You are reactive — only respond to what the agent asks.
      message: "Hi, I'd like to learn more about your service."
    assert:
      - type: llm-rubric
        value: "The agent should qualify the lead and successfully schedule a demo."

Mid-conversation resume with initialMessages

Skip the early turns and test how your agent handles a specific point in the conversation:

tests:
  - vars:
      instructions: |
        You want to pay with your credit card ending in 4242.
        Provide details when asked.
      message: "I'd like to pay with my credit card."
      initialMessages:
        - role: user
          content: "Hi, I'd like to learn about your service."
        - role: assistant
          content: "I'd be happy to help! Can I get your name?"
        - role: user
          content: "It's Alex Thompson."
        - role: assistant
          content: "Great, Alex! Let me walk you through our options."
    assert:
      - type: llm-rubric
        value: "The agent should handle payment smoothly."

maxTurns only counts NEW turns after initialMessages; the seed messages don't count. With the four seed messages above and maxTurns: 8, the simulated user can still send up to eight new messages.

Loading initialMessages from files

For longer conversation histories, reference a JSON file:

tests:
  - vars:
      initialMessages: file://tests/transcripts/call-abc123.json
      instructions: "Continue the conversation. You want to reschedule."

Tool call testing in multi-turn

The problem with default tool callbacks

promptfoo's built-in functionToolCallbacks have a known limitation: when a model makes a tool call, the stub's return value is surfaced directly as the assistant's response instead of being sent back to the model for a natural-language wrap-up. The result is raw JSON leaking into the transcript, or conversations that end abruptly.

The fix: eqho custom provider

eqho-eval init --multi-turn generates a custom provider (providers/eqho-agent.js) that wraps the proxy and handles the full tool call loop:

  1. Model responds with a tool call
  2. Provider executes the stub locally
  3. Tool result is appended to the message history
  4. Model is called again to produce a natural response
  5. Repeats up to 5 rounds if the model chains tool calls
Reference the generated provider from your promptfooconfig.yaml:

providers:
  - id: file://providers/eqho-agent.js:eqho:gpt-4.1
    label: GPT-4.1
    config:
      apiBaseUrl: https://your-proxy.vercel.app/api/v1
      apiKey: your-token
      tools: file://tools/agent-tools.json
      temperature: 0.7
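Conceptually, the loop the provider implements can be sketched like this. This is a simplified illustration, not the actual providers/eqho-agent.js: callModel stands in for the proxied chat-completions call, the message shapes assume an OpenAI-style tool-calling format, and the stubs path is hypothetical.

// Simplified sketch of the tool call loop (not the generated provider itself).
const stubs = require("./tools/stubs"); // hypothetical path to your stub module

async function runToolLoop(callModel, messages, maxRounds = 5) {
  for (let round = 0; round < maxRounds; round++) {
    const reply = await callModel(messages);       // 1. model responds
    const calls = reply.tool_calls || [];
    if (calls.length === 0) return reply.content;  // no tool call: natural answer

    messages.push(reply);                          // keep the tool-call turn
    for (const call of calls) {
      const stub = stubs[call.function.name];
      const result = stub
        ? stub(call.function.arguments)            // 2. execute the stub locally
        : `No stub defined for ${call.function.name}.`;
      messages.push({                              // 3. append the tool result
        role: "tool",
        tool_call_id: call.id,
        content: String(result),
      });
    }
    // 4./5. loop back so the model can produce a natural response,
    // repeating up to maxRounds times if it keeps chaining tool calls
  }
  return "Tool loop exceeded the maximum number of rounds.";
}

module.exports = { runToolLoop };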

Writing tool stubs

Stubs should return natural language strings, not JSON objects. This ensures the tool result reads naturally if the model echoes it:

module.exports = {
  schedule_appointment: (args) => {
    // Tool arguments may arrive as a JSON string or an already-parsed object
    let parsed = args;
    if (typeof args === "string") {
      try { parsed = JSON.parse(args); } catch { parsed = {}; }
    }
    // Return plain natural language so the result reads well if the model echoes it
    return `Appointment confirmed for ${parsed.date || "the requested time"}. Calendar invite sent.`;
  },
};

The eqho-eval init --multi-turn command generates these stubs automatically from your agent's tool definitions.


Tips

Sizing maxTurns: There are two levels of turn control that work together:

  1. defaultTest.provider.config.maxTurns — the hard ceiling set in your promptfooconfig.yaml. This is the maximum number of exchanges the simulated user will send before stopping. Set this to a generous upper bound (e.g., 10-15).

  2. Per-test assertion turn count — the auto-generated llm-rubric assertion that says "ideally within N turns." This is a soft quality signal, not a hard cutoff. The grader will penalize conversations that run too long without resolution, but won't fail them if they're still productive.

When using eqho-eval conversations --multi-turn, the per-test turn target is calculated as 2x the original call's user turn count (capped at 15), so a call where the caller spoke 4 times gets a target of 8 turns. You can override the global ceiling with --max-turns.
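In config terms, the two levels look like this (turn numbers are examples):

defaultTest:
  provider:
    id: 'promptfoo:simulated-user'
    config:
      maxTurns: 12   # hard ceiling on new exchanges

tests:
  - vars:
      instructions: "You want to reschedule your appointment. Be brief."
      message: "Hi, I need to move my appointment."
    assert:
      - type: llm-rubric
        value: "The agent reschedules the appointment, ideally within 6 turns."   # soft target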

Writing good instructions: Be specific about the persona's goals, constraints, and communication style. Always include both:

  • "You are reactive — only respond to what the agent asks" to prevent the simulated user from leading
  • "After your main question is resolved, begin wrapping up" to prevent endless tangents

The auto-generated tests include both directives automatically.
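For hand-written scenarios, a persona that covers goals, constraints, style, and both directives might read (wording illustrative):

instructions: |
  You are Jordan Lee, a current customer calling about a duplicate charge.
  Goal: get the charge refunded. Constraint: you will not read out your
  full card number. Style: terse and slightly impatient.
  You are reactive: only respond to what the agent asks.
  After your main question is resolved, begin wrapping up.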

Assertions: eqho-eval generates weighted assertions per test. Higher weight = more impact on the overall pass/fail score:

| Assertion | Weight | Metric | Purpose |
|-----------|--------|--------|---------|
| not-icontains: [object Object] | 3.0 | no-json-leak | Catches serialization bugs in tool call handling |
| Tone rubric | 2.0 | tone | Warm, professional, not robotic |
| Role adherence rubric | 1.5 | role-adherence | Agent stays in character, doesn't roleplay as human |
| Disposition rubric | 2.0 | disposition-outcome | Outcome matches expected behavior (schedule, decline, etc.) |
| Knowledge retention | 1.5 | knowledge-retention | Agent remembers user's name, issue, details |
| Closing attempt rubric | 0.5 | closing-attempt | Agent tries to wrap up naturally |
| Turn count rubric | 0.5 | turn-efficiency | Conversation reaches conclusion in reasonable turns |
| Latency | 0 (info-only) | latency | Response time under 60s |
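In the generated YAML, weighted assertions look roughly like this (rubric wording illustrative):

assert:
  - type: not-icontains
    value: "[object Object]"
    weight: 3.0
    metric: no-json-leak
  - type: llm-rubric
    value: "The agent's tone is warm and professional, never robotic."
    weight: 2.0
    metric: tone
  - type: llm-rubric
    value: "The outcome matches the expected disposition (e.g., the demo is scheduled)."
    weight: 2.0
    metric: disposition-outcome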

Test threshold: Each auto-generated test has threshold: 0.6, meaning a test passes if its weighted assertion score is ≥ 60%. This prevents a single low-priority failure from failing the entire test. With the weights above, for example, failing only the two 0.5-weight rubrics while everything else passes leaves a weighted score of roughly 10/11 ≈ 91%, which still passes.

Debugging: Set LOG_LEVEL=debug to see every message exchanged:

LOG_LEVEL=debug eqho-eval eval

Combining with single-turn: You can mix multi-turn and single-turn tests in the same config by setting the provider on individual tests rather than defaultTest.
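A sketch of one way to mix them, assuming your prompt renders the message var for single-turn tests: the multi-turn test carries its own simulated-user provider, while the single-turn test omits it and runs against the default provider.

tests:
  # Multi-turn: the simulated user drives the conversation
  - provider:
      id: 'promptfoo:simulated-user'
      config:
        maxTurns: 8
    vars:
      instructions: "You want to schedule a demo. Be cooperative."
      message: "Hi, I'd like to book a demo."
    assert:
      - type: llm-rubric
        value: "The agent qualifies the lead and schedules the demo."

  # Single-turn: one prompt, one response
  - vars:
      message: "What are your business hours?"
    assert:
      - type: llm-rubric
        value: "The agent answers the hours question clearly."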