The Evaluation section shows how well your agents judge — not just whether they responded, but whether they responded well.
Evaluation engine
The engine runs asynchronously after ingest, in four stages:
- Warmup — waits for minimum decisions per agent
- Batch scoring — judge model scores each decision
- Aggregation — rolling averages per dimension
- Incident detection — low scores and streaks
View engine status in the Evaluation UI: pending count, agent states, configuration.
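The aggregation and incident-detection stages can be sketched as follows. This is a minimal illustration, not the engine's actual implementation: the names `RollingAggregator` and `detect_incidents`, the window size, the low-score cutoff, the streak length, and the dimension name `accuracy` are all assumptions made for the example.

```python
from collections import defaultdict, deque


class RollingAggregator:
    """Keeps a rolling window of scores per (agent, dimension) pair."""

    def __init__(self, window=50):  # window size is a hypothetical default
        self._scores = defaultdict(lambda: deque(maxlen=window))

    def add(self, agent_id, dimension_scores):
        """Record one decision's dimension scores (name -> score)."""
        for dimension, score in dimension_scores.items():
            self._scores[(agent_id, dimension)].append(score)

    def average(self, agent_id, dimension):
        """Return the rolling average, or None if nothing has been recorded."""
        window = self._scores.get((agent_id, dimension))
        if not window:
            return None
        return sum(window) / len(window)


def detect_incidents(overall_scores, low_threshold=4.0, streak_length=3):
    """Flag each low score, plus the point where a low streak is reached.

    Both thresholds are assumed values, not documented engine settings.
    Returns a list of (index, kind) tuples in chronological order.
    """
    incidents = []
    streak = 0
    for index, score in enumerate(overall_scores):
        if score < low_threshold:
            streak += 1
            incidents.append((index, "low_score"))
            if streak == streak_length:
                incidents.append((index, "low_streak"))
        else:
            streak = 0  # any acceptable score resets the streak
    return incidents
```

A bounded `deque` keeps the average tied to recent decisions, so an old run of poor scores eventually drops out of the window.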
Per-decision scores
Each evaluated decision includes:
- Overall score (0–10) and letter grade (A–F)
- Eight dimension scores (see Judgment dimensions)
- Judge reasoning (where exposed in the UI)
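One way to picture an evaluated decision is as a small record type. The sketch below is an assumption about shape only: the class name `DecisionEvaluation`, its field names, and the placeholder dimension names are hypothetical and do not reflect a documented schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DecisionEvaluation:
    """Hypothetical container for the fields listed above."""

    decision_id: str
    overall: float                    # overall score on the 0-10 scale
    grade: str                        # letter grade, A through F
    dimensions: dict                  # dimension name -> score (eight entries)
    reasoning: Optional[str] = None   # judge reasoning, when available

    def __post_init__(self):
        if not 0.0 <= self.overall <= 10.0:
            raise ValueError("overall score must be between 0 and 10")
        if self.grade not in {"A", "B", "C", "D", "F"}:
            raise ValueError("grade must be one of A, B, C, D, F")
        if len(self.dimensions) != 8:
            raise ValueError("expected exactly eight dimension scores")

    def weakest_dimension(self):
        """Return the name of the lowest-scoring dimension."""
        return min(self.dimensions, key=self.dimensions.get)
```

Validating on construction means a malformed record fails early instead of skewing later averages.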
Agent-level analytics
For each agent/profile:
- Dimension averages over time
- Trend lines after model or prompt changes
- Comparison across sub-agents
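To make the idea of a trend after a model or prompt change concrete, the sketch below compares a dimension's mean score before and after a change point. The function name `dimension_shift` and the idea of identifying the change by list index are assumptions for illustration; the UI may compute trends differently.

```python
from statistics import mean


def dimension_shift(scores, change_index):
    """Return mean(after) - mean(before) for one dimension.

    scores: chronological per-decision scores for a single dimension.
    change_index: index of the first decision made after the change.
    A positive result means the dimension improved after the change.
    """
    before = scores[:change_index]
    after = scores[change_index:]
    if not before or not after:
        raise ValueError("need at least one score on each side of the change")
    return mean(after) - mean(before)
```

For example, `dimension_shift([6.0, 6.0, 8.0, 8.0], 2)` compares the first two decisions with the last two.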
Warmup threshold
New agents are not evaluated immediately. The engine waits until each agent has logged enough decisions for stable scoring, which prevents noisy scores from one-off turns.
Send representative decisions during staging so warmup completes before production traffic.
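The warmup gate can be thought of as a per-agent counter checked against a minimum. The sketch below is illustrative only: `WarmupGate` and the default of 20 decisions are hypothetical, and the real threshold comes from the engine configuration.

```python
class WarmupGate:
    """Tracks decisions per agent and reports when scoring may begin."""

    def __init__(self, min_decisions=20):  # 20 is an assumed placeholder value
        self.min_decisions = min_decisions
        self._counts = {}

    def record(self, agent_id):
        """Count one ingested decision for the agent."""
        self._counts[agent_id] = self._counts.get(agent_id, 0) + 1

    def is_ready(self, agent_id):
        """True once the agent has reached the minimum decision count."""
        return self._counts.get(agent_id, 0) >= self.min_decisions

    def remaining(self, agent_id):
        """Decisions still needed before the agent leaves warmup."""
        return max(0, self.min_decisions - self._counts.get(agent_id, 0))
```

During staging, a check like `remaining(agent_id)` indicates how many more representative decisions to send before going live.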
Grades
| Grade | Overall score |
|---|---|
| A | ≥ 8.5 |
| B | ≥ 7.0 |
| C | ≥ 5.5 |
| D | ≥ 4.0 |
| F | < 4.0 |
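The grade table translates directly into a lookup. The thresholds below come from the table; the function name `letter_grade` is a hypothetical helper, not part of any documented API.

```python
def letter_grade(overall):
    """Map an overall score (0-10) to a letter grade using the table's cutoffs."""
    if not 0.0 <= overall <= 10.0:
        raise ValueError("overall score must be between 0 and 10")
    # Checked from the highest cutoff down, so the first match wins.
    for grade, minimum in (("A", 8.5), ("B", 7.0), ("C", 5.5), ("D", 4.0)):
        if overall >= minimum:
            return grade
    return "F"
```

Each cutoff is inclusive, so a score of exactly 8.5 earns an A and exactly 4.0 earns a D.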