The Evaluation section shows how well your agents judge — not just whether they responded, but whether they responded well.

Evaluation engine

The engine runs asynchronously after ingest:
  1. Warmup: waits until each agent has a minimum number of decisions
  2. Batch scoring: a judge model scores each decision
  3. Aggregation: computes rolling averages per dimension
  4. Incident detection: flags low scores and streaks
View engine status in the Evaluation UI: pending count, agent states, configuration.
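
As an illustration of the aggregation and incident-detection steps, the sketch below keeps a rolling average for one dimension and flags a streak of low scores. It is not the engine's actual implementation: the class name `DimensionTracker`, the window size, the low-score threshold, and the reading of "streak" as consecutive low scores are all assumptions made for the example.

```python
from collections import deque
from typing import Optional


class DimensionTracker:
    """Rolling average and low-score streak detection for one dimension.

    All defaults below are hypothetical, not documented engine settings.
    """

    def __init__(self, window: int = 50, low_threshold: float = 4.0,
                 streak_length: int = 3) -> None:
        self.scores = deque(maxlen=window)  # oldest scores drop off automatically
        self.low_threshold = low_threshold
        self.streak_length = streak_length
        self.low_streak = 0

    def add(self, score: float) -> bool:
        """Record a score; return True if a low-score streak is now active."""
        self.scores.append(score)
        if score < self.low_threshold:
            self.low_streak += 1
        else:
            self.low_streak = 0  # any acceptable score breaks the streak
        return self.low_streak >= self.streak_length

    @property
    def rolling_average(self) -> Optional[float]:
        """Mean of the scores currently in the window, or None if empty."""
        if not self.scores:
            return None
        return sum(self.scores) / len(self.scores)
```

A bounded `deque` keeps the window update constant-time, which suits an engine that scores decisions in batches after ingest.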

Per-decision scores

Each evaluated decision includes:
  • Overall score (0–10) and letter grade (A–F)
  • Eight dimension scores (see Judgment dimensions)
  • Judge reasoning (where exposed in the UI)
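
One way to picture an evaluated decision is as a small record type. The sketch below is hypothetical: the class name `DecisionEvaluation` and its field names are invented for illustration and do not describe the product's actual schema. The dimension names are left to the caller because they are defined under Judgment dimensions.

```python
from dataclasses import dataclass
from typing import Dict, Optional

EXPECTED_DIMENSIONS = 8  # each evaluated decision carries eight dimension scores
VALID_GRADES = {"A", "B", "C", "D", "F"}


@dataclass(frozen=True)
class DecisionEvaluation:
    """Hypothetical container for one evaluated decision."""

    overall: float                   # overall score, 0 to 10
    grade: str                       # letter grade
    dimensions: Dict[str, float]     # dimension name -> score
    reasoning: Optional[str] = None  # judge reasoning, if exposed

    def __post_init__(self) -> None:
        if not 0.0 <= self.overall <= 10.0:
            raise ValueError("overall score must be between 0 and 10")
        if self.grade not in VALID_GRADES:
            raise ValueError(f"unknown grade: {self.grade!r}")
        if len(self.dimensions) != EXPECTED_DIMENSIONS:
            raise ValueError(f"expected {EXPECTED_DIMENSIONS} dimension scores")
```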

Agent-level analytics

For each agent/profile:
  • Dimension averages over time
  • Trend lines after model or prompt changes
  • Comparison across sub-agents
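
To make the idea of a trend after a model or prompt change concrete, the sketch below compares each dimension's mean score before and after a change point. The function name `dimension_shift` and the input shape (a list of timestamp and score-dictionary pairs) are assumptions for the example, not an export format the product is known to provide.

```python
from statistics import mean
from typing import Dict, List, Tuple

Record = Tuple[float, Dict[str, float]]  # (timestamp, {dimension: score})


def dimension_shift(records: List[Record], changed_at: float) -> Dict[str, float]:
    """Return after-mean minus before-mean for each dimension.

    Records with timestamp < changed_at count as "before"; the rest as
    "after". Dimensions missing from either side are skipped.
    """
    before: Dict[str, List[float]] = {}
    after: Dict[str, List[float]] = {}
    for timestamp, scores in records:
        bucket = before if timestamp < changed_at else after
        for name, score in scores.items():
            bucket.setdefault(name, []).append(score)
    return {
        name: mean(after[name]) - mean(before[name])
        for name in before
        if name in after
    }
```

A positive value means the dimension improved after the change; with few decisions on either side, the difference is likely to be noisy.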

Warmup threshold

New agents are not evaluated immediately: the engine waits until enough decisions exist for stable scoring. This prevents noisy scores from one-off turns.
Send representative decisions during staging so warmup completes before production traffic.
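
The warmup gate can be pictured as a per-agent decision count checked against a minimum. The sketch below is a rough illustration only: `MIN_DECISIONS` is a made-up value (check the configuration shown in the Evaluation UI for the real one), and `agents_ready` is a hypothetical helper, not part of any documented API.

```python
from collections import Counter
from typing import Iterable, Set

MIN_DECISIONS = 20  # hypothetical threshold, not a documented default


def agents_ready(decision_agent_ids: Iterable[str],
                 min_decisions: int = MIN_DECISIONS) -> Set[str]:
    """Return the agents that have enough decisions to leave warmup.

    decision_agent_ids holds one agent ID per ingested decision.
    """
    counts = Counter(decision_agent_ids)
    return {agent for agent, n in counts.items() if n >= min_decisions}
```

Running a check like this against staging traffic gives a quick read on which agents would still be in warmup when production traffic starts.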

Grades

Grade    Overall score
A        ≥ 8.5
B        ≥ 7.0
C        ≥ 5.5
D        ≥ 4.0
F        < 4.0
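
The grade thresholds above translate directly into a lookup. The sketch below applies them to an overall score; the function name `letter_grade` is illustrative, and rejecting scores outside 0 to 10 is an assumption about how out-of-range input should be handled.

```python
# Lower bounds from the grade table, highest first.
GRADE_FLOORS = ((8.5, "A"), (7.0, "B"), (5.5, "C"), (4.0, "D"))


def letter_grade(overall: float) -> str:
    """Map an overall score (0 to 10) to a letter grade."""
    if not 0.0 <= overall <= 10.0:
        raise ValueError("overall score must be between 0 and 10")
    for floor, grade in GRADE_FLOORS:
        if overall >= floor:
            return grade
    return "F"  # anything below 4.0
```

For example, `letter_grade(8.5)` returns `"A"`, while `letter_grade(8.49)` returns `"B"` because each threshold is an inclusive lower bound.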