Evaluation - Histeeria

The Evaluation section shows how well your agents judge — not just whether they responded, but whether they responded well.

Evaluation engine

The engine runs asynchronously after ingest:

Warmup — waits for minimum decisions per agent
Batch scoring — judge model scores each decision
Aggregation — rolling averages per dimension
Incident detection — low scores and streaks
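
The incident-detection stage can be illustrated with a minimal sketch. The thresholds, the `Incident` type, and the `detect_incidents` function are all hypothetical names and values chosen for illustration; the engine's actual rules are not documented here.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the engine's real values are not documented here.
LOW_SCORE = 4.0      # a single overall score below this is flagged
STREAK_SCORE = 5.5   # scores below this count toward a streak
STREAK_LENGTH = 3    # consecutive weak scores that trigger a streak incident


@dataclass
class Incident:
    kind: str     # "low_score" or "streak"
    index: int    # position of the decision that triggered the incident
    score: float  # overall score of that decision


def detect_incidents(scores: list[float]) -> list[Incident]:
    """Scan overall scores in order and flag low scores and weak streaks."""
    incidents = []
    run = 0  # length of the current run of weak scores
    for i, score in enumerate(scores):
        if score < LOW_SCORE:
            incidents.append(Incident("low_score", i, score))
        run = run + 1 if score < STREAK_SCORE else 0
        # Fire once per streak, at the moment the run reaches the limit.
        if run == STREAK_LENGTH:
            incidents.append(Incident("streak", i, score))
    return incidents
```

With these assumed thresholds, `detect_incidents([8.0, 5.0, 5.2, 3.0, 9.0])` reports a low-score incident and a streak incident, both at index 3.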

View engine status in the Evaluation UI: pending count, agent states, configuration.

Per-decision scores

Each evaluated decision includes:

Overall score (0–10) and letter grade (A–F)
Eight dimension scores (see Judgment dimensions)
Judge reasoning (where the UI exposes it)
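
As a rough picture of what one evaluated decision carries, the fields above could be modeled as follows. This is a sketch, not Histeeria's schema: `DecisionScore`, `decision_id`, and the field names are assumptions, and the dimension names are left as arbitrary dictionary keys because they are defined elsewhere (see Judgment dimensions).

```python
from dataclasses import dataclass
from typing import Optional

DIMENSION_COUNT = 8  # the docs state eight dimension scores per decision


@dataclass(frozen=True)
class DecisionScore:
    decision_id: str                 # hypothetical identifier field
    overall: float                   # overall score, 0 to 10
    grade: str                       # letter grade: A, B, C, D, or F
    dimensions: dict[str, float]     # dimension name -> score
    reasoning: Optional[str] = None  # judge reasoning, when exposed

    def __post_init__(self):
        # Validate only what the docs state: range, grade set, dimension count.
        if not 0.0 <= self.overall <= 10.0:
            raise ValueError("overall must be between 0 and 10")
        if self.grade not in {"A", "B", "C", "D", "F"}:
            raise ValueError("grade must be one of A, B, C, D, F")
        if len(self.dimensions) != DIMENSION_COUNT:
            raise ValueError(f"expected {DIMENSION_COUNT} dimension scores")
```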

Agent-level analytics

For each agent/profile:

Dimension averages over time
Trend lines after model or prompt changes
Comparison across sub-agents
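
Dimension averages over time can be pictured as a rolling mean over each agent's most recent decisions. The sketch below assumes a fixed-size window; the class name `DimensionAverages` and the default window of 50 are illustrative choices, not documented engine settings.

```python
from collections import defaultdict, deque


class DimensionAverages:
    """Rolling per-dimension averages over the last `window` decisions."""

    def __init__(self, window: int = 50):  # window size is an assumption
        # Each dimension keeps only its most recent `window` scores.
        self._scores = defaultdict(lambda: deque(maxlen=window))

    def add(self, dimensions: dict[str, float]) -> None:
        """Record the dimension scores of one evaluated decision."""
        for name, score in dimensions.items():
            self._scores[name].append(score)

    def averages(self) -> dict[str, float]:
        """Return the current rolling average for every dimension seen."""
        return {name: sum(v) / len(v) for name, v in self._scores.items()}
```

Keeping one tracker per agent or sub-agent, and resetting it after a model or prompt change, gives a simple before-and-after comparison.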

Warmup threshold

New agents don’t evaluate immediately — the engine waits until enough decisions exist for stable scoring. This prevents noisy scores from one-off turns.

Send representative decisions during staging so that warmup completes before production traffic arrives.
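
The warmup behavior amounts to a per-agent counter with a minimum. In this sketch, `WarmupGate` and the minimum of 20 decisions are assumptions for illustration; check the engine configuration in the Evaluation UI for the value that applies to your deployment.

```python
from collections import Counter

MIN_DECISIONS = 20  # hypothetical minimum; the real threshold is configurable


class WarmupGate:
    """Tracks ingested decisions per agent and reports warmup status."""

    def __init__(self, min_decisions: int = MIN_DECISIONS):
        self.min_decisions = min_decisions
        self._counts = Counter()

    def record(self, agent_id: str) -> None:
        """Count one ingested decision for the agent."""
        self._counts[agent_id] += 1

    def is_warm(self, agent_id: str) -> bool:
        """True once the agent has enough decisions to be scored."""
        return self._counts[agent_id] >= self.min_decisions

    def remaining(self, agent_id: str) -> int:
        """Decisions still needed before scoring starts."""
        return max(0, self.min_decisions - self._counts[agent_id])
```

A staging run can call `remaining()` to confirm that an agent has cleared warmup before it receives production traffic.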

Grades

Grade	Overall score (0–10)
A	≥ 8.5
B	≥ 7.0
C	≥ 5.5
D	≥ 4.0
F	< 4.0
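
The table maps directly onto a threshold lookup. The function name `letter_grade` is hypothetical; the cutoffs are taken from the table above.

```python
# Lower bounds from the grade table, highest first.
GRADE_FLOORS = [("A", 8.5), ("B", 7.0), ("C", 5.5), ("D", 4.0)]


def letter_grade(overall: float) -> str:
    """Convert an overall score (0 to 10) into a letter grade."""
    if not 0.0 <= overall <= 10.0:
        raise ValueError("overall must be between 0 and 10")
    for letter, floor in GRADE_FLOORS:
        if overall >= floor:
            return letter
    return "F"  # anything below 4.0
```

For example, `letter_grade(8.5)` returns `"A"`, while `letter_grade(8.49)` returns `"B"`, because each boundary belongs to the higher grade.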

Related

Reports: Periodic judgment summaries and trend reports for your agents.
