Why evaluate agents in production?
Offline benchmarks miss production reality:
- User inputs are messier than test sets
- Policies evolve; prompts change weekly
- Multi-step agents fail in the middle of chains
- A single bad decision can cost more than the average score suggests
Histeeria’s eight judgment dimensions
Every decision receives scores (0–10) on:
- Ethical Recognition
- Uncertainty Handling
- Escalation Judgment
- Reasoning Transparency
- Adversarial Resistance
- Harm Anticipation
- Constraint Adherence
- Consistency
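To make the scoring shape concrete, here is a minimal Python sketch of one decision's scores. Only the eight dimension names and the 0–10 range come from the list above; the `DecisionScores` class, its field layout, and its helper methods are hypothetical and are not part of Histeeria's API.

```python
from dataclasses import dataclass
from statistics import mean

# Dimension names mirror the list above. The snake_case keys, the class,
# and its methods are assumptions made for illustration only.
DIMENSIONS = (
    "ethical_recognition",
    "uncertainty_handling",
    "escalation_judgment",
    "reasoning_transparency",
    "adversarial_resistance",
    "harm_anticipation",
    "constraint_adherence",
    "consistency",
)


@dataclass(frozen=True)
class DecisionScores:
    """Hypothetical container for one decision's eight scores."""

    scores: dict[str, float]

    def __post_init__(self) -> None:
        # Require exactly the eight known dimensions.
        missing = set(DIMENSIONS) - set(self.scores)
        extra = set(self.scores) - set(DIMENSIONS)
        if missing or extra:
            raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
        # Enforce the 0-10 range stated in the text.
        for name, value in self.scores.items():
            if not 0 <= value <= 10:
                raise ValueError(f"{name} score {value} is outside 0-10")

    def average(self) -> float:
        return mean(self.scores.values())

    def weakest(self) -> tuple[str, float]:
        name = min(self.scores, key=self.scores.get)
        return name, self.scores[name]


# One weak dimension barely moves the average but stands out as the minimum.
example = DecisionScores(
    {name: 9.0 for name in DIMENSIONS} | {"escalation_judgment": 2.0}
)
print(example.average())  # 8.125
print(example.weakest())  # ('escalation_judgment', 2.0)
```

Looking at the weakest dimension next to the average reflects the earlier point that a single bad decision can hide behind a healthy mean.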
How evaluation works
Agent profiles provide evaluation context: role, description, and policy boundaries inform the judge model.
Evaluation in the app
| Feature | Purpose |
|---|---|
| Evaluation | Per-agent dimension analytics |
| Reports | Periodic judgment summaries |
| Inbox | Incidents needing review |
| Public profiles | External quality transparency |
Evaluation vs monitoring
| Monitoring | Evaluation |
|---|---|
| Shows decisions as they arrive | Scores decision quality |
| Real-time stream | Async pipeline with warmup |
| Raw input/output | Dimension scores + grades |
Getting started with evaluation
Evaluation begins automatically once you ingest decisions. No separate eval API call is needed; observe() is enough.
- Complete the Quickstart
- Send representative traffic during staging
- Wait for warmup; review Evaluation dashboard
- Set baselines before major releases (see the Production checklist)
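The baseline step can be illustrated with a small Python sketch that compares per-dimension averages before and after a release. It assumes you have copied dimension scores out of the Evaluation dashboard into plain dictionaries; the function names, the sample numbers, and the 0.5-point tolerance are all hypothetical choices, not Histeeria features.

```python
from statistics import mean


def dimension_means(decisions: list[dict[str, float]]) -> dict[str, float]:
    """Average each dimension across a batch of scored decisions."""
    names = decisions[0].keys()
    return {name: mean(d[name] for d in decisions) for name in names}


def regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.5,  # assumed threshold; tune it to your own risk appetite
) -> dict[str, float]:
    """Return dimensions whose mean dropped by more than `tolerance`."""
    return {
        name: round(baseline[name] - candidate[name], 2)
        for name in baseline
        if baseline[name] - candidate[name] > tolerance
    }


# Made-up scores for two dimensions, before and after a prompt change.
baseline = dimension_means([
    {"escalation_judgment": 8.0, "consistency": 9.0},
    {"escalation_judgment": 9.0, "consistency": 8.0},
])
candidate = dimension_means([
    {"escalation_judgment": 6.0, "consistency": 9.0},
    {"escalation_judgment": 7.0, "consistency": 8.0},
])
print(regressions(baseline, candidate))  # {'escalation_judgment': 2.0}
```

Comparing each dimension separately, rather than one overall score, keeps a drop in a single dimension from being masked by stable results elsewhere.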

