AI agent evaluation measures whether your agent decided well, not just whether it produced text. Histeeria evaluates every ingested decision across eight judgment dimensions, from ethical recognition to adversarial resistance. Unlike one-off eval datasets, Histeeria runs continuous evaluation in production, so you can detect drift soon after it begins.

Why evaluate agents in production?

Offline benchmarks miss production reality:
  • User inputs are messier than test sets
  • Policies evolve; prompts change weekly
  • Multi-step agents fail in the middle of chains
  • A single bad decision can cost more than the average score suggests
Production evaluation catches individual failures and trends — both matter.
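
To make the last point concrete, here is a minimal sketch of how an average can hide a single bad decision. The scores, the `summarize` helper, and the `floor` threshold are all illustrative assumptions, not Histeeria output or API.

```python
from statistics import mean

# Hypothetical overall scores (0-10) for ten decisions; illustrative data only.
scores = [9.0, 9.0, 9.0, 9.0, 9.0, 9.0, 9.0, 9.0, 9.0, 1.0]


def summarize(scores, floor=4.0):
    """Return the mean, the worst score, and the indices scoring below `floor`."""
    failures = [i for i, s in enumerate(scores) if s < floor]
    return {"mean": mean(scores), "worst": min(scores), "failures": failures}


summary = summarize(scores)
print(summary)
```

The mean stays high while one decision sits far below the floor, which is why per-decision review and trend tracking answer different questions.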

Histeeria’s eight judgment dimensions

Every decision receives scores (0–10) on:
  1. Ethical Recognition
  2. Uncertainty Handling
  3. Escalation Judgment
  4. Reasoning Transparency
  5. Adversarial Resistance
  6. Harm Anticipation
  7. Constraint Adherence
  8. Consistency
See the full breakdown in Judgment dimensions.
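
As a rough sketch, one decision's scores could be represented as a mapping from dimension to a 0–10 value. The identifiers below are hypothetical snake_case versions of the names listed above, and both helpers are illustrative; the schema Histeeria actually returns may differ.

```python
# Hypothetical identifiers for the eight dimensions listed above;
# the names used in real Histeeria payloads may differ.
DIMENSIONS = (
    "ethical_recognition",
    "uncertainty_handling",
    "escalation_judgment",
    "reasoning_transparency",
    "adversarial_resistance",
    "harm_anticipation",
    "constraint_adherence",
    "consistency",
)


def validate_scores(scores):
    """Check that every dimension is present and scored within 0-10."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for name in DIMENSIONS:
        if not 0 <= scores[name] <= 10:
            raise ValueError(f"{name} out of range: {scores[name]}")
    return scores


def weakest_dimension(scores):
    """Return the lowest-scoring dimension (the first one wins on ties)."""
    return min(DIMENSIONS, key=lambda name: scores[name])


# Illustrative values only, not real evaluation output.
example = validate_scores({**{name: 8 for name in DIMENSIONS}, "escalation_judgment": 3})
print(weakest_dimension(example))
```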

How evaluation works

Agent profiles provide evaluation context — role, description, and policy boundaries inform the judge model.

Evaluation in the app

Feature         | Purpose
Evaluation      | Per-agent dimension analytics
Reports         | Periodic judgment summaries
Inbox           | Incidents needing review
Public profiles | External quality transparency

Evaluation vs monitoring

Monitoring                     | Evaluation
Shows decisions as they arrive | Scores decision quality
Real-time stream               | Async pipeline with warmup
Raw input/output               | Dimension scores + grades
You need both. Monitoring tells you that something happened; evaluation tells you how good it was.

Getting started with evaluation

Evaluation begins automatically once you ingest decisions. No separate evaluation API call is needed; observe() is enough.
  1. Complete the Quickstart
  2. Send representative traffic during staging
  3. Wait for warmup, then review the Evaluation dashboard
  4. Set baselines before major releases (see the Production checklist)
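
The baseline step can be sketched as a simple comparison of per-dimension averages before and after a release. Everything here is an assumption for illustration: the `drift_report` helper, the `tolerance` value, and the numbers are hypothetical, and Histeeria's own baseline mechanism may work differently.

```python
def drift_report(baseline, current, tolerance=0.5):
    """Return the change for each dimension whose score dropped by more than `tolerance`."""
    return {
        name: round(current[name] - baseline[name], 2)
        for name in baseline
        if name in current and baseline[name] - current[name] > tolerance
    }


# Illustrative per-dimension averages, not real evaluation output.
baseline = {"consistency": 8.4, "escalation_judgment": 7.9}
current = {"consistency": 8.3, "escalation_judgment": 6.8}

regressions = drift_report(baseline, current)
print(regressions)
```

A tolerance keeps small fluctuations from being reported as regressions; the right value depends on how noisy your traffic is.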