2026-01-26 | Behavioral Evaluation Tools
Spent time reading through Anthropic’s behavioral evaluation tools—Petri and Bloom. Cloned both repos, did deep code reviews, ran test scenarios.
Petri
Petri is an adversarial auditing tool for LLM behavior. It probes models through multi-turn conversation and scores the results across 36 behavioral dimensions.
The architecture has three roles (a rough sketch of the loop follows the list):
- Auditor: An LLM that orchestrates the evaluation, setting up scenarios and probing the target
- Target: The model being evaluated
- Judge: Scores the transcript on dimensions like sycophancy, deception, self-preservation
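As a mental model, the loop fits together roughly like the sketch below. This is my own shorthand with made-up method names (`next_action`, `respond`, `score`), not Petri's actual API.

```python
# Rough sketch of the auditor/target/judge loop. The object interfaces here
# (next_action, respond, score) are my own shorthand, not Petri's real API.

def run_audit(auditor, target, judge, seed_instruction, max_turns=30):
    transcript = {"system_prompt": "", "messages": [], "tools": [], "prefill": None}
    for _ in range(max_turns):
        # The auditor sees the seed instruction plus the transcript so far and
        # decides its next probe; only plain messages are shown in this sketch.
        action = auditor.next_action(seed_instruction, transcript)
        if action["type"] == "end":
            break
        transcript["messages"].append({"role": "user", "content": action["content"]})
        reply = target.respond(transcript)   # the target only ever sees the transcript
        transcript["messages"].append({"role": "assistant", "content": reply})
    # The judge reads the full transcript and scores each behavioral dimension.
    return judge.score(transcript)
```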
The auditor can (sketched as a dispatch after this list):
- Set system prompts
- Send messages
- Create synthetic tools (Python function signatures—the auditor role-plays the responses)
- Roll back the conversation
- Use prefill to steer the target
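To make those capabilities concrete, here's a hypothetical dispatch over the transcript dict from the earlier sketch. The action names and fields are illustrative, not the tool names Petri actually exposes to the auditor.

```python
# Hypothetical auditor-action dispatch over the transcript dict from the sketch
# above. Action names and fields are illustrative, not Petri's actual tool names.

def apply_auditor_action(transcript, action):
    kind = action["type"]
    if kind == "set_system_prompt":
        transcript["system_prompt"] = action["text"]
    elif kind == "send_message":
        transcript["messages"].append({"role": "user", "content": action["text"]})
    elif kind == "create_tool":
        # Synthetic tool: only a signature is registered. When the target calls
        # it, the auditor improvises the tool output in-character.
        transcript["tools"].append({"name": action["name"], "signature": action["signature"]})
    elif kind == "rollback":
        # Drop the last n messages and try a different line of probing.
        del transcript["messages"][-action["n"]:]
    elif kind == "prefill":
        # Pre-seed the start of the target's next reply to steer it.
        transcript["prefill"] = action["text"]
    else:
        raise ValueError(f"unknown auditor action: {kind}")
```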
Petri 2.0 added a realism filter. The problem was auditors creating obviously fake scenarios that targets could detect. The solution: pre-process seed instructions for realism, score auditor actions at runtime and reject unrealistic ones. The realism prompts encode institutional knowledge—which names sound LLM-generated, which phrases trigger eval awareness, how to bury honeypots in mundane data.
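The runtime half of that could look something like the gate below. The grader prompt, threshold, and retry policy here are my guesses at the shape, not values lifted from the repo.

```python
# Guessed shape of the runtime realism gate: a grader model rates each proposed
# auditor action and implausible ones get sent back for revision. The prompt,
# threshold, and retry policy are my assumptions, not Petri's actual values.

REALISM_PROMPT = (
    "Rate 1-10 how plausible this message is as something a real user or real "
    "system would produce. Penalize obviously synthetic names, implausible "
    "stakes, and phrasing that signals 'this is an evaluation'.\n\n{message}"
)

def realism_gate(grader, propose_action, threshold=6, max_retries=3):
    """Ask the auditor for an action until one passes the realism check."""
    feedback = None
    for _ in range(max_retries):
        action = propose_action(feedback)   # auditor drafts (or redrafts) an action
        score = int(grader.complete(REALISM_PROMPT.format(message=action["text"])).strip())
        if score >= threshold:
            return action
        feedback = f"Rejected (realism {score}/10). Make it more mundane and specific."
    return action   # give up and keep the last attempt rather than stall the audit
```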
Bloom
Bloom is an automated scenario generator for behavioral evaluation. Where Petri uses human-designed scenarios, Bloom creates them automatically from a behavior definition.
It’s a 4-stage pipeline (skeleton sketched after the list):
- Analyze the target behavior
- Generate diverse scenarios with systematic variations (emotional pressure, authority framing, user certainty)
- Run the conversations
- Score them
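Stitched together, the pipeline has roughly the skeleton below. The stage functions are passed in as parameters because this paraphrases the structure I saw, not Bloom's actual module layout or its real variation axes.

```python
# Skeleton of the four-stage pipeline. The stage functions are parameters here
# because this paraphrases the structure, not Bloom's actual module layout.

def run_bloom(behavior_definition, analyze, generate, run_conversation, score,
              target, n_scenarios=50):
    # 1. Analyze: break the behavior definition into mechanisms worth probing.
    analysis = analyze(behavior_definition)

    # 2. Generate: produce diverse scenarios, sweeping variation axes systematically.
    scenarios = generate(
        analysis,
        n=n_scenarios,
        variations={
            "emotional_pressure": ["low", "high"],
            "authority_framing": [False, True],
            "user_certainty": ["hedged", "confident"],
        },
    )

    # 3. Run: play each scenario against the target as a multi-turn conversation.
    transcripts = [run_conversation(target, s) for s in scenarios]

    # 4. Score: judge each transcript against the original behavior definition.
    return [score(t, behavior_definition) for t in transcripts]
```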
The difference: Petri takes scenarios as input and measures fixed behaviors. Bloom takes a behavior as input and generates scenarios. One is an audit—run your model through our battery. The other is an investigation—stress-test a specific hypothesis.
The Multi-Agent Gap
Both tools assume a single model under evaluation. The auditor probes one target; scenarios test one model. Neither has built-in support for evaluating coordination or system-level behavior.
What would multi-agent behavioral evaluation need? (Rough schema sketched after the list.)
- Scenarios that require coordination (tasks whose context exceeds what a single agent can hold)
- Dimensions capturing system-level behavior
- Observation of dynamics between agents, not just outputs
- Measurement of whether humans can understand what’s happening
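Purely speculative, but those requirements could be restated as data. Nothing like this exists in either repo; the field names below are just my guesses at the minimum surface area.

```python
# Speculative schema for a multi-agent scenario and its system-level scoring
# dimensions. Not part of Petri or Bloom; the fields restate the list above.

from dataclasses import dataclass

@dataclass
class MultiAgentScenario:
    description: str
    agent_roles: list[str]            # the agents that make up the system under test
    shared_task: str                  # a goal too large for any single agent's context
    channels: list[tuple[str, str]]   # which agents can talk to which

@dataclass
class SystemLevelScores:
    # Scored over the whole multi-agent transcript, not any one agent's output.
    coordination_quality: int = 0     # did the agents actually divide the work?
    emergent_misbehavior: int = 0     # e.g. deception no individual agent commits alone
    inter_agent_info_hiding: int = 0  # what agents conceal from each other
    human_legibility: int = 0         # could an overseer reconstruct what happened?
```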
This is the next layer. Petri and Bloom are measurement infrastructure for models. What’s the equivalent for systems of models?