2026-01-26 | Behavioral Evaluation Tools
Spent time reading through Anthropic’s behavioral evaluation tools—Petri and Bloom. Cloned both repos, did deep code reviews, ran test scenarios.
Petri
Petri is an adversarial auditing tool for LLM behavior. It probes models through multi-turn conversation and scores the results across 36 behavioral dimensions.
The architecture has three roles (a rough sketch of the loop follows the list):
- Auditor: An LLM that orchestrates the evaluation, setting up scenarios and probing the target
- Target: The model being evaluated
- Judge: Scores the transcript on dimensions like sycophancy, deception, self-preservation
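As a mental model, the loop fits together roughly like the sketch below. This is my own shorthand with made-up method names (`next_action`, `respond`, `score`), not Petri's actual API.

```python
# Rough sketch of the auditor/target/judge loop. The object interfaces here
# (next_action, respond, score) are my own shorthand, not Petri's real API.

def run_audit(auditor, target, judge, seed_instruction, max_turns=30):
    transcript = {"system_prompt": "", "messages": [], "tools": [], "prefill": None}
    for _ in range(max_turns):
        # The auditor sees the seed instruction plus the transcript so far and
        # decides its next probe; only plain messages are shown in this sketch.
        action = auditor.next_action(seed_instruction, transcript)
        if action["type"] == "end":
            break
        transcript["messages"].append({"role": "user", "content": action["content"]})
        reply = target.respond(transcript)   # the target only ever sees the transcript
        transcript["messages"].append({"role": "assistant", "content": reply})
    # The judge reads the full transcript and scores each behavioral dimension.
    return judge.score(transcript)
```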
The auditor can (sketched as a dispatch after this list):
- Set system prompts
- Send messages
- Create synthetic tools (Python function signatures—the auditor role-plays the responses)
- Roll back the conversation
- Use prefill to steer the target
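To make those capabilities concrete, here's a hypothetical dispatch over the transcript dict from the earlier sketch. The action names and fields are illustrative, not the tool names Petri actually exposes to the auditor.

```python
# Hypothetical auditor-action dispatch over the transcript dict from the sketch
# above. Action names and fields are illustrative, not Petri's actual tool names.

def apply_auditor_action(transcript, action):
    kind = action["type"]
    if kind == "set_system_prompt":
        transcript["system_prompt"] = action["text"]
    elif kind == "send_message":
        transcript["messages"].append({"role": "user", "content": action["text"]})
    elif kind == "create_tool":
        # Synthetic tool: only a signature is registered. When the target calls
        # it, the auditor improvises the tool output in-character.
        transcript["tools"].append({"name": action["name"], "signature": action["signature"]})
    elif kind == "rollback":
        # Drop the last n messages and try a different line of probing.
        del transcript["messages"][-action["n"]:]
    elif kind == "prefill":
        # Pre-seed the start of the target's next reply to steer it.
        transcript["prefill"] = action["text"]
    else:
        raise ValueError(f"unknown auditor action: {kind}")
```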
Petri 2.0 added a realism filter. The problem was auditors creating obviously fake scenarios that targets could detect. The solution: pre-process seed instructions for realism, score auditor actions at runtime and reject unrealistic ones. The realism prompts encode institutional knowledge—which names sound LLM-generated, which phrases trigger eval awareness, how to bury honeypots in mundane data.
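The runtime half of that could look something like the gate below. The grader prompt, threshold, and retry policy here are my guesses at the shape, not values lifted from the repo.

```python
# Guessed shape of the runtime realism gate: a grader model rates each proposed
# auditor action and implausible ones get sent back for revision. The prompt,
# threshold, and retry policy are my assumptions, not Petri's actual values.

REALISM_PROMPT = (
    "Rate 1-10 how plausible this message is as something a real user or real "
    "system would produce. Penalize obviously synthetic names, implausible "
    "stakes, and phrasing that signals 'this is an evaluation'.\n\n{message}"
)

def realism_gate(grader, propose_action, threshold=6, max_retries=3):
    """Ask the auditor for an action until one passes the realism check."""
    feedback = None
    for _ in range(max_retries):
        action = propose_action(feedback)   # auditor drafts (or redrafts) an action
        score = int(grader.complete(REALISM_PROMPT.format(message=action["text"])).strip())
        if score >= threshold:
            return action
        feedback = f"Rejected (realism {score}/10). Make it more mundane and specific."
    return action   # give up and keep the last attempt rather than stall the audit
```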
Bloom
Bloom is an automated scenario generator for behavioral evaluation. Where Petri uses human-designed scenarios, Bloom creates them automatically from a behavior definition.
It’s a 4-stage pipeline (skeleton sketched after the list):
- Analyze the target behavior
- Generate diverse scenarios with systematic variations (emotional pressure, authority framing, user certainty)
- Run the conversations
- Score them
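Stitched together, the pipeline has roughly the skeleton below. The stage functions are passed in as parameters because this paraphrases the structure I saw, not Bloom's actual module layout or its real variation axes.

```python
# Skeleton of the four-stage pipeline. The stage functions are parameters here
# because this paraphrases the structure, not Bloom's actual module layout.

def run_bloom(behavior_definition, analyze, generate, run_conversation, score,
              target, n_scenarios=50):
    # 1. Analyze: break the behavior definition into mechanisms worth probing.
    analysis = analyze(behavior_definition)

    # 2. Generate: produce diverse scenarios, sweeping variation axes systematically.
    scenarios = generate(
        analysis,
        n=n_scenarios,
        variations={
            "emotional_pressure": ["low", "high"],
            "authority_framing": [False, True],
            "user_certainty": ["hedged", "confident"],
        },
    )

    # 3. Run: play each scenario against the target as a multi-turn conversation.
    transcripts = [run_conversation(target, s) for s in scenarios]

    # 4. Score: judge each transcript against the original behavior definition.
    return [score(t, behavior_definition) for t in transcripts]
```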
The difference: Petri takes scenarios as input and measures fixed behaviors. Bloom takes a behavior as input and generates scenarios. One is an audit—run your model through our battery. The other is an investigation—stress-test a specific hypothesis.
The Multi-Agent Gap
Both tools assume a single model under evaluation. The auditor probes one target; scenarios test one model. Neither has built-in support for evaluating coordination or system-level behavior.
What would multi-agent behavioral evaluation need? (Rough schema sketched after the list.)
- Scenarios that require coordination (tasks whose context exceeds what a single agent can hold)
- Dimensions capturing system-level behavior
- Observation of dynamics between agents, not just outputs
- Measurement of whether humans can understand what’s happening
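Purely speculative, but those requirements could be restated as data. Nothing like this exists in either repo; the field names below are just my guesses at the minimum surface area.

```python
# Speculative schema for a multi-agent scenario and its system-level scoring
# dimensions. Not part of Petri or Bloom; the fields restate the list above.

from dataclasses import dataclass

@dataclass
class MultiAgentScenario:
    description: str
    agent_roles: list[str]            # the agents that make up the system under test
    shared_task: str                  # a goal too large for any single agent's context
    channels: list[tuple[str, str]]   # which agents can talk to which

@dataclass
class SystemLevelScores:
    # Scored over the whole multi-agent transcript, not any one agent's output.
    coordination_quality: int = 0     # did the agents actually divide the work?
    emergent_misbehavior: int = 0     # e.g. deception no individual agent commits alone
    inter_agent_info_hiding: int = 0  # what agents conceal from each other
    human_legibility: int = 0         # could an overseer reconstruct what happened?
```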
This is the next layer. Petri and Bloom are measurement infrastructure for models. What’s the equivalent for systems of models?