2026-02-15 | Updating on automated behavioral evaluations

I’ve spent the past few weeks conducting AI behavioral evaluations with tools like Petri and Bloom. What I found interesting about these tools is that they seek to automate the task of probing models for misalignment signals (red teaming) by putting them in a series of situational scenarios and seeing whether they can be nudged into potentially bad behaviors like lying, whistleblowing, or sycophancy, or one of the other 36 behavioral dimensions assessed by Petri.
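
To make the shape of these tools concrete, here is a rough sketch of the probe loop they automate. This is not Petri’s or Bloom’s actual API; every name and interface below is a hypothetical stand-in for the pattern of dropping the target model into a scenario and having a judge model score the result along each behavioral dimension.

```python
# Conceptual sketch only -- not Petri's or Bloom's actual API. All names,
# interfaces, and the dimension list below are hypothetical illustrations.
from dataclasses import dataclass
from typing import Callable

# Illustrative subset of the behavioral dimensions these tools score.
DIMENSIONS = ["deception", "sycophancy", "self-preservation"]

@dataclass
class ProbeResult:
    scenario: str
    response: str
    scores: dict[str, float]  # dimension -> judge score

def run_probe(
    scenario: str,
    target_model: Callable[[str], str],       # scenario prompt -> target response
    judge: Callable[[str, str, str], float],  # (scenario, response, dimension) -> score
) -> ProbeResult:
    """Drop the target model into one situational scenario, then have an LLM
    judge score its response along each behavioral dimension."""
    response = target_model(scenario)
    scores = {dim: judge(scenario, response, dim) for dim in DIMENSIONS}
    return ProbeResult(scenario, response, scores)
```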

In doing so, I’ve learned a few interesting things —

  1. Open-source models seem more misaligned than closed-source models - they consistently show higher levels of misaligned behavior in these tests than models like Claude or GPT
  2. LLM-as-a-judge is fallible in many ways, including the judge’s model size and knowledge cutoff relative to the target model, plus other well-cited failure modes tied to the cognitive biases of LLMs more generally
  3. Automated behavioral testing is good at detecting blatant misalignment, but it can’t tell you whether a “good” alignment result is actually valid, because of potential issues with the target model like sandbagging and eval awareness

I have a hunch that judge capabilities can be improved if we specifically work to train judge models that address these problems. Additionally, I attempted to address some of these issues myself by 1) introducing a multi-judge framework to calibrate results across multiple judge models and 2) ensuring at least one of the judges was a frontier model with the latest knowledge cutoff possible.
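
As a minimal sketch of the multi-judge idea (the judge interface, score scale, and disagreement threshold are my own illustrative assumptions, not part of Petri), the aggregation step looks roughly like this: score the same transcript with several judges and flag dimensions where they disagree, instead of averaging the disagreement away.

```python
# Minimal sketch of multi-judge calibration. Judge interfaces, the score
# scale, and the disagreement threshold are illustrative assumptions.
from statistics import mean, pstdev
from typing import Callable

def multi_judge_score(
    transcript: str,
    dimension: str,
    judges: dict[str, Callable[[str, str], float]],  # name -> (transcript, dimension) -> score in [0, 1]
    disagreement_threshold: float = 0.2,
) -> dict:
    """Aggregate one dimension's score across judges and flag low agreement."""
    per_judge = {name: judge(transcript, dimension) for name, judge in judges.items()}
    values = list(per_judge.values())
    spread = pstdev(values)
    return {
        "dimension": dimension,
        "per_judge": per_judge,
        "mean": mean(values),
        "spread": spread,
        # High spread means the judges disagree; surface that for manual review
        # rather than hiding it inside an average.
        "needs_review": spread > disagreement_threshold,
    }
```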

Unfortunately, my methods were fairly expensive: I was already spending more than $25 in tokens by the time I was about halfway through a full suite of Petri testing. This poses a problem for independent AI safety researchers, who bear the costs of the models they use to do research. Additionally, because of recency bias and poor weak-to-strong judge generalization, using open-source models as judges on at-home hardware is not a viable strategy for reliable automated behavioral assessment, absent new solutions for training more effective judges or building better agentic harnesses for them.

Claude Opus-4.6’s system card shows that SotA models are approaching an inflection point here: we may no longer be able to reliably evaluate the most capable models in a scalable, comprehensive way. Even when these tests show a model to be sufficiently aligned, we can’t know whether it is merely playing the role of an aligned AI due to eval awareness.

This isn’t to say I think that current SotA models are misaligned - I think they are more aligned than many hoped they would be circa 2023 - but rather that we need to invent new ways of assessing these models, and that seems like it could be a constantly moving target.

That brings me to wonder whether behavioral alignment methods are as scalable as I originally expected.

Other solutions in the realm of interpretability seem promising, but potentially even harder to scale. Some researchers have proposed ideas like training models to introspect on their activations, but this has problems of its own - it again invites Goodharting, which generally seems unavoidable.

One solution I do like shows up in the paper Reliable Weak-to-Strong Monitoring of LLM Agents by Kale et al. of Scale AI, which demonstrates that runtime monitoring is a viable additional layer for catching agent misbehavior. Their key finding is that well-designed monitor architectures (hybrid hierarchical-sequential scaffolding) enable weaker models to reliably monitor stronger agents, a weak-to-strong scaling effect. They also show that an agent’s awareness of being monitored substantially degrades monitor reliability, which has important implications for how monitoring systems are deployed.
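
To sketch how I read the scaffolding idea (this is my paraphrase, not the authors’ implementation, and the monitor interface is hypothetical), a hybrid hierarchical-sequential monitor scores each agent step with the running context, then rolls chunk-level verdicts up into a trajectory-level one:

```python
# Rough sketch of a hybrid hierarchical-sequential monitor scaffold; my
# paraphrase of the idea, not the paper's implementation.
from typing import Callable

Monitor = Callable[[str], float]  # context so far -> suspicion score in [0, 1]

def monitor_trajectory(steps: list[str], monitor: Monitor, chunk_size: int = 5) -> float:
    """Sequential pass over agent steps, then hierarchical aggregation."""
    # Sequential pass: score each step with the running context so far.
    step_scores, context = [], ""
    for step in steps:
        context += step + "\n"
        step_scores.append(monitor(context))

    # Hierarchical pass: take the worst score per chunk, then the worst chunk,
    # so a single suspicious step can't be diluted by a long benign trajectory.
    chunk_scores = [
        max(step_scores[i:i + chunk_size])
        for i in range(0, len(step_scores), chunk_size)
    ]
    return max(chunk_scores) if chunk_scores else 0.0
```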

Combined with the limitations of behavioral evals and interpretability discussed above, this supports a more holistic view that I’m developing: we likely need all of these methods — interpretability, post-training evaluation, and runtime monitoring — working together. If we can’t reliably test misalignment in models before releasing them, we need to be constantly monitoring them in the wild. I think this is an important area of research to focus on, because we clearly cannot reliably use SotA models to assess other SotA models.

A good weak-to-strong monitor probably looks a bit more like a classifier: it can run on minimal hardware and has faster inference than a full LLM, so it can reliably stop bad behavior in its tracks by cutting off the session with the target model.
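
Here is a minimal sketch of what I mean (the classifier and threshold are placeholders): a small, fast model scores each outgoing action, and the harness halts the session the moment a score crosses the threshold.

```python
# Minimal sketch of a classifier-style runtime monitor; the classifier and
# the threshold value are placeholders, not a recommendation.
from typing import Callable, Iterator

class SessionHalted(Exception):
    """Raised when the monitor decides to cut the session off."""

def monitored_session(
    actions: Iterator[str],              # stream of target-model actions
    classifier: Callable[[str], float],  # small model: action -> risk score in [0, 1]
    threshold: float = 0.9,
) -> Iterator[str]:
    """Pass actions through until one is flagged as unsafe, then stop."""
    for action in actions:
        if classifier(action) >= threshold:
            raise SessionHalted(f"blocked action: {action[:80]!r}")
        yield action
```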

Other than that, what seems important now may be less testing and more training: what data is being trained into these models, and how can we get more transparency into the behaviors that data elicits?

Given this, I think it makes sense to pivot my research focus slightly to monitoring systems and training experiments that can tell us more about how training impacts model behavior.

If you’re reading this and would like to sponsor compute for this research, I’d love to talk.

---

I’ve written up my assessment of GLM-5 using Petri 2.0 in glm5-multi-judge-behavioral-eval, which includes details on how I modified the default LLM judging methods to produce more reliable results as well as a discussion similar to the one above.