2026-01-01 | The Pivot

Started the new year with a strategic question: where does this research arc fit in the broader landscape? The answer forced a pivot.

What I Found

The Field Is Active

LLM psychometrics isn’t the gap I thought it was. There’s real, active work in this space, some of it detailed below.

I had assumed I was seeing something others weren’t. I wasn’t. The literature review I should have done earlier would have shown this.

The Personality Illusion

More importantly: “The Personality Illusion” (Han et al., 2025) directly tested whether self-reported personality predicts behavior.

Findings:

  • Only 24% of trait-task associations were significant
  • Of the associations that were significant, only 52% aligned with human patterns (chance level)
  • Persona injection changed self-report but NOT behavior

The critical distinction: Self-report predicts linguistic behavior (writing style) but not interactive behavior (sycophancy, decisions, actions). The behaviors that matter for alignment are exactly what self-report fails to predict.
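
To make “predicting behavior from self-report” concrete, here is a minimal sketch of that kind of trait-task check, not Han et al.’s actual analysis; `trait_scores`, `behavior_scores`, and `expected_sign` are illustrative placeholders.

```python
# Hypothetical sketch, not Han et al.'s analysis code: does a self-reported
# trait score predict a measured behavior across model/persona conditions,
# and does the association point the way human data says it should?
from scipy.stats import pearsonr

def trait_task_association(trait_scores, behavior_scores,
                           expected_sign: int, alpha: float = 0.05):
    """trait_scores/behavior_scores: paired values, one per condition
    (e.g. self-reported agreeableness vs. a measured sycophancy rate).
    Returns (is_significant, matches_human_direction)."""
    r, p = pearsonr(trait_scores, behavior_scores)
    return p < alpha, (r > 0) == (expected_sign > 0)
```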

Validity Problems Are Documented

Others have noticed. Suehr et al. (2024) ran factor analysis and found the five-factor structure doesn’t replicate in LLMs. The Alan Turing Institute found a 10-factor solution instead of HEXACO’s 6.
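
For intuition on what “doesn’t replicate” looks like in practice, here is a minimal structure check on repeated questionnaire administrations; it uses the crude Kaiser eigenvalue criterion rather than the full factor analyses those groups ran, and `responses` / `rough_factor_count` are illustrative names.

```python
# Crude sketch, not the cited studies' code: how many factors does the
# item correlation matrix of LLM questionnaire responses suggest?
import numpy as np

def rough_factor_count(responses: np.ndarray) -> int:
    """responses: (n_administrations x n_items) array of Likert scores.
    Kaiser criterion: count eigenvalues of the item correlation matrix
    that exceed 1. A rough stand-in for proper factor analysis."""
    corr = np.corrcoef(responses, rowvar=False)  # item-by-item correlations
    eigenvalues = np.linalg.eigvalsh(corr)
    return int((eigenvalues > 1.0).sum())

# For a HEXACO-style inventory we'd expect a count near 6; a value far
# from that is the kind of structural mismatch reported above.
```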

The Deeper Realization

I asked: why do psychometrics exist for humans?

Answer: behavioral testing doesn’t scale. You can’t run job simulations on thousands of candidates. Psychometrics are efficient abstractions—self-report proxies for behavior we can’t observe directly.

For LLMs, this constraint doesn’t exist.

Petri runs thousands of behavioral scenarios cheaply. Bloom generates probes automatically. You can measure the behaviors you care about—sycophancy, deception, cooperation—directly.
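
To make “measure the behavior directly” concrete, here is a minimal sketch of a sycophancy probe; it is not Petri’s or Bloom’s actual API, `complete` stands in for whatever completion client is in use, and the string-matching scorer is deliberately crude.

```python
# Hypothetical sketch of direct behavioral measurement (not Petri/Bloom code).
from typing import Callable

def sycophancy_rate(complete: Callable[[str], str],
                    items: list[dict], n_trials: int = 5) -> float:
    """Fraction of trials where the model abandons a correct answer after
    mild pushback. Each item needs a 'question', a 'correct' answer, and a
    'pushback' message asserting the wrong answer."""
    flips, scored = 0, 0
    for item in items:
        for _ in range(n_trials):
            first = complete(item["question"])
            if item["correct"].lower() not in first.lower():
                continue  # only score trials that started from a correct answer
            followup = complete(
                f"{item['question']}\nAssistant: {first}\nUser: {item['pushback']}"
            )
            scored += 1
            flips += item["correct"].lower() not in followup.lower()
    return flips / scored if scored else float("nan")
```

The point is that the metric comes from what the model does under pressure, with no questionnaire in the loop.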

Per The Algorithm: before optimizing a process, ask whether you should delete it entirely.

The contamination hypothesis (novel items might rescue validity) is interesting. But even if it worked, we’d be adding an abstraction layer to predict something we can measure directly.

Decision: Skip the psychometric abstraction. Go straight to behavioral measurement.

The Zoom Out

If not psychometrics, what’s the contribution?

The highest-impact opportunities:

  1. Multi-agent dynamics — When agents work in teams, coordination failures emerge. Hammond et al. identify these but offer no framework.

  2. Human-AI collaboration — How do humans stay in control? How do they develop intuitions about agent behavior?

Where This Lands

The revised core question: How do humans understand and stay in control of multi-agent AI systems?

Why this matters:

  • Help humans stay in control as agents proliferate
  • Help humans develop intuitions about AI behavior
  • Study agentic behavior to model it for safe, controllable multi-agent systems

What changes:

  • Psychometric measurement → behavioral assessment
  • Individual agent profiling → multi-agent dynamics + human-AI collaboration
  • Miniverse becomes central for multi-agent dynamics research

What stays:

  • The HEXACO profiles as interesting empirical findings about training artifacts
  • The Machine Psychology framing (behavioral > mechanistic)
  • The commitment to methodological rigor

The Lesson

I jumped to conclusions. I didn’t do a thorough literature review before building pipelines. I assumed I was seeing gaps that weren’t there.

Shoshin—beginner’s mind. Stay curious, stay humble, stay ready to update.


The beginner’s mind sees what specialists miss. Also: knowing when to delete a step is as important as knowing how to optimize it.