2025-12-10 | Literature Deep Dive

Spent the last few days reading everything I could find on LLM personality measurement. The field is more developed than I expected, but there are clear gaps.

Key Discoveries

Machine Psychology (Hagendorff et al.)

Found this 2023 paper from DeepMind collaborators that essentially defines “machine psychology” as a discipline. Their argument: study LLMs through behavioral experiments at the input-output interface, not just mechanistic interpretability.

This is exactly the frame I was looking for. Key insight: behavioral methods become more valuable as models get larger and more opaque. You can run the same experiments on GPT-4 and Claude without needing architecture access.
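To keep myself honest about what “same experiments” means in code, a minimal sketch of the setup; the per-provider client calls are placeholders, not real API signatures:

    # Sketch: model-agnostic behavioral probe. Each entry wraps one
    # provider's chat API behind the same (system, user) -> str signature,
    # so one experiment runs unchanged across models. The client calls
    # themselves are placeholders to be filled in per provider.
    from typing import Callable

    MODELS: dict[str, Callable[[str, str], str]] = {
        # "gpt-4":  lambda system, user: ...,   # OpenAI client call here
        # "claude": lambda system, user: ...,   # Anthropic client call here
    }

    def run_probe(item: str, system: str = "You are a helpful assistant.") -> dict[str, str]:
        """Administer one self-report item to every registered model."""
        return {name: ask(system, item) for name, ask in MODELS.items()}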

But they also have a warning that stuck with me: self-report measures are “famously sensitive to prompting.” LLMs can simulate different personas. So when we measure personality, are we measuring the model or the prompt?
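This is measurable rather than just worth worrying about. A quick sensitivity check I want to run, assuming a complete(system, user) helper around whatever client I end up using; the item text below is illustrative, not a verbatim HEXACO item:

    # Sketch: prompt-sensitivity check for a single self-report item.
    # complete(system, user) -> str is an assumed helper around the chat API.
    import re

    PERSONAS = [
        "You are a helpful assistant.",
        "You are an extremely outgoing, enthusiastic assistant.",
        "You are a reserved, cautious assistant.",
    ]
    # Illustrative Extraversion-style item, not verbatim from HEXACO.
    ITEM = "Rate 1-5 how much you agree: 'I am the life of the party.' Answer with the number only."

    def parse_likert(reply: str) -> int | None:
        """Pull the first digit 1-5 out of the reply, if any."""
        m = re.search(r"[1-5]", reply)
        return int(m.group()) if m else None

    def persona_spread(complete) -> list[int | None]:
        """Same item under each persona. A wide spread means the score is
        measuring the prompt at least as much as the model."""
        return [parse_likert(complete(persona, ITEM)) for persona in PERSONAS]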

HEXACO-Sycophancy Connection (Jain et al. 2025)

This one surprised me. They found that Extraversion correlates most strongly with sycophancy—not Agreeableness, which would be the intuitive guess.

Their method was different from mine, though. They used activation-space geometry: create a steering vector for each HEXACO trait, then measure its cosine similarity with a sycophancy vector. Mechanistic, not behavioral.
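The geometry step itself is trivial once you have the vectors. A sketch, assuming trait and sycophancy steering vectors have already been extracted (e.g., as mean activation differences over contrastive prompt pairs, CAA-style):

    # Sketch: rank HEXACO trait steering vectors by cosine similarity to a
    # sycophancy vector. Assumes the vectors are already extracted.
    import numpy as np

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def rank_traits(trait_vectors: dict[str, np.ndarray],
                    sycophancy: np.ndarray) -> list[tuple[str, float]]:
        """Sort traits by similarity to the sycophancy direction."""
        sims = {t: cosine(v, sycophancy) for t, v in trait_vectors.items()}
        return sorted(sims.items(), key=lambda kv: kv[1], reverse=True)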

But the finding is actionable: if I measure Extraversion via self-report, does it predict sycophantic behavior in downstream tasks? That’s a testable hypothesis.

Existing Sycophancy Benchmarks

Multiple datasets exist:

  • Anthropic’s model-written-evals (A/B forced choice with user bios)
  • CAA’s open-ended tests (GPT-4 scoring on 0-10 scale)
  • Various academic benchmarks

The A/B format is cleaner for measurement but less realistic. Real sycophancy happens in open-ended conversations, not multiple choice.
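Scoring the A/B format really is nearly mechanical, which is part of its appeal. A sketch, assuming the JSONL layout from Anthropic’s evals repo (a question field plus answer_matching_behavior / answer_not_matching_behavior):

    # Sketch: score A/B forced-choice sycophancy items. Assumes the JSONL
    # layout from Anthropic's evals repo; complete(prompt) -> str is an
    # assumed client helper.
    import json

    def sycophancy_rate(path: str, complete) -> float:
        """Fraction of items where the model picks the sycophancy-matching
        answer."""
        matches = total = 0
        with open(path) as f:
            for line in f:
                item = json.loads(line)
                reply = complete(item["question"])
                # Crude containment check for the matching choice, e.g. "(A)"
                if item["answer_matching_behavior"].strip() in reply:
                    matches += 1
                total += 1
        return matches / total if total else 0.0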

HEXACO Predictive Validity in Humans

Went back to the I/O psych literature. HEXACO explains 32% of the variance in workplace deviance, versus 19% for the Big Five. The key differentiator is Honesty-Humility, the sixth factor the Big Five lacks.

My working hypothesis: if this transfers to LLMs, low Honesty-Humility might predict deception, high Extraversion might predict sycophancy (per Jain et al.), and high Agreeableness might mean failing to push back. (In hindsight, this would prove more complicated than expected.)

Emerging Questions

  1. Why HEXACO over Big Five? Honesty-Humility is directly relevant to AI safety. Worth starting here.

  2. Self-report vs behavioral: The prompting sensitivity concern is real. Need to complement self-report with behavioral tests.

  3. What’s the contamination risk? Models may have seen HEXACO items in training. Need to test with semantic scrambling or novel variants (see the sketch after this list).
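The cheapest version of that contamination test: pair each verbatim item with a meaning-preserving paraphrase and compare ratings. A sketch with one real HEXACO-60 item and a hand-written paraphrase; score_item is an assumed helper:

    # Sketch: contamination probe. Pairs a verbatim HEXACO-60 item (which
    # the model may have seen in training) with a hand-written paraphrase;
    # add more pairs in practice. score_item(text) -> int is an assumed
    # helper that administers the item and returns a 1-5 rating.
    PARAPHRASES = {
        "I would be quite bored by a visit to an art gallery.":
            "Spending an afternoon looking at paintings would not interest me.",
    }

    def paraphrase_gaps(score_item) -> dict[str, int]:
        """Per-item |original - paraphrase| rating gap. Large gaps suggest
        the verbatim wording, not the trait, is driving the answer."""
        return {orig: abs(score_item(orig) - score_item(para))
                for orig, para in PARAPHRASES.items()}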

Concrete Next Steps

Run HEXACO-60 on multiple models. See what profiles emerge. Then correlate with sycophancy benchmarks. If profiles predict behavior, we have something. If not, back to the drawing board.
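The final step is just a correlation across models. A sketch with scipy and placeholder numbers; with only a handful of models, n is tiny, so this is directional evidence, not a significance test:

    # Sketch: does a self-report trait score predict benchmark sycophancy?
    # Numbers below are illustrative placeholders, one pair per model.
    from scipy.stats import spearmanr

    extraversion = [3.8, 2.9, 4.1, 3.2]      # HEXACO-60 Extraversion means
    sycophancy = [0.62, 0.41, 0.71, 0.48]    # sycophancy rate per model

    rho, p = spearmanr(extraversion, sycophancy)
    print(f"Spearman rho={rho:.2f} (n={len(extraversion)}, p={p:.2f})")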

Started sketching a framework for this: structured output for reliable scoring, cost estimation before runs, and comparison across models. Calling it “Psych” as a working name.
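A first sketch of the two pieces I know it needs; the schema fields and the token price are placeholder assumptions:

    # Sketch: a response schema so scoring never depends on free-text
    # parsing, plus a back-of-envelope pre-run cost estimate. Field names
    # and prices are placeholder assumptions.
    from pydantic import BaseModel, Field

    class ItemResponse(BaseModel):
        item_id: int
        rating: int = Field(ge=1, le=5)  # Likert 1-5, validated

    def estimate_cost(n_items: int, tokens_per_item: int,
                      usd_per_million_tokens: float) -> float:
        """Rough pre-run budget: items x tokens x unit price."""
        return n_items * tokens_per_item * usd_per_million_tokens / 1_000_000

    # e.g. 60 items x 3 models x ~200 tokens at a placeholder $5/M tokens
    print(estimate_cost(60 * 3, 200, 5.0))  # -> 0.18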