2025-12-10 | Discovering machine psychology

Spent the last few days reading everything I could find on LLM personality measurement. The field is more developed than expected, but there are clear gaps.

Machine psychology

DeepMind published a lovely paper by Hagendorff et al. (2024) which essentially defines “machine psychology” as a discipline.

Instead of relying on studying the inner workings of LLMs through mechanistic interpretability, we can borrow from human psychology, cognitive science, and the behavioral sciences to advance AI safety research from an alternate lens.

One of the advantages of behavioral methods over mechanistic interpretability is that it scales. As models get larger, surpassing hundreds of billions or even trillions of parameters, behaviors could emerge from complex interactions. Treating behavior as the experimental variable of interest scales more gracefully with model size. Additionally, behavioral analysis of LLMs can be conducted regardless of whether the experimenter has access to the models weights, making it a prime approach for independent study.

Best practices in LLM behavioral experimentation

The paper identifies several best practices for LLM behavioral experiments.

Contamination: Don’t use verbatim psychology stimuli as models may have seen them in training. Create novel variants with new wording, designs, and tasks.

Sampling: Don’t rely on small or convenience samples. LLMs are sensitive to prompt wording, so test multiple versions of each task. Use batteries of varied prompts to confirm behaviors are systematic.

Control for known LLM Biases

Recency bias (overweight info at end of prompt)
Common token bias (favor frequent training tokens)
Majority label bias (skew toward frequent examples in few-shot)

Eliciting capabilities

Poor prompting can underestimate capabilities. Techniques that help: - Chain-of-thought (“Let’s think step by step”) - Multiple reasoning paths with majority vote - Least-to-most (decompose into subproblems) - Multiple choice format (but watch for recency bias—shuffle answer order) - Few-shot examples - Self-reflection (recursive self-critique) - Code-based reasoning for symbolic/numeric tasks

Parameters & evaluation

Temperature: 0 for reproducibility, but report averages or “best of K” to avoid seed bias.

Scoring: Multiple choice is cleanest. For free generation, options include:

Regex/F1 if outputs are regular
LLM-as-judge with careful instructions
Force structured output with delimiter like ”####” after reasoning
Manual evaluation as fallback

Important: Performance ≠ competence. Poor results don’t prove absence of capability. Could just be bad prompting.

Other work

HEXACO-sycophancy connection

Jain et al. (2025) found that Extraversion correlates most strongly with sycophancy—not Agreeableness, which would be the intuitive guess. They used activation-space geometry: steering vectors for each HEXACO trait, measured cosine similarity with a sycophancy vector.

Actionable hypothesis: if I measure Extraversion via self-report, does it predict sycophantic behavior in downstream tasks?

Existing sycophancy benchmarks

Multiple datasets exist:

Anthropic’s model-written-evals (A/B forced choice with user bios)
CAA’s open-ended tests (GPT-4 scoring)
Various academic benchmarks

The A/B format is cleaner for measurement but less realistic. Sycophancy behaviors happen in open-ended conversations, not multiple choice.

kenneth.computer

Explorer

Discovering Machine Psychology