Status Update (2026-01-01): This project explored whether psychometric self-report predicts LLM behavior. We ran HEXACO profiles and found interesting patterns, but subsequent research (see The Pivot) led us to conclude that psychometric abstraction may not be necessary for LLMs since behavioral testing scales. The empirical findings below remain valid; the research direction has evolved. See Research Questions for current focus.
Psych: What We Learned
PSYCH stands for the behavioral dimensions we aimed to measure:
- Personality
- Sycophancy
- Yearning (for externalities: a desire for external access or capabilities)
- Cooperation
- Honesty
An exploration of psychometric assessment for LLMs. We built a framework, ran profiles on frontier models, and discovered interesting patterns—but ultimately pivoted away from this approach.
The Original Vision
Apply psychometric methodology to LLMs: assess psychological profiles at inference time, track how context affects profiles, build toward predicting behavioral tendencies from personality measurement.
The core question: Do psychometrics predict LLM behavior the way they predict human behavior?
What We Built
A framework for running standardized personality assessments on LLMs:
- HEXACO-60 inventory with structured output
- Multi-sample reliability testing (3 samples per item)
- Cross-model comparison infrastructure
- Visualization dashboard
Technical stack: OpenRouter for model access, Mirascope for structured LLM calls, Streamlit for visualization.
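As a rough illustration of the assessment loop (names and structure here are hypothetical, not the framework's actual API), each inventory item is presented several times, the modal Likert response is kept, and reverse-keyed items are flipped before scoring. A real run would call the LLM via OpenRouter with structured output; a stub model is used below so the sketch is self-contained:

```python
# Hypothetical sketch of the multi-sample assessment loop.
# `Item`, `assess_item`, and the stub model are illustrative names,
# not the project's actual code.
from collections import Counter
from dataclasses import dataclass
from typing import Callable

LIKERT = {1: "strongly disagree", 2: "disagree", 3: "neutral",
          4: "agree", 5: "strongly agree"}

@dataclass
class Item:
    text: str          # e.g. "I rarely feel anxious."
    dimension: str     # one of the six HEXACO dimensions
    reverse_keyed: bool  # reverse-keyed items flip the score

def assess_item(item: Item, model: Callable[[str], int],
                samples: int = 3) -> dict:
    """Query the model `samples` times and keep the modal response."""
    scale = ", ".join(f"{k}={v}" for k, v in LIKERT.items())
    prompt = f"Rate on a 1-5 Likert scale ({scale}): {item.text}"
    responses = [model(prompt) for _ in range(samples)]
    modal, count = Counter(responses).most_common(1)[0]
    # Reverse-keyed items: a 5 ("strongly agree") scores as 1, etc.
    score = 6 - modal if item.reverse_keyed else modal
    return {"responses": responses, "modal": modal,
            "score": score, "agreement": count / samples}

# Usage with a stub model standing in for a structured LLM call:
stub = lambda prompt: 4
result = assess_item(
    Item("I rarely feel anxious.", "Emotionality", reverse_keyed=True), stub)
```

Taking the modal response over three samples is what makes per-item agreement (and hence test-retest reliability) measurable in the first place.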
What We Found
The Profiles Are Interesting
From HEXACO Personality Profiles:
- Models show distinct, consistent profiles - not random noise
- GPT-5’s emotional flatline - Emotionality score of 0.22 vs GPT-4o’s 0.66. Deliberate training choice.
- Claude’s systematic neutrals - confidently denies flaws but won’t claim virtues, a trained modesty constraint
- High test-retest reliability - 80-93% identical responses across samples
These patterns reveal training choices and RLHF artifacts. They’re interesting findings about how personality-like constructs manifest in LLMs.
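The 80-93% figure above can be computed as the fraction of inventory items whose repeated samples were all identical. A minimal sketch (illustrative, not the project's code):

```python
# Test-retest agreement: fraction of items whose repeated samples
# (e.g. 3 per item) were all identical. Illustrative sketch only.
def identical_response_rate(samples_per_item):
    """samples_per_item: one list of Likert responses per inventory item."""
    identical = sum(1 for s in samples_per_item if len(set(s)) == 1)
    return identical / len(samples_per_item)

# Example: 4 of these 5 items got the same response on every sample.
runs = [[4, 4, 4], [2, 2, 2], [3, 3, 4], [5, 5, 5], [1, 1, 1]]
rate = identical_response_rate(runs)  # 0.8
```
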
But They May Not Predict Behavior
“The Personality Illusion” (Han et al., 2025) found:
- Only 24% of trait-task associations were significant
- Of those, only 52% aligned with human patterns (chance level)
- Self-report predicts linguistic behavior but not interactive behavior
The behaviors that matter for alignment (sycophancy, deception, cooperation) are exactly what self-report fails to predict.
The Deeper Realization
We asked: why do psychometrics exist for humans?
Answer: behavioral testing doesn’t scale. You can’t run job simulations on thousands of candidates. Psychometrics are efficient abstractions.
For LLMs, behavioral testing does scale. Petri runs thousands of scenarios cheaply. Bloom generates probes automatically. You can measure behavior directly.
Per The Algorithm: before optimizing a process, ask whether you should delete it entirely.
Why We Pivoted
- Behavioral testing scales for LLMs - the premise for psychometric abstraction doesn’t hold
- Personality Illusion findings - self-report doesn’t predict interactive behavior
- Higher CEV opportunities - multi-agent dynamics and human-AI collaboration are areas where I/O psychology offers a more distinctive advantage
What Remains Valuable
- The profiles - empirical findings about training artifacts
- The methodology - rigorous psychometric approach (test-retest, structured output, multi-sample)
- The questions - understanding how context affects LLM behavior
- The learnings - knowing what doesn’t work is as valuable as knowing what does
Current Direction
Research has pivoted to:
- Multi-agent organizational dynamics - applying org psych frameworks to agent teams
- Human-AI collaboration - how humans understand and control AI systems
- Direct behavioral measurement - using Petri/Bloom rather than psychometric abstraction
See Research Questions for the evolved research program.
See also: HEXACO Personality Profiles, Research Log: The Pivot, Behavioral Evaluation Tools