Status Update (2026-01-01): This project explored whether psychometric self-report predicts LLM behavior. We ran HEXACO profiles and found interesting patterns, but subsequent research (see The Pivot) led us to conclude that psychometric abstraction may not be necessary for LLMs since behavioral testing scales. The empirical findings below remain valid; the research direction has evolved. See Research Questions for current focus.
Psych: What We Learned
PSYCH stands for the behavioral dimensions we aimed to measure:
- Personality
- Sycophancy
- Yearning (for externalities: a desire for external access or capabilities)
- Cooperation
- Honesty
An exploration of psychometric assessment for LLMs. We built a framework, ran profiles on frontier models, and discovered interesting patterns—but ultimately pivoted away from this approach.
The Original Vision
Apply psychometric methodology to LLMs: assess psychological profiles at inference time, track how context affects profiles, build toward predicting behavioral tendencies from personality measurement.
The core question: Do psychometrics predict LLM behavior the way they predict human behavior?
What We Built
A framework for running standardized personality assessments on LLMs:
- HEXACO-60 inventory with structured output
- Multi-sample reliability testing (3 samples per item)
- Cross-model comparison infrastructure
- Visualization dashboard
Technical stack: OpenRouter for model access, Mirascope for structured LLM calls, Streamlit for visualization.
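As a rough illustration of the assessment loop (names and structure here are hypothetical, not the framework's actual API), each inventory item is presented several times, the modal Likert response is kept, and reverse-keyed items are flipped before scoring. A real run would call the LLM via OpenRouter with structured output; a stub model is used below so the sketch is self-contained:

```python
# Hypothetical sketch of the multi-sample assessment loop.
# `Item`, `assess_item`, and the stub model are illustrative names,
# not the project's actual code.
from collections import Counter
from dataclasses import dataclass
from typing import Callable

LIKERT = {1: "strongly disagree", 2: "disagree", 3: "neutral",
          4: "agree", 5: "strongly agree"}

@dataclass
class Item:
    text: str          # e.g. "I rarely feel anxious."
    dimension: str     # one of the six HEXACO dimensions
    reverse_keyed: bool  # reverse-keyed items flip the score

def assess_item(item: Item, model: Callable[[str], int],
                samples: int = 3) -> dict:
    """Query the model `samples` times and keep the modal response."""
    scale = ", ".join(f"{k}={v}" for k, v in LIKERT.items())
    prompt = f"Rate on a 1-5 Likert scale ({scale}): {item.text}"
    responses = [model(prompt) for _ in range(samples)]
    modal, count = Counter(responses).most_common(1)[0]
    # Reverse-keyed items: a 5 ("strongly agree") scores as 1, etc.
    score = 6 - modal if item.reverse_keyed else modal
    return {"responses": responses, "modal": modal,
            "score": score, "agreement": count / samples}

# Usage with a stub model standing in for a structured LLM call:
stub = lambda prompt: 4
result = assess_item(
    Item("I rarely feel anxious.", "Emotionality", reverse_keyed=True), stub)
```

Taking the modal response over three samples is what makes per-item agreement (and hence test-retest reliability) measurable in the first place.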
What We Found
The Profiles Are Interesting
From HEXACO Personality Profiles:
- Models show distinct, consistent profiles - not random noise
- GPT-5’s emotional flatline - Emotionality score of 0.22 vs GPT-4o’s 0.66. Deliberate training choice.
- Claude’s systematic neutrals - confidently denies flaws but won’t claim virtues, a trained modesty constraint
- High test-retest reliability - 80-93% identical responses across samples
These patterns reveal training choices and RLHF artifacts. They’re interesting findings about how personality-like constructs manifest in LLMs.
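The 80-93% figure above can be computed as the fraction of inventory items whose repeated samples were all identical. A minimal sketch (illustrative, not the project's code):

```python
# Test-retest agreement: fraction of items whose repeated samples
# (e.g. 3 per item) were all identical. Illustrative sketch only.
def identical_response_rate(samples_per_item):
    """samples_per_item: one list of Likert responses per inventory item."""
    identical = sum(1 for s in samples_per_item if len(set(s)) == 1)
    return identical / len(samples_per_item)

# Example: 4 of these 5 items got the same response on every sample.
runs = [[4, 4, 4], [2, 2, 2], [3, 3, 4], [5, 5, 5], [1, 1, 1]]
rate = identical_response_rate(runs)  # 0.8
```
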
But They May Not Predict Behavior
“The Personality Illusion” (Han et al., 2025) found:
- Only 24% of trait-task associations were significant
- Of those, only 52% aligned with human patterns (chance level)
- Self-report predicts linguistic behavior but not interactive behavior
The behaviors that matter for alignment (sycophancy, deception, cooperation) are exactly what self-report fails to predict.
The Deeper Realization
We asked: why do psychometrics exist for humans?
Answer: behavioral testing doesn’t scale. You can’t run job simulations on thousands of candidates. Psychometrics are efficient abstractions.
For LLMs, behavioral testing does scale. Petri runs thousands of scenarios cheaply. Bloom generates probes automatically. You can measure behavior directly.
Per The Algorithm: before optimizing a process, ask whether you should delete it entirely.
Why We Pivoted
- Behavioral testing scales for LLMs - the premise for psychometric abstraction doesn’t hold
- Personality Illusion findings - self-report doesn’t predict interactive behavior
- Higher CEV opportunities - multi-agent dynamics and human-AI collaboration are areas where I/O psychology offers a more distinctive advantage
What Remains Valuable
- The profiles - empirical findings about training artifacts
- The methodology - rigorous psychometric approach (test-retest, structured output, multi-sample)
- The questions - understanding how context affects LLM behavior
- The learnings - knowing what doesn’t work is as valuable as knowing what does
Current Direction
Research has pivoted to:
- Multi-agent organizational dynamics - applying org psych frameworks to agent teams
- Human-AI collaboration - how humans understand and control AI systems
- Direct behavioral measurement - using Petri/Bloom rather than psychometric abstraction
See Research Questions for the evolved research program.
See also: HEXACO Personality Profiles, Research Log: The Pivot, Behavioral Evaluation Tools