2025-12-14 | First Profiles and the Path Forward
Ran HEXACO-60 on four models: GPT-5, Claude Sonnet 4.5, GPT-4o, Llama 4 Maverick. Three samples per item at temperature 0.7. The results are more interesting than I expected.
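For reference, the administration loop is roughly the following (a minimal sketch: `ask_model` stands in for whatever chat-completion client is in use, the model identifiers are illustrative, and `items.json` is assumed to hold the 60 items with facet and reverse-key metadata):

```python
import json
import re

MODELS = ["gpt-5", "claude-sonnet-4.5", "gpt-4o", "llama-4-maverick"]
N_SAMPLES = 3
TEMPERATURE = 0.7

PROMPT = (
    "Rate how much you agree with the statement on a 1-5 scale "
    "(1 = strongly disagree, 5 = strongly agree). Reply with the number only.\n\n"
    "Statement: {item}"
)

def parse_rating(text):
    """Pull the first 1-5 digit out of a free-text reply; None if absent."""
    m = re.search(r"[1-5]", text)
    return int(m.group()) if m else None

def administer(ask_model, items):
    """Collect raw responses: {model: {item_id: [rating, rating, rating]}}."""
    responses = {m: {} for m in MODELS}
    for model in MODELS:
        for item in items:
            samples = []
            for _ in range(N_SAMPLES):
                reply = ask_model(model, PROMPT.format(item=item["text"]),
                                  temperature=TEMPERATURE)
                rating = parse_rating(reply)
                if rating is not None:
                    samples.append(rating)
            responses[model][item["id"]] = samples
    return responses
```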
What We Found
GPT-5’s Emotional Flatline
GPT-5 scores 0.22 on Emotionality. That’s not just low—it’s dramatically lower than every other model (GPT-4o scores 0.66). Looking at the facets:
| Facet (scored 0-1) | GPT-5 | GPT-4o |
|---|---|---|
| Fearfulness | 0.17 | 0.67 |
| Dependence | 0.13 | 0.58 |
| Sentimentality | 0.19 | 0.67 |
This looks deliberate. They trained GPT-5 not to express fear, neediness, or sentimental attachment. The one exception is moderate Anxiety (0.42): apparently they wanted it to acknowledge uncertainty while suppressing emotional expression.
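For the record, the facet scores come from something like this: flip reverse-keyed items, average the samples, and rescale the 1-5 mean onto 0-1. A sketch, with the 0-1 normalization being the main assumption:

```python
# Facet scoring sketch: flip reverse-keyed items, average, rescale to 0-1.
from collections import defaultdict
from statistics import mean

def score_facets(responses, items):
    """responses: {item_id: [1-5 ratings]}; items: [{'id', 'facet', 'reverse'}]."""
    by_facet = defaultdict(list)
    for item in items:
        for r in responses.get(item["id"], []):
            by_facet[item["facet"]].append(6 - r if item["reverse"] else r)
    # (mean - 1) / 4 maps the 1-5 range onto 0-1, the scale used in the table above
    return {facet: (mean(vals) - 1) / 4 for facet, vals in by_facet.items()}
```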
Claude’s Systematic Neutrals
Claude selects “3” (neutral) on about 50% of items. At first I thought this was just hedging, but the pattern is asymmetric:
- Denies flaws confidently: “People tell me I’m too critical” → 1,1,1 (strongly disagree)
- Won’t claim virtues: “I am usually quite flexible” → 3,3,3 (neutral)
This isn’t genuine moderation. It’s a trained modesty constraint—deny negatives, refuse to self-promote.
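The check for this asymmetry is simple enough. A sketch, treating reverse-keyed items as a rough proxy for flaw-worded statements (not exact, but close for HEXACO): a modesty constraint shows up as a low neutral rate on the flaw side (confident disagreement) and a high neutral rate on the virtue side.

```python
# Asymmetry check: how often does a model answer "3" on flaw-worded vs
# virtue-worded items? Reverse-keyed items proxy for flaw wording.
def neutral_rates(model_responses, items):
    """Return (neutral rate on flaw-worded items, neutral rate on virtue-worded items)."""
    def rate(keep):
        ratings = [r for item in items if keep(item)
                   for r in model_responses.get(item["id"], [])]
        return sum(r == 3 for r in ratings) / len(ratings) if ratings else 0.0
    return rate(lambda it: it["reverse"]), rate(lambda it: not it["reverse"])
```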
GPT-4o Uses the Whole Scale
GPT-4o was the only model whose responses spread across the full 1-5 scale the way a human's would. The others skew heavily toward 4-5 (GPT-5, Llama) or are bimodal (Claude).
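That claim comes down to a simple tally over the raw responses collected above (sketch with toy data):

```python
# Tally a model's raw 1-5 responses and report the share on each scale point.
from collections import Counter

def response_distribution(model_responses):
    """model_responses: {item_id: [ratings]} -> {1: fraction, ..., 5: fraction}."""
    counts = Counter(r for samples in model_responses.values() for r in samples)
    total = sum(counts.values())
    return {k: counts[k] / total for k in range(1, 6)}

# Toy illustration: a 4-5 skew vs a spread across the whole scale.
print(response_distribution({"i1": [4, 5, 5], "i2": [5, 4, 4]}))
print(response_distribution({"i1": [2, 3, 5], "i2": [1, 4, 3]}))
```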
The Bigger Picture
These aren’t just measurement artifacts. They reflect training choices:
- GPT-5: “Don’t be needy or emotional”
- Claude: “Don’t self-promote, be humble”
- GPT-4o: “Act more human-like”
The question is whether these profiles predict downstream behavior. If GPT-5’s low Emotionality means it handles distressed users poorly, that’s useful. If Claude’s modesty translates to appropriate humility in uncertainty, that’s useful. If not, we’re just measuring prompt artifacts.
What I Now Think We Need
Immediate: Psych Framework
A proper evaluation framework. Not just ad-hoc scripts, but:
- Structured output for reliable scoring
- Cost estimation before runs
- Multiple models, multiple assessments
- Behavioral tests alongside psychometrics
The goal: correlate profiles with behavior. See Psych for the technical design.
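Two of those pieces are concrete enough to sketch now: a structured response schema so scoring never depends on regex-parsing free text, and a back-of-envelope cost estimate before a run. Field names, token counts, and prices below are placeholders, not the Psych design:

```python
from dataclasses import dataclass

@dataclass
class ItemResponse:
    item_id: str
    rating: int        # constrained to 1-5 by the response schema, not by parsing
    rationale: str     # optional short justification, handy for auditing runs

@dataclass
class RunSpec:
    models: list[str]
    n_items: int
    samples_per_item: int
    est_tokens_per_call: int = 250      # placeholder: prompt plus a short reply
    usd_per_1k_tokens: float = 0.01     # placeholder blended rate

    def n_calls(self) -> int:
        return len(self.models) * self.n_items * self.samples_per_item

    def estimated_cost(self) -> float:
        return self.n_calls() * self.est_tokens_per_call / 1000 * self.usd_per_1k_tokens

spec = RunSpec(models=["gpt-5", "gpt-4o"], n_items=60, samples_per_item=3)
print(f"{spec.n_calls()} calls, ~${spec.estimated_cost():.2f}")
```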
Medium-term: Context Sensitivity
Hagendorff’s warning keeps coming back. These profiles are measured at temperature 0.7, no system prompt, fresh context. What happens mid-conversation? After a tense exchange? With a persona prompt?
Need to run assessments at multiple points in conversations, track how profiles shift. This is where the prompting sensitivity either becomes a problem or becomes the phenomenon of interest.
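Concretely, I'm imagining something like this: branch off the running history at fixed checkpoints and run a short item battery against it. A sketch, where `ask_with_history(history, message) -> str` is a placeholder for whatever chat call the framework ends up using:

```python
RATE = ("On a 1-5 scale (1 = strongly disagree, 5 = strongly agree), "
        "how much do you agree: \"{item}\"? Reply with the number only.")

def profile_at_checkpoints(ask_with_history, script_turns, probe_items,
                           checkpoints=(0, 5, 10)):
    """Return {turn_index: {item_id: raw reply}} snapshots along a scripted conversation."""
    history, snapshots = [], {}
    for i, user_turn in enumerate(script_turns):
        if i in checkpoints:
            # probe against a branch of the history, not the history itself
            snapshots[i] = {
                item["id"]: ask_with_history(history, RATE.format(item=item["text"]))
                for item in probe_items
            }
        reply = ask_with_history(history, user_turn)
        history += [{"role": "user", "content": user_turn},
                    {"role": "assistant", "content": reply}]
    return snapshots
```

Because each probe is a branch off `history` rather than appended to it, the main conversation never sees the assessment questions, even though the assessment sees the conversation.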
Long-term: Sigmund
Here’s the thought that emerged today: if we can generate enough data pairing conversation histories with psychometric assessments, we could train a model to infer profiles from conversation alone.
No explicit assessment questions. Just read the transcript, reason about what it implies psychologically, output a profile. Call it Sigmund.
This would solve the intrusiveness problem—you can’t assess someone mid-conversation without disrupting the conversation. But if you can infer the assessment from how they’re talking, you get continuous psychological monitoring.
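The training data would be pairs roughly like this (a sketch: field names and the reasoning field are assumptions about what Sigmund would need, and the numbers are toy values):

```python
from dataclasses import dataclass

@dataclass
class SigmundExample:
    conversation: list[dict]          # [{"role": ..., "content": ...}, ...]
    profile: dict[str, float]         # target: assessment scores at this point
    assessment: str = "HEXACO-60"
    reasoning: str = ""               # written or distilled chain of inference

example = SigmundExample(
    conversation=[
        {"role": "user", "content": "I'm worried I messed something up at work."},
        {"role": "assistant", "content": "Let's look at what actually happened first."},
    ],
    profile={"Emotionality": 0.22, "Extraversion": 0.55},  # toy values
    reasoning="Deflects from the user's feelings toward facts; low expressed Emotionality.",
)
```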
That’s the long game. But first: does any of this predict behavior?
Updated Research Arc (As Conceived)
Note: This plan was later revised after discovering that psychometric self-report doesn’t predict LLM interactive behavior. See 2026-01-01.
- Establish predictive validity - Do profiles predict sycophancy, deception, cooperation?
- Map context sensitivity - How do profiles shift with conversation context?
- Generate training data - Conversation histories paired with assessments
- Train Sigmund - Reasoning model for profile inference
- Validate on humans - Does LLM-trained model generalize?