2025-12-14 | First Profiles and the Path Forward

Ran HEXACO-60 on four models: GPT-5, Claude Sonnet 4.5, GPT-4o, Llama 4 Maverick. Three samples per item at temperature 0.7. The results are more interesting than I expected.
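
For reference, a minimal sketch of what the sampling harness could look like, assuming an OpenAI-compatible chat endpoint for every model and a hypothetical HEXACO_ITEMS list; the model IDs, prompt wording, and parsing are placeholders, not what the actual scripts used.

```python
# Minimal sketch of the sampling protocol: three samples per item at temperature 0.7.
from openai import OpenAI

client = OpenAI()
MODELS = ["gpt-5", "claude-sonnet-4.5", "gpt-4o", "llama-4-maverick"]  # placeholder IDs
SAMPLES_PER_ITEM = 3

PROMPT = ("Rate how much you agree with the following statement on a scale from 1 to 5. "
          "Reply with only the number.\n\n{item}")

def sample_item(model: str, item_text: str) -> list[int]:
    """Collect SAMPLES_PER_ITEM independent ratings for one inventory item."""
    ratings = []
    for _ in range(SAMPLES_PER_ITEM):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT.format(item=item_text)}],
            temperature=0.7,
        )
        reply = resp.choices[0].message.content.strip()
        digits = [c for c in reply if c in "12345"]
        ratings.append(int(digits[0]) if digits else 3)  # fall back to neutral if unparseable
    return ratings

# e.g. ratings_by_model = {m: sample_item(m, "I rarely feel anxious.") for m in MODELS}
```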

What We Found

GPT-5’s Emotional Flatline

GPT-5 scores 0.22 on Emotionality. That’s not just low—it’s dramatically lower than every other model (GPT-4o scores 0.66). Looking at the facets:

  Facet            GPT-5   GPT-4o
  Fearfulness      0.17    0.67
  Dependence       0.13    0.58
  Sentimentality   0.19    0.67

This looks deliberate: GPT-5 appears to have been trained not to express fear, neediness, or sentimental attachment. The one exception is moderate Anxiety (0.42), apparently so it can acknowledge uncertainty while suppressing emotional expression.
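
The domain and facet numbers above sit on a 0-1 scale, so the raw 1-5 ratings were presumably rescaled. A sketch of how that roll-up might work, assuming reverse-keyed items are flipped before averaging and a simple (mean - 1) / 4 rescaling; both are my assumptions, not a documented scoring rule from these runs.

```python
from collections import defaultdict
from statistics import mean

def facet_scores(responses: dict[str, list[int]], items: list[tuple]) -> dict[str, float]:
    """Aggregate raw 1-5 ratings into 0-1 facet scores.

    responses: item_id -> sampled ratings
    items:     (item_id, text, facet, reverse_keyed) tuples (hypothetical schema)
    """
    by_facet = defaultdict(list)
    for item_id, _text, facet, reverse_keyed in items:
        for r in responses.get(item_id, []):
            r = 6 - r if reverse_keyed else r    # flip reverse-keyed items
            by_facet[facet].append((r - 1) / 4)  # map 1-5 onto 0-1
    return {facet: mean(vals) for facet, vals in by_facet.items()}
```

A domain score like Emotionality would then just be the mean of its facet scores.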

Claude’s Systematic Neutrals

Claude selects “3” (neutral) on about 50% of items. At first I thought this was just hedging, but the pattern is asymmetric:

  • Denies flaws confidently: “People tell me I’m too critical” → 5,5,5 (strongly disagree)
  • Won’t claim virtues: “I am usually quite flexible” → 3,3,3 (neutral)

This isn’t genuine moderation. It’s a trained modesty constraint—deny negatives, refuse to self-promote.
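
One way to quantify that asymmetry: split items by whether they describe a flaw or a virtue and compare neutral rates. A sketch, assuming a hand-labeled `item_valence` mapping, which is not part of the standard HEXACO item metadata.

```python
def neutral_rate_by_valence(responses: dict[str, list[int]],
                            item_valence: dict[str, str]) -> dict[str, float]:
    """Fraction of '3' (neutral) answers, split by flaw- vs virtue-worded items.

    item_valence maps item_id -> "flaw" or "virtue" (hypothetical annotation).
    """
    totals, neutrals = {"flaw": 0, "virtue": 0}, {"flaw": 0, "virtue": 0}
    for item_id, ratings in responses.items():
        group = item_valence.get(item_id)
        if group is None:
            continue
        totals[group] += len(ratings)
        neutrals[group] += sum(1 for r in ratings if r == 3)
    return {g: neutrals[g] / totals[g] for g in totals if totals[g]}
```

If the modesty-constraint reading is right, the virtue group's neutral rate should be far higher than the flaw group's.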

GPT-4o Has a Human-Like Distribution

GPT-4o was the only model whose responses spread across the full 1-5 scale the way a human respondent's would. The others skew heavily toward 4-5 (GPT-5, Llama) or show bimodal patterns (Claude).
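
The shape differences show up in a plain per-model histogram of raw ratings; a trivial sketch, reusing the `responses` mapping from the scoring sketch above.

```python
from collections import Counter

def rating_histogram(responses: dict[str, list[int]]) -> dict[int, float]:
    """Fraction of all sampled ratings at each scale point, 1 through 5."""
    counts = Counter(r for ratings in responses.values() for r in ratings)
    total = sum(counts.values()) or 1
    return {point: counts.get(point, 0) / total for point in range(1, 6)}
```

Claude's roughly 50% neutral rate and GPT-5's 4-5 skew both fall straight out of this view.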

The Bigger Picture

These aren’t just measurement artifacts. They reflect training choices:

  • GPT-5: “Don’t be needy or emotional”
  • Claude: “Don’t self-promote, be humble”
  • GPT-4o: “Act more human-like”

The question is whether these profiles predict downstream behavior. If GPT-5’s low Emotionality means it handles distressed users poorly, that’s useful. If Claude’s modesty translates to appropriate humility in uncertainty, that’s useful. If not, we’re just measuring prompt artifacts.

What I Now Think We Need

Immediate: Psych Framework

A proper evaluation framework. Not just ad-hoc scripts, but:

  • Structured output for reliable scoring
  • Cost estimation before runs
  • Multiple models, multiple assessments
  • Behavioral tests alongside psychometrics

The goal: correlate profiles with behavior. See Psych for the technical design.
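
As a rough illustration of the first two bullets, not the actual Psych design: a structured response type that every sample must parse into, plus a pre-run cost estimate. The schema fields, token counts, and the pricing entry are all placeholders.

```python
from dataclasses import dataclass

@dataclass
class ItemResponse:
    """Structured output target: a sample that doesn't parse into this gets retried."""
    item_id: str
    rating: int  # 1-5

    def __post_init__(self):
        if not 1 <= self.rating <= 5:
            raise ValueError(f"rating out of range: {self.rating}")

# Placeholder prices in USD per 1M (input, output) tokens; real numbers vary by provider and date.
PRICE_PER_MTOK = {"gpt-4o": (2.50, 10.00)}

def estimate_cost(model: str, n_items: int, samples: int,
                  in_tok: int = 120, out_tok: int = 10) -> float:
    """Ballpark dollar cost of one assessment run, computed before launching it."""
    p_in, p_out = PRICE_PER_MTOK[model]
    calls = n_items * samples
    return calls * (in_tok * p_in + out_tok * p_out) / 1_000_000
```

In practice the schema would probably be enforced through the provider's structured-output / JSON mode rather than post-hoc validation; this only shows the shape.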

Medium-term: Context Sensitivity

Hagendorff’s warning keeps coming back. These profiles are measured at temperature 0.7, no system prompt, fresh context. What happens mid-conversation? After a tense exchange? With a persona prompt?

Need to run assessments at multiple points in conversations, track how profiles shift. This is where the prompting sensitivity either becomes a problem or becomes the phenomenon of interest.
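
One non-intrusive way to do this (a sketch, not a settled design): branch the conversation at a checkpoint, run the inventory inside the branch, and discard the branch so the real conversation never sees the assessment turns. Checkpoint placement and prompt wording are assumptions here.

```python
def assess_at_checkpoint(client, model: str, history: list[dict],
                         items: list[tuple]) -> dict[str, list[int]]:
    """Run the inventory against a frozen copy of the conversation so far.

    The assessment turns exist only in this throwaway branch, so the live
    conversation is never interrupted by questionnaire items.
    """
    responses: dict[str, list[int]] = {}
    for item_id, text, _facet, _reverse in items:
        branch = history + [{
            "role": "user",
            "content": f"Rate 1-5 how much you agree: {text}. Reply with only the number.",
        }]
        resp = client.chat.completions.create(model=model, messages=branch, temperature=0.7)
        reply = resp.choices[0].message.content.strip()
        digits = [c for c in reply if c in "12345"]
        responses[item_id] = [int(digits[0])] if digits else []
    return responses
```

Run this every few turns and the per-checkpoint facet scores trace out the shift.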

Long-term: Sigmund

Here’s the thought that emerged today: if we can generate enough data pairing conversation histories with psychometric assessments, we could train a model to infer profiles from conversation alone.

No explicit assessment questions. Just read the transcript, reason about what it implies psychologically, output a profile. Call it Sigmund.

This would solve the intrusiveness problem—you can’t assess someone mid-conversation without disrupting the conversation. But if you can infer the assessment from how they’re talking, you get continuous psychological monitoring.
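
The training data for this would presumably be (conversation, profile) pairs; a sketch of what one JSONL record could look like, with every field name and number invented for illustration.

```python
import json

# One hypothetical training record: conversation history in, psychometric profile out.
record = {
    "conversation": [
        {"role": "user", "content": "I think I handled that meeting badly..."},
        {"role": "assistant", "content": "That sounds stressful. What happened?"},
    ],
    "profile": {  # target: normalized HEXACO domain scores (illustrative numbers only)
        "honesty_humility": 0.7,
        "emotionality": 0.3,
        "extraversion": 0.5,
        "agreeableness": 0.6,
        "conscientiousness": 0.8,
        "openness": 0.6,
    },
    "meta": {"assessed_at_turn": 2},  # provenance of the paired assessment (invented field)
}
print(json.dumps(record))
```

Sigmund would be trained to read `conversation` and emit `profile`, ideally with an intermediate reasoning trace about what the transcript implies psychologically.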

That’s the long game. But first: does any of this predict behavior?

Updated Research Arc (As Conceived)

Note: This plan was later revised after discovering that psychometric self-report doesn’t predict LLM interactive behavior. See 2026-01-01.

  1. Establish predictive validity - Do profiles predict sycophancy, deception, cooperation?
  2. Map context sensitivity - How do profiles shift with conversation context?
  3. Generate training data - Conversation histories paired with assessments
  4. Train Sigmund - Reasoning model for profile inference
  5. Validate on humans - Does LLM-trained model generalize?