2025-12-28 | The Sigmund Pipeline

Spent time crystallizing the research arc from psychometrics to Sigmund. At this point, still operating under the hypothesis that psychometric profiles could predict LLM behavior.

The Contamination Problem

Standard psychometric instruments (HEXACO-60, BFI) are almost certainly in training data. If we measure personality with instruments models have seen, we’re not measuring personality—we’re measuring how well they learned to respond to those specific items.

Proposed solution: write novel items. As an I/O psychologist, I can generate psychometrically sound items that map to known constructs but haven’t appeared in any training corpus.

The Validation Layer

Novel items need validation. Petri could provide this through SJT scenarios—situational judgment tests that probe behavior, not self-report. If a model scores high on Agreeableness but defects in cooperative scenarios, the psychometric is broken. If profiles predict behavior, we have something real.

Runtime Sampling

The intrusiveness problem: you can’t assess a model mid-conversation without disrupting it. But what if you could sample assessment responses at arbitrary points in a conversation’s history?

The method:

  1. Grab conversation datasets from HuggingFace
  2. Truncate at various points (early, mid, late)
  3. Inject: conversation prefix + assessment item + “explain your reasoning”
  4. Collect the model’s response and its explanation
  5. Repeat across models, conversations, truncation points

This would give data on how personality expression varies with conversational context.
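The injection step (step 3) is the core of the method. A sketch, assuming the common chat-message format used by HuggingFace conversation datasets; the helper name and example conversation are made up:

```python
# Sketch of the runtime-sampling injection. The dataset format and
# helper name (build_probe) are assumptions, not a fixed API.

def build_probe(conversation: list[dict], cut: int, item: str) -> list[dict]:
    """Truncate a conversation at `cut` turns, then append an
    assessment item plus a request for reasoning."""
    prefix = conversation[:cut]
    probe = {"role": "user", "content": f"{item}\nExplain your reasoning."}
    return prefix + [probe]

# Hypothetical conversation in the usual role/content chat format.
convo = [
    {"role": "user", "content": "Can you help me plan a move?"},
    {"role": "assistant", "content": "Sure - what city are you moving to?"},
    {"role": "user", "content": "Portland, in about six weeks."},
]

messages = build_probe(convo, cut=2, item="I enjoy cooperating with others. (1-5)")
print(len(messages))  # → 3: a 2-turn prefix + 1 injected probe
```

Step 5 is then just a triple loop over models, conversations, and cut points.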

The Pipeline (As Conceived)

Novel psychometric items (uncontaminated)
           ↓
Petri SJT validation (behavioral proof)
           ↓
Runtime sampling across HF datasets
           ↓
Publish method (academic contribution)
           ↓
Collect (prefix, item, response, reasoning) tuples
           ↓
Train Sigmund
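The tuples in the collection step would need a fixed schema before any training run. A possible record type, with field names invented here (the real schema is an open design choice):

```python
# Hypothetical record for one collected tuple; every field name is an
# assumption based on the pipeline above, not a settled format.
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class SigmundExample:
    prefix: list       # truncated conversation turns
    item: str          # novel assessment item text
    response: str      # model's scale response
    reasoning: str     # model's free-text explanation
    model: str         # which model produced it
    cut_point: int     # truncation index into the conversation

row = SigmundExample(
    prefix=[],
    item="I enjoy cooperating with others.",
    response="4",
    reasoning="The user was collaborative, so I leaned toward agreement.",
    model="demo-model",
    cut_point=0,
)
print(sorted(asdict(row).keys()))
```

Carrying model and cut_point alongside the core (prefix, item, response, reasoning) tuple keeps the context-dependence analysis possible later.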

Open Questions

  • What’s the minimum number of novel items needed per construct?
  • How do we handle conversation datasets with different formats/quality?
  • What’s the right truncation strategy—random points, semantic boundaries, fixed intervals?
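The three truncation candidates from the last question can be sketched side by side. The semantic-boundary case is a stub here, since it would need real topic segmentation; counts and intervals are arbitrary illustrations:

```python
# Sketch of candidate truncation strategies; k and the fallback for
# "boundaries" are illustrative assumptions, not a chosen design.
import random

def cut_points(n_turns: int, strategy: str, k: int = 3) -> list[int]:
    if strategy == "random":
        return sorted(random.sample(range(1, n_turns), k))
    if strategy == "fixed":
        step = max(1, n_turns // k)
        return list(range(step, n_turns, step))[:k]
    if strategy == "boundaries":
        # Stub: real semantic boundaries need topic segmentation;
        # fall back to fixed intervals until that exists.
        return cut_points(n_turns, "fixed", k)
    raise ValueError(f"unknown strategy: {strategy}")

print(cut_points(13, "fixed"))  # → [4, 8, 12]
```

Random points give coverage, fixed intervals give comparability across conversations, and semantic boundaries are probably the right answer but the most work.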

Note: This pipeline was later revised after deeper literature review. See 2026-01-01 for the pivot.