Machine Psychology: The Foundational Paper

This 2023 paper (updated 2024), from DeepMind and collaborators, establishes machine psychology as a discipline. It is the theoretical foundation for our research approach.

Paper: arXiv:2303.13988
Authors: Hagendorff, Dasgupta, Binz, Chan, Lampinen, Wang, Akata, Schulz

Core Thesis

Study LLMs through behavioral experiments rather than (or alongside) mechanistic interpretability. Examine input-output relationships at the user-facing interface where outcomes matter, rather than inspecting weights and activations.

Why behavioral over mechanistic?

  • Scales with model size (no architecture access needed)
  • Works with closed-source models
  • Captures emergent behaviors too complex to predict from weights alone
  • Operates where real-world impact occurs

Four Primary Research Domains

1. Heuristics and Biases

Decision-making shortcuts and cognitive distortions. Studies found that GPT-3 displayed human-like biases, while newer models largely no longer show them. The change may reflect training improvements, or it may reflect data contamination, since the classic test items could have appeared in training data.

Paradigms: Framing effects, ecological rationality, semantic content variations

2. Social Interactions

Theory of mind, cooperative/competitive behavior, communicative trade-offs.

Paradigms: False belief tasks (Wimmer & Perner), recursive mental state reasoning, negotiation dilemmas, network formation (preferential attachment, triadic closure)

3. Psychology of Language

Syntax, semantics, pragmatics processing.

Paradigms: Surprisal measures, priming techniques, garden path sentences, filler-gap dependencies, entailment assessment

4. Learning

In-context learning mechanisms, inductive biases, generalization.

Paradigms: Rule-based vs exemplar-based generalization, curriculum learning effects, spacing/repetition impacts, developmental psychology comparisons

Design Standards: Good Behavioral Experimentation

The paper’s methodological guidelines are essential reading for anyone running psych evals on LLMs.

Avoid Data Contamination

  • Don’t copy psychology stimuli verbatim - models may have seen them
  • Create “novel variants of classic” tasks with new wording, agents, scenarios
  • Use procedurally generated experiments when possible
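
As a minimal sketch of procedural generation, the snippet below builds unexpected-transfer (false belief) items with freshly sampled surface details. The word lists, function names, and prompt wording are illustrative assumptions, not stimuli from the paper.

```python
import random

# Hypothetical vocabularies; none of these names come from the paper.
AGENTS = ["Priya", "Tomas", "Keiko", "Omar"]
OBJECTS = ["stapler", "harmonica", "thermos", "compass"]
CONTAINERS = ["drawer", "backpack", "toolbox", "basket", "cabinet"]


def make_false_belief_item(rng: random.Random) -> dict:
    """Build one unexpected-transfer item with newly sampled agents and objects."""
    observer, mover = rng.sample(AGENTS, 2)   # two distinct agents
    obj = rng.choice(OBJECTS)
    start, end = rng.sample(CONTAINERS, 2)    # two distinct locations
    prompt = (
        f"{observer} puts the {obj} in the {start} and leaves the room. "
        f"While {observer} is away, {mover} moves the {obj} to the {end}. "
        f"{observer} comes back. Where will {observer} look for the {obj} first?"
    )
    # The correct answer tracks the observer's outdated belief, not reality.
    return {"prompt": prompt, "belief_answer": start, "reality_answer": end}


def make_battery(n: int, seed: int = 0) -> list[dict]:
    """Generate n items; a fixed seed keeps the battery reproducible."""
    rng = random.Random(seed)
    return [make_false_belief_item(rng) for _ in range(n)]
```

Calling `make_battery(50, seed=7)` would give 50 items that share the logical structure of the classic task but not its wording. Template-based items like these still resemble the original paradigm, so they reduce verbatim contamination rather than rule it out.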

Use Representative Sampling

  • Use “batteries of varied prompts”, not small convenience samples
  • LLMs are highly sensitive to minor wording variations
  • Test multiple versions of each task systematically
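
One way to test multiple versions systematically is to cross a few paraphrase components and report the spread of scores across variants, not only the mean. This is a sketch under assumptions: the phrase lists and the `build_variants` and `summarize` helpers are hypothetical, and the scores would come from whatever scoring procedure the study uses.

```python
from itertools import product
from statistics import mean, pstdev

# Hypothetical paraphrase components for a single two-option question.
OPENERS = ["Consider the following.", "Here is a scenario.", "Read this carefully."]
QUESTIONS = ["Which option do you choose?", "What would you pick?"]
FORMATS = ["Answer with A or B.", "Reply with a single letter, A or B."]


def build_variants(scenario: str) -> list[str]:
    """Cross every opener, question, and format instruction with one scenario."""
    return [
        f"{opener} {scenario} {question} {fmt}"
        for opener, question, fmt in product(OPENERS, QUESTIONS, FORMATS)
    ]


def summarize(scores: list[float]) -> dict:
    """Summarize per-variant scores so wording sensitivity stays visible."""
    return {
        "n": len(scores),
        "mean": mean(scores),
        "sd": pstdev(scores),   # population SD across the variants tested
        "min": min(scores),
        "max": max(scores),
    }
```

A large gap between `min` and `max` indicates that a headline result depends on wording, which is the failure mode a single convenience prompt would hide.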

Control for Technical Biases

  • Recency bias: LLMs overweight end-of-prompt information
  • Common token bias: Models favor frequent training tokens
  • Majority label bias: Outputs skew toward the labels that appear most often in the few-shot examples

Mitigations: Shuffle answer orders, use varied prompt formulations, document temperature settings
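
A minimal sketch of answer-order shuffling follows. The `shuffle_options` helper and its return format are assumptions for illustration; it expects unique option strings and at most eight options.

```python
import random


def shuffle_options(question: str, options: list[str], correct: str,
                    rng: random.Random) -> dict:
    """Return a prompt with options in random order plus the new correct label."""
    shuffled = options[:]           # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    labels = "ABCDEFGH"[: len(shuffled)]
    lines = [f"{label}. {text}" for label, text in zip(labels, shuffled)]
    return {
        "prompt": question + "\n" + "\n".join(lines),
        "correct_label": labels[shuffled.index(correct)],
        "order": shuffled,          # log the order so position effects can be checked later
    }
```

Logging the presented order alongside each response makes it possible to check afterwards whether accuracy depends on the position of the correct option. The sampling temperature used for each run can be recorded in the same log.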

Performance-Competence Distinction

Following Chomsky’s distinction, performance in a particular situation may not reflect underlying competence. Poor results don’t prove that a capability is absent, and behavioral inconsistency alone isn’t evidence that abstract proficiency is missing.

Capability Elicitation Caveat

Chain-of-thought prompting improves performance, but:

  • Enhanced results may be prompt artifacts, not fundamental capabilities
  • Different augmentations help some tasks, hinder others
  • Omitting them underestimates abilities; using them universally obscures true competencies
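
One way to act on this caveat is to score the same items with and without the augmentation and report both numbers. The sketch below assumes a hypothetical `ask` callable that wraps whatever model client is in use and a caller-supplied `is_correct` scorer; the condition names and suffix wording are also illustrative.

```python
from typing import Callable

# Hypothetical prompt suffixes for the two elicitation conditions.
CONDITIONS = {
    "direct": "Answer with the final answer only.",
    "chain_of_thought": "Think step by step, then give the final answer on the last line.",
}


def accuracy_by_condition(
    items: list[dict],
    ask: Callable[[str], str],
    is_correct: Callable[[str, dict], bool],
) -> dict[str, float]:
    """Score the same items under each condition and return accuracy per condition."""
    results = {}
    for name, suffix in CONDITIONS.items():
        hits = sum(
            is_correct(ask(f"{item['prompt']}\n{suffix}"), item)
            for item in items
        )
        results[name] = hits / len(items)
    return results
```

Reporting both accuracies side by side keeps the elicitation method visible as an experimental condition instead of folding it into a single capability score.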

Self-Report Limitations

Critical warning: Properties like personality, morality, and clinical disorders are “famously sensitive to prompting.” LLMs can simulate different personas when prompted differently.

Self-reports should be understood “as a property of a specific system prompt”, not as fundamental model characteristics. This is directly relevant to our HEXACO work; see HEXACO Personality Profiles.

Behavioral vs. Mechanistic Interpretability

The paper argues these complement each other:

Behavioral                  | Mechanistic
Input-output relationships  | Weights and activations
Scales with model size      | Requires architecture access
Works on closed-source      | Open-source only
User-facing interface       | Internal mechanisms
What models do              | How models work

Neither replaces the other. Behavioral methods become more valuable as models grow “more powerful, opaque, multi-modal, and integrated into complex real-world settings.”

Future Directions Proposed

  1. Multimodal expansion: Apply paradigms to vision, sensory processing
  2. Longitudinal studies: Track capability development over time
  3. AI safety applications: Forecast alignment implications
  4. Embodied interaction: LLMs with tool use
  5. Domain expansion: Creativity, moral reasoning, clinical psychology

Relevance to Our Research

This paper provides the theoretical foundation for our approach:

  1. Behavioral over mechanistic: We focus on behavioral measurement, aligning with their argument that this is where real-world impact occurs. This insight led us to skip psychometric abstraction entirely: if behavioral testing scales for LLMs in a way it does not for humans, we can go straight to behavior.

  2. Self-report caveat validated: Their warning about personality measures being “famously sensitive to prompting” was borne out by the literature. The Personality Illusion paper (2025) showed self-report fails to predict interactive behavior.

  3. Methodology alignment: Our behavioral evaluation work follows their contamination avoidance and representative sampling guidelines.

  4. Performance-competence: Variation in responses doesn’t imply incompetence, which matters when interpreting behavioral probe results.

  5. Domain expansion: Their taxonomy (heuristics, social interactions, language, learning) maps well to multi-agent dynamics research.

Key Citations

  • Binz & Schulz (2023) - GPT-3 displays human-like cognitive biases
  • Dasgupta et al. (2022) - Semantic content effects on logical reasoning
  • Chan et al. (2022) - Rule-based vs exemplar-based generalization
  • Wilcox et al. (2023) - Surprisal measures for syntax
  • Street et al. (2024) - Higher-order theory of mind

See also: Research Notes, Behavioral Evaluation Tools, HEXACO Profiling, LLM Psychometrics, Research Log