research

Notes on AI behavior and human-AI control.

As AI systems grow more capable, understanding their behavior becomes critical. Machine psychology is emerging as a discipline to meet this need. Open Phil is funding black-box LLM psychology research. OpenAI says AI safety needs social scientists. These notes are my contribution to that conversation.

Core questions

I’m primarily interested in two overlapping research areas related to human-AI coordination. My frame of reference for these questions is shaped by my background in industrial/organizational psychology. More specifically, I am attempting to integrate the methods and findings of behavioral science and sociotechnical systems theory through the lens of cybernetics.

How do humans remain in control in an increasingly multi-agent world?

What are the technical and organizational structures that enable humans to understand and control increasingly complex AI systems? As multi-agent systems become more capable and less intelligible, how do we stay in the loop?
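As a toy illustration of one such structural pattern (my own sketch, not a reference to any existing framework; all names here are hypothetical), consider an approval gate that keeps a human in the loop for high-risk agent actions while letting low-risk ones proceed autonomously:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    description: str
    risk: float  # 0.0 (benign) to 1.0 (high-stakes); assumed to come from a separate scoring step

def run_with_oversight(actions: list[Action],
                       approve: Callable[[Action], bool],
                       risk_threshold: float = 0.5) -> list[Action]:
    """Execute only actions that are low-risk or explicitly human-approved."""
    executed = []
    for action in actions:
        if action.risk < risk_threshold or approve(action):
            executed.append(action)  # stand-in for actually executing the action
    return executed

# Example: a human reviewer who approves nothing; only low-risk actions pass.
plan = [Action("summarize logs", 0.1), Action("delete production data", 0.9)]
safe = run_with_oversight(plan, approve=lambda a: False)
```

The point of the pattern is that intelligibility is preserved at the boundary: however opaque the agents' internals, every consequential action passes through a checkpoint a human can inspect.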

Can AI systems learn to reason about psychology?

Evidence from systems like Plastic Labs’ Neuromancer suggests reasoning models can learn to track psychological constructs. If AI can hold a psychological model and reason about it, this opens several possibilities:

  1. Better state inference: Inferring the psychological states of users (humans or other AIs) from conversation or unstructured data
  2. Real-time intervention: Situational assessment and response strategies—detecting problems and responding to de-escalate, steer, or even shut down engagements
  3. Agent orchestration: Improving how agents orchestrate other agentic systems—better state inference lets orchestrators apply real-time interventions, yielding better performance and greater visibility for human overseers
  4. Scientific advancement: Building autonomous behavioral scientists and machine psychologists that can properly reason about these domains
  5. Applied domains: Psychotherapy, interviewing, negotiation, influence

Of these, I’m particularly interested in 3 and 4, which I believe are downstream of 1 and 2. I hope that readers of my work can also use it to improve outcomes in 5.
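To make the pipeline from state inference (1) to real-time intervention (2) concrete, here is a minimal sketch. A keyword heuristic stands in for a reasoning model's inference step, and a threshold policy stands in for situational assessment; both are illustrative assumptions of mine, not descriptions of any existing system:

```python
# Illustrative stand-in for model-based state inference: count conversation
# turns containing distress markers and normalize to [0, 1].
DISTRESS_MARKERS = {"furious", "hopeless", "give up", "hate"}

def infer_distress(turns: list[str]) -> float:
    """Crude proxy for an inferred psychological state (here, user distress)."""
    hits = sum(any(m in t.lower() for m in DISTRESS_MARKERS) for t in turns)
    return min(1.0, hits / max(1, len(turns)))

def choose_intervention(distress: float) -> str:
    """Map the inferred state to a response strategy (point 2 above)."""
    if distress >= 0.7:
        return "shut_down"    # end the engagement
    if distress >= 0.3:
        return "de_escalate"  # acknowledge and calm
    return "steer"            # continue, gently redirect

turns = ["I'm furious with this tool", "I might just give up"]
strategy = choose_intervention(infer_distress(turns))
```

In a real system the heuristic would be replaced by a reasoning model that holds an explicit psychological model of the interlocutor; the interesting research question is whether such a model's inferences are reliable enough to drive interventions like these.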

This work may also create a need for interpretability research. While I believe that mechanistic interpretability (neuroscience for AI minds) can produce more causal explanations, it has problems that I address in my note on machine psychology. Chief among them: as AI systems grow in parameter count, the approach becomes harder, while behavioral approaches are both more scalable (experiments are more likely to translate from one model to the next, regardless of size) and more practical, since they require no access to the underlying model weights and their results are often more interpretable to, and even implementable by, practitioners in the field, who will far outnumber the AI neuroscientists.

Ultimately, my goal is to help push the needle forward on practical AI safety, corrigibility, and control research.
