ICML 2026

MindZero

Learning Online Mental Reasoning With Zero Annotations

Shunchi Zhang1*, Jin Lu1*, Chuanyang Jin1*, Yichao Zhou2*, Zhining Zhang2, Tianmin Shu1

1 Johns Hopkins University    2 Peking University    * Equal contribution

01 / Online Mental Reasoning

Infer mental states from a partial behavior stream

Example of online mental reasoning

At every time step, the assistant maintains mental-state hypotheses over latent human goals.

  • Uncertainty: robust uncertainty over multiple hypotheses
  • Efficiency: fast inference for real-time assistance
  • Zero annotations: learning with zero ground-truth annotations

02 / Bayesian Theory of Mind

A standard target, but expensive to run online

$$P(m_t \mid s_{1:t}, a_{1:t}) \propto P(a_{1:t} \mid m_t, s_{1:t}) \cdot P(m_t)$$

Posterior: the mental-state distribution on the left-hand side

Likelihood: how likely the observed actions are under this mental state

Prior: whether the mental state itself is plausible

Model-based ToM (e.g., AutoToM) estimates this posterior through Bayesian networks, but each edge may require an LLM call, which makes inference slow and expensive.
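To make the posterior concrete, here is a minimal sketch that normalizes likelihood times prior over a small, finite set of candidate goals. The goal names and scores are made up for illustration, and the function `posterior` is hypothetical; it is not AutoToM's or MindZero's inference procedure, only the Bayes rule above applied to a discrete hypothesis set.

```python
import math


def posterior(log_likelihoods, log_priors):
    """Return P(m | s, a) over a finite hypothesis set.

    Both arguments map each hypothesis m to a log score:
    log P(a_{1:t} | m, s_{1:t}) and log P(m) respectively.
    """
    hypotheses = list(log_likelihoods)
    log_joint = [log_likelihoods[m] + log_priors[m] for m in hypotheses]
    # Log-sum-exp keeps the normalizer numerically stable.
    peak = max(log_joint)
    log_evidence = peak + math.log(sum(math.exp(x - peak) for x in log_joint))
    return {m: math.exp(x - log_evidence) for m, x in zip(hypotheses, log_joint)}


# Hypothetical scores for three candidate goals (illustration only).
log_lik = {
    "get_apple": math.log(0.6),
    "get_cup": math.log(0.3),
    "get_book": math.log(0.1),
}
log_prior = {m: math.log(1 / 3) for m in log_lik}  # uniform prior

post = posterior(log_lik, log_prior)
```

In a model-based pipeline each likelihood term may itself cost one or more LLM calls, so the expense grows with the number of hypotheses and time steps.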

03 / Self-Supervised Reinforcement Learning

Amortize model-based ToM into one forward pass

Self-supervised reinforcement learning pipeline

Model-based reasoning can be used as the training signal for amortization.

  • Proposal: a full particle set of candidate mental states
  • Scoring: a planner or frozen LLM scorer checks action likelihood
  • Optimization: reinforcement learning for non-differentiable scoring

At test time, the model can produce hypotheses in a single pass.
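The propose, score, optimize loop can be sketched with a toy score-function (REINFORCE) update, which needs only the scalar reward and no gradient through the scorer. Everything here is an assumption for illustration: the policy is a softmax over three fixed candidate goals rather than a language model, `black_box_score` stands in for a planner or frozen LLM scorer with made-up values, and the entropy term of the reward is omitted for brevity.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def black_box_score(hypothesis):
    """Hypothetical stand-in for a planner or frozen LLM scorer."""
    return {"goal_a": 0.1, "goal_b": 0.9, "goal_c": 0.3}[hypothesis]


def train(hypotheses, steps=2000, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(hypotheses)
    baseline = 0.0  # running mean reward, used to reduce variance
    for _ in range(steps):
        probs = softmax(logits)
        # Proposal: sample one candidate mental state from the policy.
        idx = rng.choices(range(len(hypotheses)), weights=probs)[0]
        # Scoring: query the non-differentiable scorer.
        reward = black_box_score(hypotheses[idx])
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Optimization: d log pi(idx) / d logit_j = 1[j == idx] - probs[j].
        for j in range(len(logits)):
            grad = (1.0 if j == idx else 0.0) - probs[j]
            logits[j] += lr * advantage * grad
    return softmax(logits)


HYPOTHESES = ["goal_a", "goal_b", "goal_c"]
final_probs = train(HYPOTHESES)
```

After training, sampling from the policy costs one softmax evaluation and no scorer calls, which is the sense in which the expensive scoring is amortized.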

04 / Objective

ELBO as the reward

ELBO reward decomposition

The ELBO reward favors hypotheses that explain the observed actions while preserving uncertainty.

  • Likelihood: high action likelihood
  • Prior: prior plausibility
  • Entropy: discourages early collapse
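The three terms match the standard evidence lower bound, ELBO(q) = E_q[log P(a | m, s)] + E_q[log P(m)] + H(q), which never exceeds the log evidence and equals it when q is the exact posterior. The sketch below evaluates it for a weighted particle set; the function name `elbo` and all numbers are illustrative assumptions, and any weighting or estimator MindZero applies to these terms is not stated here.

```python
import math


def elbo(weights, log_likelihoods, log_priors):
    """ELBO of a discrete distribution q given by particle weights."""
    expected_log_lik = sum(w * ll for w, ll in zip(weights, log_likelihoods))
    expected_log_prior = sum(w * lp for w, lp in zip(weights, log_priors))
    # Entropy term; zero-weight particles contribute nothing.
    entropy = -sum(w * math.log(w) for w in weights if w > 0)
    return expected_log_lik + expected_log_prior + entropy


# Hypothetical scores for three candidate mental states.
log_lik = [math.log(0.6), math.log(0.3), math.log(0.1)]
log_prior = [math.log(1 / 3)] * 3
log_evidence = math.log(sum(math.exp(a + b) for a, b in zip(log_lik, log_prior)))

exact = [math.exp(a + b - log_evidence) for a, b in zip(log_lik, log_prior)]
collapsed = [1.0, 0.0, 0.0]  # all mass on the single best hypothesis
uniform = [1 / 3] * 3        # ignores the evidence entirely

elbo_exact = elbo(exact, log_lik, log_prior)
elbo_collapsed = elbo(collapsed, log_lik, log_prior)
elbo_uniform = elbo(uniform, log_lik, log_prior)
```

In this toy case both the collapsed and the uniform distributions score below the exact posterior: the first loses the entropy term, the second loses likelihood.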

05 / Setup

Domains and tasks for evaluation

GridWorld and household evaluation domains

Domains

  • GridWorld: visual map input
  • Household: text-converted scenarios

Tasks

  • Story-based Theory-of-Mind question answering
  • Proactive assistance

06 / Question Answering

MindZero improves story-based QA accuracy

Question answering results

MindZero improves significantly over the pretrained checkpoint and is competitive with strong commercial models while using much less computation.

07 / Proactive Assistance

Proactive assistance tests online inference

Proactive assistance results

MindZero obtains the best speedup, collaborating efficiently with both simulated and real humans.

08 / Analysis

Mode seeking does not become mode collapse

Prediction quality over task progress
Prediction quality sharpens as more actions are observed.
Ablation study on diversity controls
Diversity depends on the prior design, the use of multiple hypotheses, and the entropy bonus.

09 / Takeaways

Mental reasoning can be learned with self-supervision

Problem: online mental reasoning with uncertainty, efficiency, and zero annotations

Method: online mental reasoning can be learned with self-supervision using RL

Result: efficient single-pass inference at test time and strong results across domains and tasks