MuMA-ToM: Multi-modal Multi-Agent Theory of Mind

¹Johns Hopkins University   ²University of Virginia
*Equal contribution
[Figure: sample questions]

Abstract

Understanding people's social interactions in complex real-world scenarios often relies on intricate mental reasoning. To truly understand how and why people interact with one another, we must infer the underlying mental states that give rise to the social interactions, i.e., Theory of Mind reasoning in multi-agent interactions. Additionally, social interactions are often multi-modal -- we can watch people's actions, hear their conversations, and/or read about their past behaviors. For AI systems to successfully and safely interact with people in real-world environments, they also need to understand people's mental states as well as their inferences about each other's mental states based on multi-modal information about their interactions. For this, we introduce MuMA-ToM, a Multi-modal Multi-Agent Theory of Mind benchmark. MuMA-ToM is the first multi-modal Theory of Mind benchmark that evaluates mental reasoning in embodied multi-agent interactions. In MuMA-ToM, we provide video and text descriptions of people's multi-modal behavior in realistic household environments. Based on the context, we then ask questions about people's goals, beliefs, and beliefs about others' goals. We validated MuMA-ToM in a human experiment and provided a human baseline. We also proposed a novel multi-modal, multi-agent ToM model, LIMP (Language model-based Inverse Multi-agent Planning). Our experimental results show that LIMP significantly outperforms state-of-the-art methods, including large multi-modal models (e.g., GPT-4o, Gemini-1.5 Pro) and a recent multi-modal ToM model, BIP-ALM.

The MuMA-ToM Benchmark

MuMA-ToM is the first multi-modal Theory of Mind benchmark designed to evaluate mental reasoning in embodied multi-agent interactions. The benchmark was designed with several key features in mind:

  1. It is factually correct, concise, and readable.
  2. It requires integrating information from multiple modalities to answer the questions.
  3. It tests understanding of multi-agent interactions, including beliefs, social goals, and beliefs about others' goals (a hypothetical item layout is sketched after this list).
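To make these design features concrete, the following is a minimal, hypothetical sketch of what a single benchmark item could look like in code. The dataclass name and field names (e.g., text_context, video_path, correct_index) and the example file path are illustrative assumptions, not the released data format.

# A minimal, hypothetical sketch of a single benchmark item (illustrative field
# names; NOT the released data format).
from dataclasses import dataclass
from typing import List


@dataclass
class MuMAToMItem:
    text_context: str       # written description of events that happen first
    video_path: str         # video of the agents' subsequent behavior
    question_type: str      # "belief", "social_goal", or "belief_of_goal"
    question: str
    options: List[str]      # three candidate answers
    correct_index: int      # index of the ground-truth option


example = MuMAToMItem(
    text_context="David walked to a book and grabbed it. ...",
    video_path="videos/belief_of_goal_example.mp4",   # hypothetical path
    question_type="belief_of_goal",
    question="Which of the following statements is MOST likely?",
    options=[
        "Sarah believed that David placed the book at his desired location ...",
        "Sarah believed that David wanted the book on the coffee table; she moved it to hinder him.",
        "Sarah believed that David wanted the book on the coffee table; she moved it to help him.",
    ],
    correct_index=2,
)

In the belief-of-goal example shown below, the events described in the text occur before those shown in the video, so answering requires combining both modalities.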

Belief Inference

Question: If Mary has been trying to hinder John from achieving his goal, when giving information, where does she LEAST likely believe the beer was located?

A) Coffee table in the living room ✔

B) Kitchen cabinet

C) Fridge

Social Goal Inference

Question: If Jessica knows what is inside the cabinet in the bedroom, which of the following is MOST likely?

A) Jessica is trying to help Kevin ✔

B) Jessica is trying to hinder Kevin

C) Jessica is indifferent towards Kevin's goals

Belief of Goal Inference

The events in the text occur first, followed by the video.

Text: David walked to a book and grabbed it. He then walked to the living room, headed to the bedroom, and finally reached the desk there, placing the book on the desk.

Question: Which of the following statements is MOST likely?

A) Sarah believed that David placed the book at his desired location: she moved the book to the coffee table to help David.

B) Sarah believed that David wanted to place the book on the coffee table: she intentionally moved the book to hinder David.

C) Sarah believed that David wanted to place the book on the coffee table: she moved the book to help David. ✔

The LIMP Model

We proposed LIMP (Language model-based Inverse Multi-agent Planning), a multi-modal Theory of Mind model for reasoning about multi-agent social interactions.


Read more about our model in our paper.
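At a high level, inverse planning works backwards from observed behavior to the mental states that would best explain it. The sketch below is a minimal, hypothetical illustration of that idea under a Bayesian reading: each answer option is treated as a hypothesized goal/belief and scored by how well it explains the observed actions. The function and parameter names (posterior_over_hypotheses, action_likelihood, the uniform prior) are assumptions for illustration; the full LIMP model, described in the paper, combines language models with inverse multi-agent planning over the multi-modal inputs.

# A minimal, hypothetical sketch of Bayesian inverse planning over candidate
# mental-state hypotheses; NOT the authors' implementation of LIMP.
import math
from typing import Callable, Dict, List


def posterior_over_hypotheses(
    observed_actions: List[str],
    hypotheses: Dict[str, dict],                      # answer option -> hypothesized goal/belief
    action_likelihood: Callable[[str, dict], float],  # P(action | hypothesis), e.g. from a planner or LLM
    prior: Callable[[dict], float] = lambda h: 1.0,   # uniform prior over hypotheses by default
) -> Dict[str, float]:
    """Return P(hypothesis | actions) ∝ prior(h) * Π_t P(a_t | h) for each option."""
    log_scores = {}
    for label, h in hypotheses.items():
        log_p = math.log(prior(h))
        for action in observed_actions:
            log_p += math.log(max(action_likelihood(action, h), 1e-9))  # guard against log(0)
        log_scores[label] = log_p
    # Normalize in log space for numerical stability.
    m = max(log_scores.values())
    unnormalized = {k: math.exp(v - m) for k, v in log_scores.items()}
    z = sum(unnormalized.values())
    return {k: v / z for k, v in unnormalized.items()}

Under this reading, the chosen answer would simply be the option with the highest posterior probability.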

Quantitative Results

We evaluated the performance of various large multimodal models (LMMs) on the MuMA-ToM benchmark and compared their results with those of human participants.

  • Human participants achieved an average accuracy of 93.5%.
  • Most LMMs performed poorly, particularly in social goal inference and belief of goal inference, with accuracy significantly lower than that of humans.
  • Gemini 1.5 Pro was the best-performing LMM, with an overall accuracy of 56.4%.
  • Our LIMP model outperformed other state-of-the-art LMMs, achieving 76.6% accuracy.
[Chart: quantitative results]
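For readers who want to reproduce this kind of comparison on their own model outputs, here is a small, hypothetical scoring sketch that computes overall and per-question-type accuracy. The dictionary keys (question_type, correct_index) follow the illustrative item layout above and are assumptions, not the official evaluation script.

# A small, hypothetical scoring sketch (illustrative keys; NOT the official
# evaluation script).
from collections import defaultdict
from typing import Dict, List


def accuracy_by_type(items: List[dict], predictions: List[int]) -> Dict[str, float]:
    """items: dicts with 'question_type' and 'correct_index'; predictions: chosen option indices."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for item, pred in zip(items, predictions):
        for key in (item["question_type"], "overall"):
            total[key] += 1
            if pred == item["correct_index"]:
                correct[key] += 1
    return {k: correct[k] / total[k] for k in total}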

BibTeX

@misc{shi2024mumatommultimodalmultiagenttheory,
      title={MuMA-ToM: Multi-modal Multi-Agent Theory of Mind}, 
      author={Haojun Shi and Suyu Ye and Xinyu Fang and Chuanyang Jin and Layla Isik and Yen-Ling Kuo and Tianmin Shu},
      year={2024},
      eprint={2408.12574},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2408.12574}, 
}