Belief Inference
Question: If Mary has been trying to hinder John from achieving his goal, when giving information, where does she LEAST likely believe the beer was located?
A) Coffee table in the living room ✓
B) Kitchen cabinet
C) Fridge
Understanding people's social interactions in complex real-world scenarios often relies on intricate mental reasoning. To truly understand how and why people interact with one another, we must infer the underlying mental states that give rise to the social interactions, i.e., Theory of Mind reasoning in multi-agent interactions. Additionally, social interactions are often multi-modal -- we can watch people's actions, hear their conversations, and/or read about their past behaviors. For AI systems to successfully and safely interact with people in real-world environments, they also need to understand people's mental states as well as their inferences about each other's mental states based on multi-modal information about their interactions. For this, we introduce MuMA-ToM, a Multi-modal Multi-Agent Theory of Mind benchmark. MuMA-ToM is the first multi-modal Theory of Mind benchmark that evaluates mental reasoning in embodied multi-agent interactions. In MuMA-ToM, we provide video and text descriptions of people's multi-modal behavior in realistic household environments. Based on the context, we then ask questions about people's goals, beliefs, and beliefs about others' goals. We validated MuMA-ToM in a human experiment and provided a human baseline. We also proposed a novel multi-modal, multi-agent ToM model, LIMP (Language model-based Inverse Multi-agent Planning). Our experimental results show that LIMP significantly outperforms state-of-the-art methods, including large multi-modal models (e.g., GPT-4o, Gemini-1.5 Pro) and a recent multi-modal ToM model, BIP-ALM.
MuMA-ToM is the first multi-modal Theory of Mind benchmark designed to evaluate mental reasoning in embodied multi-agent interactions. The benchmark was designed with several key features in mind:
Question: If Jessica knows what is inside the cabinet in the bedroom, which of the following is MOST likely?
A) Jessica is trying to help Kevin ✓
B) Jessica is trying to hinder Kevin
C) Jessica is indifferent towards Kevin's goals
The events in the text occur first, followed by the video.
Text: David walked to a book and grabbed it. He then walked to the living room, headed to the bedroom, and finally reached the desk there, placing the book on the desk.
Question: Which of the following statements is MOST likely?
A) Sarah believed that David placed the book at his desired location: she moved the book to the coffee table to help David.
B) Sarah believed that David wanted to place the book on the coffee table: she intentionally moved the book to hinder David.
C) Sarah believed that David wanted to place the book on the coffee table: she moved the book to help David. ✓
We proposed the Language model-based Inverse Multi-agent Planning (LIMP) model for modelling social interactions.
Read more about our model in our paper.
We evaluated various large multi-modal models (LMMs) on the MuMA-ToM benchmark and compared their results with those of human participants.
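As a rough illustration of how a multiple-choice evaluation like this can be scored, the sketch below computes accuracy per question type. It is not the authors' evaluation code: the `Question` class, its field names, the question-type labels, and the sample predictions are all hypothetical, chosen only to mirror the three-option (A/B/C) question format shown above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """Hypothetical record for one three-option benchmark question."""
    qid: str     # illustrative identifier, not from the benchmark
    qtype: str   # illustrative type label, e.g. "belief"
    answer: str  # correct option: "A", "B", or "C"


def accuracy_by_type(questions, predictions):
    """Return {question type: fraction answered correctly}.

    `predictions` maps a question id to the chosen option letter.
    A missing prediction is counted as incorrect.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        total[q.qtype] += 1
        chosen = predictions.get(q.qid, "").strip().upper()
        if chosen == q.answer:
            correct[q.qtype] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}


# Toy data only: the answer keys follow the three example questions above,
# and the predictions are made up for illustration.
questions = [
    Question("q1", "belief", "A"),
    Question("q2", "social_goal", "A"),
    Question("q3", "belief_of_goal", "C"),
]
predictions = {"q1": "A", "q2": "b", "q3": "C"}
scores = accuracy_by_type(questions, predictions)
```

With three options per question, a model that guesses uniformly at random would be expected to score about one third, which is a useful reference point when reading per-type accuracies.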
@misc{shi2024mumatommultimodalmultiagenttheory,
title={MuMA-ToM: Multi-modal Multi-Agent Theory of Mind},
author={Haojun Shi and Suyu Ye and Xinyu Fang and Chuanyang Jin and Layla Isik and Yen-Ling Kuo and Tianmin Shu},
year={2024},
eprint={2408.12574},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2408.12574},
}