MuMA-ToM: Multi-modal Multi-Agent Theory of Mind

Haojun Shi¹*, Suyu Ye¹*, Xinyu Fang¹, Chuanyang Jin¹, Leyla Isik¹, Yen-Ling Kuo², Tianmin Shu¹

¹Johns Hopkins University ²University of Virginia

*Equal contribution

MuMA-ToM systematically evaluates the cognitive ability to understand multi-agent social interactions while fusing information from multimodal data.

Read more about our benchmark in our paper.

Method	Belief	Social Goal	Belief of Goal	All
Human	98.9	94.4	87.1	93.5
Gemini 1.5 Flash	53.9	33.0	41.4	42.7
Gemini 1.5 Pro	78.9	43.9	46.9	56.4
Llava 1.6 13B	70.2	43.2	17.9	43.7
Llava 1.6 34B	93.6	37.2	27.5	52.8
GPT-4o	67.9	39.6	44.4	50.6
InternVL 2 8B	62.2	44.6	45.1	50.6
InternVL 2 26B	59.3	44.9	35.5	46.6
VideoLlama 2 7B	70.1	45.6	37.7	51.1
BIP-ALM	41.2	34.1	30.6	33.9
LIMP	93.4	67.7	68.7	76.6
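
To make the comparison in the table easier to read, the following minimal sketch ranks the methods by their "All" score and computes each method's gap to the human baseline. The numbers are copied from the table. The names `SCORES` and `rank_by_overall` are hypothetical helpers introduced only for this illustration, and the sketch assumes the scores are percentages on a common scale.

```python
# Scores copied from the table above, in column order:
# (Belief, Social Goal, Belief of Goal, All).
# Assumption: all values are percentages on the same scale.
SCORES = {
    "Human": (98.9, 94.4, 87.1, 93.5),
    "Gemini 1.5 Flash": (53.9, 33.0, 41.4, 42.7),
    "Gemini 1.5 Pro": (78.9, 43.9, 46.9, 56.4),
    "Llava 1.6 13B": (70.2, 43.2, 17.9, 43.7),
    "Llava 1.6 34B": (93.6, 37.2, 27.5, 52.8),
    "GPT-4o": (67.9, 39.6, 44.4, 50.6),
    "InternVL 2 8B": (62.2, 44.6, 45.1, 50.6),
    "InternVL 2 26B": (59.3, 44.9, 35.5, 46.6),
    "VideoLlama 2 7B": (70.1, 45.6, 37.7, 51.1),
    "BIP-ALM": (41.2, 34.1, 30.6, 33.9),
    "LIMP": (93.4, 67.7, 68.7, 76.6),
}

ALL_COLUMN = 3  # index of the "All" column in each tuple


def rank_by_overall(scores):
    """Return (method, overall, gap_to_human) tuples, best overall score first.

    The human row is used as the reference and is excluded from the ranking.
    """
    human_all = scores["Human"][ALL_COLUMN]
    rows = [
        (name, vals[ALL_COLUMN], round(human_all - vals[ALL_COLUMN], 1))
        for name, vals in scores.items()
        if name != "Human"
    ]
    # sorted() is stable, so tied methods keep their table order.
    return sorted(rows, key=lambda row: row[1], reverse=True)


for name, overall, gap in rank_by_overall(SCORES):
    print(f"{name:<18} {overall:5.1f}  (gap to human: {gap:4.1f})")
```

Under these assumptions, LIMP is the top-ranked method, yet it still trails the human score of 93.5 by 16.9 points overall.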