MuMA-ToM: Multi-modal Multi-Agent Theory of Mind

¹Johns Hopkins University, ²University of Virginia
*Equal contribution

MuMA-ToM systematically evaluates the cognitive ability to understand multi-agent social interactions while fusing information from multimodal data.

Read more about our benchmark in our paper

| Method | Belief | Social Goal | Belief of Goal | All |
|---|---|---|---|---|
| Human | 98.9 | 94.4 | 87.1 | 93.5 |
| Gemini 1.5 Flash | 53.9 | 33.0 | 41.4 | 42.7 |
| Gemini 1.5 Pro | 78.9 | 43.9 | 46.9 | 56.4 |
| Llava 1.6 13B | 70.2 | 43.2 | 17.9 | 43.7 |
| Llava 1.6 34B | 93.6 | 37.2 | 27.5 | 52.8 |
| GPT-4o | 67.9 | 39.6 | 44.4 | 50.6 |
| InternVL 2 8B | 62.2 | 44.6 | 45.1 | 50.6 |
| InternVL 2 26B | 59.3 | 44.9 | 35.5 | 46.6 |
| VideoLlama 2 7B | 70.1 | 45.6 | 37.7 | 51.1 |
| BIP-ALM | 41.2 | 34.1 | 30.6 | 33.9 |
| LIMP | 93.4 | 67.7 | 68.7 | 76.6 |