MuMA-ToM systematically evaluates the cognitive ability to understand multi-agent social interactions while fusing information from multimodal data.
Read more about our benchmark in our paper. The table below reports accuracy (%) for each question type and overall.
Method | Belief | Social Goal | Belief of Goal | All |
---|---|---|---|---|
Human | 98.9 | 94.4 | 87.1 | 93.5 |
Gemini 1.5 Flash | 53.9 | 33.0 | 41.4 | 42.7 |
Gemini 1.5 Pro | 78.9 | 43.9 | 46.9 | 56.4 |
Llava 1.6 13B | 70.2 | 43.2 | 17.9 | 43.7 |
Llava 1.6 34B | 93.6 | 37.2 | 27.5 | 52.8 |
GPT-4o | 67.9 | 39.6 | 44.4 | 50.6 |
InternVL 2 8B | 62.2 | 44.6 | 45.1 | 50.6 |
InternVL 2 26B | 59.3 | 44.9 | 35.5 | 46.6 |
VideoLlama 2 7B | 70.1 | 45.6 | 37.7 | 51.1 |
BIP-ALM | 41.2 | 34.1 | 30.6 | 33.9 |
LIMP | 93.4 | 67.7 | 68.7 | 76.6 |
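
To make the comparison easier to read, here is a minimal Python sketch that finds the strongest model in each column and its gap to human performance. The numbers are copied from the table above; the names `SCORES`, `COLUMNS`, `best_model`, and `gap_to_human` are hypothetical helpers introduced for illustration and are not part of any MuMA-ToM release.

```python
# Scores copied from the results table above.
# Column order: Belief, Social Goal, Belief of Goal, All.
COLUMNS = ("Belief", "Social Goal", "Belief of Goal", "All")

SCORES = {
    "Human": (98.9, 94.4, 87.1, 93.5),
    "Gemini 1.5 Flash": (53.9, 33.0, 41.4, 42.7),
    "Gemini 1.5 Pro": (78.9, 43.9, 46.9, 56.4),
    "Llava 1.6 13B": (70.2, 43.2, 17.9, 43.7),
    "Llava 1.6 34B": (93.6, 37.2, 27.5, 52.8),
    "GPT-4o": (67.9, 39.6, 44.4, 50.6),
    "InternVL 2 8B": (62.2, 44.6, 45.1, 50.6),
    "InternVL 2 26B": (59.3, 44.9, 35.5, 46.6),
    "VideoLlama 2 7B": (70.1, 45.6, 37.7, 51.1),
    "BIP-ALM": (41.2, 34.1, 30.6, 33.9),
    "LIMP": (93.4, 67.7, 68.7, 76.6),
}


def best_model(column: str) -> str:
    """Return the highest-scoring non-human method for a column."""
    idx = COLUMNS.index(column)
    models = {name: row for name, row in SCORES.items() if name != "Human"}
    return max(models, key=lambda name: models[name][idx])


def gap_to_human(method: str, column: str) -> float:
    """Return the human score minus the method's score, in points."""
    idx = COLUMNS.index(column)
    return round(SCORES["Human"][idx] - SCORES[method][idx], 1)


if __name__ == "__main__":
    for column in COLUMNS:
        top = best_model(column)
        print(f"{column}: best = {top}, gap to human = {gap_to_human(top, column)}")
```

Run as a script, this prints one line per column. Excluding the human baseline from `best_model` is a design choice made here so that the gap is always measured against a model.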