SimWorld-Robotics

Synthesizing Photorealistic and Dynamic Urban Environments for Multimodal Robot Navigation and Collaboration

NeurIPS 2025
1University of Virginia
2UC San Diego
3Johns Hopkins University
4Carnegie Mellon University
5University of Michigan
*Indicates Equal Contribution, †Indicates Equal Advising

Abstract

Recent advances in foundation models have shown promising results in developing generalist robots that can perform diverse tasks in open-ended scenarios given multimodal inputs. However, current work has mainly focused on indoor, household scenarios. In this work, we present SimWorld-Robotics (SWR), a simulation platform for embodied AI in large-scale, photorealistic urban environments. Built on Unreal Engine 5, SWR procedurally generates unlimited photorealistic urban scenes populated with dynamic elements such as pedestrians and traffic systems, surpassing prior urban simulations in realism, complexity, and scalability. It also supports multi-robot control and communication. With these key features, we build two challenging robot benchmarks: (1) a multimodal instruction-following task, where a robot must follow vision-language navigation instructions to reach a destination in the presence of pedestrians and traffic; and (2) a multi-agent search task, where two robots must communicate to cooperatively locate and meet each other. Unlike existing benchmarks, these two new benchmarks comprehensively evaluate a wide range of critical robot capabilities in realistic scenarios, including (1) multimodal instruction grounding, (2) 3D spatial reasoning in large environments, (3) safe, long-range navigation among people and traffic, (4) multi-robot collaboration, and (5) grounded communication. Our experimental results demonstrate that state-of-the-art models, including vision-language models (VLMs), struggle with our tasks, lacking the robust perception, reasoning, and planning abilities necessary for urban environments.

Features of SimWorld-Robotics
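The abstract above describes procedural city generation and multi-robot support at a high level. As a rough illustration only, the sketch below shows how such a generation pipeline might be configured from Python; every name in it (CityConfig, generate_city, and the parameter names) is a hypothetical placeholder, not the actual SimWorld-Robotics API.

```python
# Hypothetical usage sketch; SimWorld-Robotics does not necessarily expose this interface.
# All names below (CityConfig, generate_city, ...) are illustrative placeholders.
from dataclasses import dataclass


@dataclass
class CityConfig:
    """Illustrative parameters for procedural city generation."""
    num_blocks: int = 16            # size of the generated street grid
    pedestrian_density: float = 0.5  # how crowded the sidewalks are
    traffic_density: float = 0.5     # how busy the road network is
    seed: int = 0                    # fixes the layout so scenes are reproducible


def generate_city(config: CityConfig) -> dict:
    """Placeholder for a procedural generator that would return a scene description
    (road graph, buildings, robot spawn points) to be instantiated in Unreal Engine 5."""
    return {"seed": config.seed, "blocks": config.num_blocks, "spawn_points": []}


if __name__ == "__main__":
    scene = generate_city(CityConfig(num_blocks=32, pedestrian_density=0.8, seed=42))
    print(scene)
```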

SimWorld-MMNav Benchmark

SimWorld-MMNav is a multimodal robot navigation benchmark in which a robot must follow multimodal instructions (paired language directions and visual hints) to reach a target in a large-scale, photorealistic, dynamic city environment. Success requires grounding verbal and visual references in the 3D environment from the robot's observations while adapting to real-world complexities such as traffic and pedestrians.

SimWorld-MMNav: Multimodal Instruction Following Navigation
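To make the task structure concrete, here is a minimal, self-contained sketch of an MMNav-style episode loop. The dataclasses, field names, and action strings are hypothetical stand-ins for illustration, not the benchmark's actual interface.

```python
# Illustrative MMNav-style episode loop (hypothetical names, not the real benchmark API).
from dataclasses import dataclass
from typing import List


@dataclass
class Instruction:
    text: str                # natural-language route description
    visual_hints: List[str]  # paths to landmark images paired with the instruction


@dataclass
class Observation:
    rgb: bytes               # egocentric camera frame from the robot
    position: tuple          # (x, y, z) in the city frame
    done: bool = False       # set when the robot reaches the target or the episode ends


def dummy_policy(instr: Instruction, obs: Observation) -> str:
    """Stand-in for a VLM-based policy that grounds the instruction in the observation."""
    return "move_forward"    # e.g. one of move_forward / turn_left / turn_right / stop


def run_episode(instr: Instruction, max_steps: int = 500) -> None:
    obs = Observation(rgb=b"", position=(0.0, 0.0, 0.0))
    for _ in range(max_steps):
        action = dummy_policy(instr, obs)
        # In the real benchmark the simulator would execute `action`, advance
        # pedestrians and traffic, and return the next observation here.
        if obs.done:
            break


if __name__ == "__main__":
    run_episode(Instruction(
        text="Head toward the clock tower, cross at the light, and stop by the fountain.",
        visual_hints=["hint_clock_tower.png"],  # hypothetical example inputs
    ))
```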

SimWorld-MRS Benchmark

SimWorld-MRS (Multi-Robot Search) is a collaborative robotics benchmark that evaluates multi-agent coordination and communication. Two robots must locate and meet each other in a complex urban environment, which requires exchanging useful, grounded messages and coordinating their actions.

SimWorld-MRS: Multi-Robot Search and Collaboration
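The following is a minimal sketch of how a two-robot search loop with message passing might be structured. It is an assumption-laden illustration; the Robot class, its methods, and the message format are invented for this example and are not the MRS interface.

```python
# Illustrative two-robot search loop with message passing (hypothetical, not the MRS API).
from dataclasses import dataclass, field
from typing import List


@dataclass
class Robot:
    name: str
    position: tuple
    inbox: List[str] = field(default_factory=list)  # messages received from the partner

    def communicate(self) -> str:
        """Stand-in for a language model producing a grounded message,
        e.g. describing visible landmarks so the partner can localize it."""
        return f"{self.name}: I am near a crosswalk at {self.position}."

    def act(self) -> str:
        """Choose a navigation action from own observations plus partner messages."""
        return "move_forward"


def search_step(a: Robot, b: Robot) -> None:
    # Each step, both robots broadcast a message, then act on what they received.
    a.inbox.append(b.communicate())
    b.inbox.append(a.communicate())
    print(a.act(), b.act())


if __name__ == "__main__":
    search_step(Robot("robot_0", (0, 0)), Robot("robot_1", (120, 45)))
```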

BibTeX

@inproceedings{zhuang2025simworld,
  title={SimWorld-Robotics: Synthesizing Photorealistic and Dynamic Urban Environments for Multimodal Robot Navigation and Collaboration},
  author={Zhuang, Yan and Ren, Jiawei and Ye, Xiaokang and Shen, Jianzhi and Zhang, Ruixuan and Yue, Tianai and Faayez, Muhammad and He, Xuhong and Zhang, Xiyan and Ma, Ziqiao and Qin, Lianhui and Hu, Zhiting and Shu, Tianmin},
  booktitle={Advances in Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=EyOtIOmMUh}
}