We propose CaPE (Code as Path Editor), a safe and interpretable multimodal path planning framework for multi-agent cooperation. CaPE enables robots to adapt their motion plans in response to natural language communication by generating structured, human-readable path-editing programs that are validated by a model-based planner.
Core idea: Use a vision-language model to synthesize structured path-editing programs, then apply planner-based verification to ensure safety. This enables open-ended language-driven coordination while preserving robustness and interpretability.
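The verify-then-apply loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names (`EditProgram`, `validate_path`, `apply_edit`) and the simple clearance check standing in for the model-based planner are all hypothetical, and a real system would obtain the edit program from a vision-language model rather than define it by hand.

```python
# Minimal sketch of a CaPE-style verify-then-apply loop.
# All names are illustrative assumptions, not the authors' API; the
# clearance check below stands in for a full model-based planner.
from typing import Callable, List, Tuple

Waypoint = Tuple[float, float]
Path = List[Waypoint]
EditProgram = Callable[[Path], Path]  # a structured path-editing program

def validate_path(path: Path, obstacles: List[Waypoint],
                  clearance: float = 0.5) -> bool:
    """Planner-style safety check: every waypoint keeps a minimum
    clearance from every known obstacle."""
    return all(
        (wx - ox) ** 2 + (wy - oy) ** 2 >= clearance ** 2
        for wx, wy in path
        for ox, oy in obstacles
    )

def apply_edit(path: Path, program: EditProgram,
               obstacles: List[Waypoint]) -> Path:
    """Apply a language-derived path edit only if the result is verified."""
    candidate = program(path)
    if validate_path(candidate, obstacles):
        return candidate  # accept the edited, planner-verified path
    return path           # reject unsafe edits; keep the original plan

# Edit program for "move aside to the right": shift all waypoints +1.0 m in y.
shift_right: EditProgram = lambda p: [(x, y + 1.0) for x, y in p]

original = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
obstacles = [(1.0, 0.2)]  # the original path passes too close to this point

edited = apply_edit(original, shift_right, obstacles)
```

Because the edit is expressed as a program over waypoints rather than raw trajectory deltas, both the requested change and the planner's accept/reject decision remain inspectable, which is the interpretability property the framework targets.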
CaPE enables safe and interpretable language-guided coordination across multi-robot interaction, household human-robot interaction, and real-world human-robot joint-carrying tasks.
Human-Robot Teaming: Without CaPE, the human and robot collide at the doorway because they cannot communicate to coordinate their movements. With CaPE, the robot understands the human instruction to move aside and allow passage, successfully avoiding the collision.
Human-Robot Joint Lifting: With CaPE, the robot interprets the human’s verbal instruction and adapts its motion, taking the rightmost path while avoiding obstacles.