RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users

What is the RealWebAssist Benchmark?

RealWebAssist is the first sequential instruction-following benchmark that evaluates long-horizon web assistance with real-world users. It features:

🧠 Real users: Instructions come from real-world users, not annotators.
📋 Sequential tasks: Models follow long, evolving instruction sequences.
🌐 Real websites: Tasks span diverse, real-life websites and GUIs.
🖱️ GUI grounding: Agents must choose the right spot on the webpage.
🗣️ Speech input: Includes spoken instructions along with ground truth captions.
🔍 Real-world challenges: Ambiguity, context, planning, and routine learning.
📉 Hard for SOTA models: Existing models struggle with the benchmark.
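
To make the GUI grounding item concrete, here is a minimal Python sketch of how a predicted click could be scored. It assumes each step is annotated with one or more acceptable rectangular regions and that a click counts as correct when it lands inside any of them. The `Box` class, the pixel-coordinate format, and the function names are hypothetical and used only for illustration; they are not the benchmark's actual data format or evaluation code.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned region in pixel coordinates (hypothetical annotation format)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # Edges are treated as inside the region.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def click_is_correct(click: Tuple[float, float], acceptable: Sequence[Box]) -> bool:
    """Assumed rule: a click is correct if it falls inside any acceptable region."""
    x, y = click
    return any(box.contains(x, y) for box in acceptable)


def step_accuracy(
    clicks: Sequence[Tuple[float, float]],
    boxes_per_step: Sequence[Sequence[Box]],
) -> float:
    """Fraction of steps whose predicted click is correct."""
    if len(clicks) != len(boxes_per_step):
        raise ValueError("need exactly one click per step")
    if not clicks:
        return 0.0
    hits = sum(click_is_correct(c, b) for c, b in zip(clicks, boxes_per_step))
    return hits / len(clicks)
```

For example, with two steps where only the first click lands inside its region, `step_accuracy([(10, 10), (200, 200)], [[Box(0, 0, 20, 20)], [Box(0, 0, 50, 50)]])` evaluates to half of the steps being correct.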

Examples of Tasks and Websites

RealWebAssist includes tasks collected from real users across shopping, food, entertainment, and travel websites, ranging from booking flights to ordering dinner or buying a gift.

Challenges of RealWebAssist Benchmark

RealWebAssist features multiple challenges that can arise in long-horizon web assistance with real-world users. These include the spatial and temporal reasoning needed to understand ambiguous, context-dependent user instructions; planning multiple steps of actions to reach the goal an instruction communicates; and learning user-specific routines.

BibTeX

@article{ye2025realwebassist,
  title={RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users},
  author={Ye, Suyu and Shi, Haojun and Shih, Darren and Yun, Hyokun and Roosta, Tanya and Shu, Tianmin},
  journal={arXiv preprint arXiv:2504.10445},
  year={2025}
}

AAAI 2026