RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users

¹Johns Hopkins University, ²Amazon.com
*Equal contribution

What is the RealWebAssist Benchmark?

RealWebAssist is the first sequential instruction-following benchmark that evaluates long-horizon web assistance with real-world users. It features:

  • Real users: Instructions come from real-world users, not annotators.
  • Sequential tasks: Models follow long, evolving instruction sequences.
  • Real websites: Tasks span diverse, real-life websites and GUIs.
  • GUI grounding: Agents must choose the right spot on the webpage.
  • Speech input: Includes spoken instructions along with ground truth captions.
  • Real-world challenges: Ambiguity, context, planning, and routine learning.
  • Hard for SOTA models: Existing models struggle with the benchmark.
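
To make the GUI grounding and sequential-task points concrete, here is a minimal Python sketch of how a predicted click could be scored over a sequence of instructions. It is an illustration under stated assumptions, not the benchmark's official evaluation code: the `Box` format, the rule that a click is correct when it falls inside any acceptable target region, and the names `click_is_correct` and `step_accuracy` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned region in pixel coordinates (hypothetical annotation format)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # Points on the border count as inside the box.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def click_is_correct(click: Tuple[float, float], targets: Sequence[Box]) -> bool:
    """Assumed rule: a click is correct if it lands inside any acceptable target box."""
    x, y = click
    return any(box.contains(x, y) for box in targets)


def step_accuracy(
    clicks: Sequence[Tuple[float, float]],
    targets_per_step: Sequence[Sequence[Box]],
) -> float:
    """Fraction of steps in an instruction sequence whose click hits a target."""
    if len(clicks) != len(targets_per_step):
        raise ValueError("need exactly one click per step")
    if not clicks:
        return 0.0
    hits = sum(click_is_correct(c, t) for c, t in zip(clicks, targets_per_step))
    return hits / len(clicks)


# Two steps: the first click lands inside its target, the second misses both options.
example_targets: List[List[Box]] = [
    [Box(100, 200, 180, 240)],
    [Box(10, 10, 50, 30), Box(300, 10, 340, 30)],
]
example_clicks = [(120, 220), (200, 20)]
print(step_accuracy(example_clicks, example_targets))  # 0.5
```

Allowing several boxes per step reflects the possibility that more than one on-screen element satisfies an instruction; whether the benchmark annotates steps this way is an assumption of the sketch.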

Examples of Tasks and Websites

RealWebAssist includes tasks collected from real users across shopping, food, entertainment, and travel websites, ranging from booking flights to ordering dinner or buying a gift.

[Figure: Example tasks and websites]

Challenges of the RealWebAssist Benchmark

RealWebAssist captures several challenges that arise in long-horizon web assistance with real-world users: spatial and temporal reasoning to interpret ambiguous, context-dependent instructions; multi-step planning to reach the goal an instruction communicates; and learning user-specific routines.

[Figure: Challenges of the RealWebAssist benchmark]

BibTeX

@article{ye2025realwebassist,
  title={RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users},
  author={Ye, Suyu and Shi, Haojun and Shih, Darren and Yun, Hyokun and Roosta, Tanya and Shu, Tianmin},
  journal={arXiv preprint arXiv:2504.10445},
  year={2025}
}