-
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
- Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong
- 🏛️ Institutions: HKU, NTU, Salesforce
- 📅 Date: December 5, 2024
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
- 🔑 Key: [model], [dataset], [planning], [reasoning], [Aguvis], [visual grounding]
- 📖 TLDR: This paper introduces Aguvis, a unified pure vision-based framework for autonomous GUI agents that operates across various platforms. It leverages image-based observations and grounds natural language instructions to visual elements, employing a consistent action space to ensure cross-platform generalization. The approach integrates explicit planning and reasoning within the model, enhancing its ability to autonomously navigate and interact with complex digital environments. A large-scale dataset of GUI agent trajectories is constructed, incorporating multimodal reasoning and grounding. Comprehensive experiments demonstrate that Aguvis surpasses previous state-of-the-art methods in both offline and real-world online scenarios, making it the first fully autonomous pure vision GUI agent capable of performing tasks independently, without collaboration with external closed-source models. All datasets, models, and training recipes are open-sourced to facilitate future research.
-
The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use
- Siyuan Hu, Mingyu Ouyang, Difei Gao, Mike Zheng Shou
- 🏛️ Institutions: NUS
- 📅 Date: November 15, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Desktop]
- 🔑 Key: [framework], [Claude 3.5 Computer Use], [GUI automation], [planning], [action], [critic]
- 📖 TLDR: This study evaluates Claude 3.5 Computer Use, an AI model enabling end-to-end language-to-desktop actions, through curated tasks across various domains. It introduces an out-of-the-box framework for deploying API-based GUI automation models, analyzing the model's planning, action execution, and adaptability to dynamic environments.
-
Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents
- Yu Gu, Boyuan Zheng, Boyu Gou, Kai Zhang, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, Huan Sun, Yu Su
- 🏛️ Institutions: OSU, Orby AI
- 📅 Date: November 10, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [framework], [WebDreamer], [model-based planning], [world model]
- 📖 TLDR: This paper investigates whether Large Language Models (LLMs) can function as world models within web environments, enabling model-based planning for web agents. It introduces WebDreamer, a framework that leverages LLMs to simulate potential action sequences in web environments, and demonstrates significant performance improvements over reactive baselines on benchmarks like VisualWebArena and Mind2Web-Live. The findings suggest that LLMs can model the dynamic nature of the internet, paving the way for advances in automated web interaction and opening new research avenues in optimizing LLMs for complex, evolving environments.
-
Language Agents: Foundations, Prospects, and Risks
- Yu Su, Diyi Yang, Shunyu Yao, Tao Yu
- 🏛️ Institutions: OSU, Stanford, Princeton, HKU
- 📅 Date: November 2024
- 📑 Publisher: EMNLP 2024
- 💻 Env: [Misc]
- 🔑 Key: [survey], [tutorial], [reasoning], [planning], [memory], [multi-agent systems], [safety]
- 📖 TLDR: This tutorial provides a comprehensive exploration of language agents—autonomous systems powered by large language models capable of executing complex tasks through language instructions. It delves into their theoretical foundations, potential applications, associated risks, and future directions, covering topics such as reasoning, memory, planning, tool augmentation, grounding, multi-agent systems, and safety considerations.
-
AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?
- Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, Jonathan Berant
- 🏛️ Institutions: Tel Aviv University
- 📅 Date: October 21, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [benchmark], [dataset], [planning and reasoning]
- 📖 TLDR: AssistantBench is a benchmark designed to test the abilities of web agents in completing time-intensive, realistic web-based tasks. Covering 214 tasks across various domains, the benchmark introduces the SPA (See-Plan-Act) framework to handle multi-step planning and memory retention. AssistantBench emphasizes realistic task completion, showing that current agents achieve only modest success, with significant improvements needed for complex information synthesis and execution across multiple web domains.
-
Agent S: An Open Agentic Framework that Uses Computers Like a Human
- Saaket Agashe, Jiuzhou Han, Shuyu Gan, Jiachen Yang, Ang Li, Xin Eric Wang
- 🏛️ Institutions: Simular Research
- 📅 Date: October 10, 2024
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
- 🔑 Key: [framework], [autonomous GUI interaction], [experience-augmented hierarchical planning]
- 📖 TLDR: This paper introduces Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI). The system addresses key challenges in automating computer tasks through experience-augmented hierarchical planning and an Agent-Computer Interface (ACI). Agent S demonstrates significant improvements over baselines on the OSWorld benchmark, achieving a 20.58% success rate (83.6% relative improvement). The framework shows generalizability across different operating systems and provides insights for developing more effective GUI agents.
-
Dynamic Planning for LLM-based Graphical User Interface Automation
- Shaoqing Zhang, Zhuosheng Zhang, Kehai Chen, Xinbei Ma, Muyun Yang, Tiejun Zhao, Min Zhang
- 🏛️ Institutions: SJTU
- 📅 Date: October 1, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [framework], [dynamic planning]
- 📖 TLDR: This paper introduces a novel method called Dynamic Planning of Thoughts (D-PoT) aimed at enhancing LLM-based agents for GUI tasks. It addresses the challenges of task execution by dynamically adjusting the plan based on environmental feedback and action history, significantly improving accuracy over existing methods such as ReAct when navigating GUI environments. The study emphasizes the importance of integrating execution history and contextual cues to optimize decision-making for autonomous agents.
-
From Grounding to Planning: Benchmarking Bottlenecks in Web Agents
- Segev Shlomov, Ben Wiesel, Aviad Sela, Ido Levy, Liane Galanti, Roy Abitbol
- 🏛️ Institutions: IBM
- 📅 Date: September 3, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [benchmark], [planning], [grounding], [Mind2Web dataset], [web navigation]
- 📖 TLDR: This paper analyzes performance bottlenecks in web agents by separately evaluating grounding and planning tasks, isolating their individual impacts on navigation efficacy. Using an enhanced version of the Mind2Web dataset, the study reveals planning as a significant bottleneck, with advancements in grounding and task-specific benchmarking for elements like UI component recognition. Through experimental adjustments, the authors propose a refined evaluation framework, aiming to enhance web agents' contextual adaptability and accuracy in complex web environments.
-
WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks
- Léo Boisvert, Megh Thakkar, Maxime Gasse, Massimo Caccia, Thibault Le Sellier De Chezelles, Quentin Cappart, Nicolas Chapados, Alexandre Lacoste, Alexandre Drouin
- 🏛️ Institutions: ServiceNow Research, Mila, Polytechnique Montréal, Université de Montréal
- 📅 Date: July 7, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [benchmark], [planning], [reasoning], [WorkArena++]
- 📖 TLDR: This paper introduces WorkArena++, a benchmark comprising 682 tasks that simulate realistic workflows performed by knowledge workers. It evaluates web agents' capabilities in planning, problem-solving, logical/arithmetic reasoning, retrieval, and contextual understanding. The study reveals challenges faced by current large language models and vision-language models in serving as effective workplace assistants, providing a resource to advance autonomous agent development.
-
Octo-planner: On-device Language Model for Planner-Action Agents
- Nexa AI Team
- 🏛️ Institutions: Nexa AI
- 📅 Date: June 26, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Misc]
- 🔑 Key: [model], [framework], [Octo-planner], [on-device], [planning]
- 📖 TLDR: This paper presents Octo-planner, an on-device planning model designed for the Planner-Action Agents Framework. Octo-planner utilizes a fine-tuned model based on Phi-3 Mini (3.8 billion parameters) for high efficiency and low power consumption. It separates planning and action execution into two distinct components: a planner agent optimized for edge devices and an action agent using the Octopus model for function execution. The model achieves a planning success rate of 98.1% on benchmark datasets, providing reliable and effective performance.
-
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration
- Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang
- 🏛️ Institutions: Alibaba Group, Beijing University of Posts and Telecommunications
- 📅 Date: June 3, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [framework], [multi-agent], [planning], [decision-making], [reflection]
- 📖 TLDR: The paper presents Mobile-Agent-v2, a multi-agent architecture designed to assist with mobile device operations. It comprises three agents: a planning agent that generates task progress, a decision agent that navigates tasks using a memory unit, and a reflection agent that corrects erroneous operations. This collaborative approach addresses challenges in navigation and long-context input scenarios, achieving over a 30% improvement in task completion compared to single-agent architectures.
-
VideoGUI: A Benchmark for GUI Automation from Instructional Videos
- Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen Wu, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou
- 🏛️ Institutions: NUS, Microsoft Gen AI
- 📅 Date: June 2024
- 📑 Publisher: NeurIPS 2024
- 💻 Env: [Desktop, Web]
- 🔑 Key: [benchmark], [instructional videos], [visual planning], [hierarchical task decomposition], [complex software interaction]
- 📖 TLDR: VideoGUI presents a benchmark for evaluating GUI automation on tasks derived from instructional videos, focusing on visually intensive applications like Adobe Photoshop and video editing software. The benchmark includes 178 tasks, with a hierarchical evaluation method distinguishing high-level planning, mid-level procedural steps, and precise action execution. VideoGUI reveals current model limitations in complex visual tasks, marking a significant step toward improved visual planning in GUI automation.
-
On the Multi-turn Instruction Following for Conversational Web Agents
- Yang Deng, Xuan Zhang, Wenxuan Zhang, Yifei Yuan, See-Kiong Ng, Tat-Seng Chua
- 🏛️ Institutions: NUS, DAMO Academy, University of Copenhagen
- 📅 Date: February 23, 2024
- 📑 Publisher: ACL 2024
- 💻 Env: [Web]
- 🔑 Key: [benchmark], [dataset], [multi-turn dialogue], [memory utilization], [self-reflective planning]
- 📖 TLDR: This paper explores multi-turn conversational web navigation, introducing the MT-Mind2Web dataset to support instruction-following tasks for web agents. The proposed Self-MAP (Self-Reflective Memory-Augmented Planning) framework enhances agent performance by integrating memory with self-reflection for sequential decision-making in complex interactions. Extensive evaluations using MT-Mind2Web demonstrate Self-MAP's efficacy in addressing the limitations of current models in multi-turn interactions, providing a novel dataset and framework for evaluating and training agents on detailed, multi-step web-based tasks.
-
MindSearch: Mimicking Human Minds Elicits Deep AI Searcher
- Zehui Chen, Kuikun Liu, Qiuchen Wang, Jiangning Liu, Wenwei Zhang, Kai Chen, Feng Zhao
- 🏛️ Institutions: USTC, Shanghai AI Lab
- 📅 Date: July 29, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [framework], [information seeking], [planning], [AI search], [MindSearch]
- 📖 TLDR: This paper presents MindSearch, a novel approach to web information seeking and integration that mimics human cognitive processes. The system uses a multi-agent framework consisting of a WebPlanner and WebSearcher. The WebPlanner models multi-step information seeking as a dynamic graph construction process, decomposing complex queries into sub-questions. The WebSearcher performs hierarchical information retrieval for each sub-question. MindSearch demonstrates significant improvements in response quality and depth compared to existing AI search solutions, processing information from over 300 web pages in just 3 minutes.