Large Language Model-Brained GUI Agents: A Survey

    Published 12/4/2024 by Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, Guyue Liu, Qingwei Lin and 3 others

    Overview

    • Survey examining Large Language Models (LLMs) controlling graphical user interfaces
    • Focuses on agents that can autonomously operate desktop and mobile applications
    • Reviews challenges in developing LLM-powered GUI automation
    • Analyzes current approaches and future research directions
    • Evaluates real-world applications and limitations

    LLM-powered GUI agent orchestrates actions across applications.

    Original caption: Figure 1: Illustration of the high-level concept of an LLM-powered GUI agent. The agent receives a user’s natural language request and orchestrates actions seamlessly across multiple applications. It extracts information from Word documents, observes content in Photos, summarizes web pages in the browser, reads PDFs in Adobe Acrobat, and creates slides in PowerPoint before sending them through Teams.

    Alphabetical list of abbreviations.

    Acronym Explanation
    AI Artificial Intelligence
    AITW Android in the Wild
    AITZ Android in The Zoo
    API Application Programming Interface
    CLI Command-Line Interface
    CLIP Contrastive Language-Image Pre-Training
    CoT Chain-of-Thought
    CSS Cascading Style Sheets
    CuP Completion under Policy
    CV Computer Vision
    DOM Document Object Model
    DPO Direct Preference Optimization
    GCC General Computer Control
    GPT Generative Pre-trained Transformers
    GUI Graphical User Interface
    HCI Human-Computer Interaction
    HTML Hypertext Markup Language
    ICL In-Context Learning
    IoU Intersection over Union
    LAM Large Action Model
    LLM Large Language Model
    LSTM Long Short-Term Memory
    LTM Long-Term Memory
    MCTS Monte Carlo Tree Search
    MoE Mixture of Experts
    MDP Markov Decision Process
    MLLM Multimodal Large Language Model
    OCR Optical Character Recognition
    OS Operating System
    RAG Retrieval-Augmented Generation
    ReAct Reasoning and Acting
    RL Reinforcement Learning
    RLHF Reinforcement Learning from Human Feedback
    RNN Recurrent Neural Network
    RPA Robotic Process Automation
    UI User Interface
    VAB VisualAgentBench
    VLM Visual Language Models
    ViT Vision Transformer
    VQA Visual Question Answering
    SAM Segment Anything Model
    SoM Set-of-Mark
    STM Short-Term Memory

    Original caption: TABLE I: List of abbreviations in alphabetical order.

    Plain English Explanation

    Large language models are becoming capable of controlling computer interfaces just like humans do. Think of them as virtual assistants that can click buttons, type text, and navigate through apps on their own.

    This research examines how these AI agents learn to use computer interfaces. It's similar to teaching someone to use a new app: the AI needs to understand what it sees on screen and figure out the right actions to take.

    The technology combines computer vision to "see" the interface, language understanding to comprehend instructions, and decision-making to determine what actions to take. Like a new employee learning software through a training manual, these GUI agents learn from examples and documentation.

    Key Findings

    AI automation of graphical interfaces has made significant progress:

    • LLMs can successfully navigate common desktop and mobile applications
    • Current systems achieve 70-90% success rates on basic GUI tasks
    • Agents work best with clear visual interfaces and explicit instructions
    • Real-world applications include customer service, software testing, and workflow automation

    Technical Explanation

    The systems use a perception-action loop where the LLM:

    1. Receives visual information about the interface
    2. Processes natural language instructions
    3. Plans appropriate actions
    4. Executes commands through an interface controller
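
    The four steps above can be sketched as a small loop. This is only an illustration under stated assumptions, not the survey's implementation: `FakeScreen`, `plan`, and `run_agent` are hypothetical names, the "GUI" is a toy text field with a submit button, and a rule-based function stands in for the LLM call.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeScreen:
    """Hypothetical stand-in for a GUI: one text field and a submit button."""
    text: str = ""
    submitted: bool = False

    def observe(self) -> dict:
        # Step 1: a real agent would capture a screenshot or UI tree here.
        return {"text": self.text, "submitted": self.submitted}

    def execute(self, action: dict) -> None:
        # Step 4: the interface controller applies the chosen action.
        if action["type"] == "type":
            self.text = action["value"]
        elif action["type"] == "click" and action["target"] == "submit":
            self.submitted = True


def plan(instruction: str, observation: dict) -> Optional[dict]:
    """Steps 2-3: rule-based placeholder for the LLM's planning call."""
    if observation["submitted"]:
        return None  # nothing left to do
    if observation["text"] != instruction:
        return {"type": "type", "value": instruction}
    return {"type": "click", "target": "submit"}


def run_agent(instruction: str, screen: FakeScreen, max_steps: int = 10) -> List[dict]:
    """Repeat observe -> plan -> execute until the planner returns None."""
    history: List[dict] = []
    for _ in range(max_steps):
        observation = screen.observe()
        action = plan(instruction, observation)
        if action is None:
            break
        screen.execute(action)
        history.append(action)
    return history
```

    The `max_steps` cap is a design choice for the sketch: it keeps the loop from running forever if the planner never decides the task is done.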

    Key technical approaches include:

    • Screenshot analysis using computer vision
    • Semantic understanding of UI elements
    • Action planning through language models
    • Execution monitoring and error recovery
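
    Of these, execution monitoring and error recovery is the easiest to show in a few lines. The sketch below assumes a simple retry policy: perform the action, re-check the interface for the expected result, and try again on failure. `execute_with_recovery` and `FlakyButton` are hypothetical names introduced for illustration; systems covered by the survey may recover in other ways, such as replanning.

```python
from typing import Callable


def execute_with_recovery(
    act: Callable[[], None],
    succeeded: Callable[[], bool],
    max_attempts: int = 3,
) -> int:
    """Run `act`, verify with `succeeded`, and retry on failure.

    Returns the number of attempts used; raises RuntimeError if all fail.
    """
    for attempt in range(1, max_attempts + 1):
        act()
        if succeeded():  # monitoring: confirm the expected post-condition
            return attempt
    raise RuntimeError(f"action failed after {max_attempts} attempts")


class FlakyButton:
    """Hypothetical widget that ignores the first `misses` clicks."""

    def __init__(self, misses: int) -> None:
        self.misses = misses
        self.pressed = False

    def click(self) -> None:
        if self.misses > 0:
            self.misses -= 1  # click was lost, e.g. the UI was still loading
        else:
            self.pressed = True
```

    For example, `execute_with_recovery(button.click, lambda: button.pressed)` keeps clicking a `FlakyButton` until it registers or the attempt budget runs out.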

    Critical Analysis

    Several limitations exist in current approaches:

    • Difficulty handling dynamic or complex interfaces
    • Limited ability to recover from errors
    • Challenges with long-term planning and memory
    • Privacy and security concerns with automated interface control

    The field needs better ways to:

    • Handle edge cases and unexpected situations
    • Maintain context over extended interactions
    • Ensure safe and reliable operation
    • Protect user data and system security

    Conclusion

    LLM-based interface agents represent a significant step toward automated computer operation. While these agents show promise in controlled environments, challenges remain for widespread deployment. Future research must address reliability, safety, and ethical concerns while expanding capabilities.

    The technology could transform how humans interact with computers, potentially automating routine tasks and making software more accessible to non-technical users.

    Full paper

    Read original: arXiv:2411.18279



    This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

    Related Papers

    GUI Agents with Foundation Models: A Comprehensive Survey

    Shuai Wang, Weiwen Liu, Jingxuan Chen, Weinan Gan, Xingshan Zeng, Shuai Yu, Xinlong Hao, Kun Shao, Yasheng Wang, Ruiming Tang

    Recent advances in foundation models, particularly Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), facilitate intelligent agents being capable of performing complex tasks. By leveraging the ability of (M)LLMs to process and interpret Graphical User Interfaces (GUIs), these agents can autonomously execute user instructions by simulating human-like interactions such as clicking and typing. This survey consolidates recent research on (M)LLM-based GUI agents, highlighting key innovations in data, frameworks, and applications. We begin by discussing representative datasets and benchmarks. Next, we summarize a unified framework that captures the essential components used in prior research, accompanied by a taxonomy. Additionally, we explore commercial applications of (M)LLM-based GUI agents. Drawing from existing work, we identify several key challenges and propose future research directions. We hope this paper will inspire further developments in the field of (M)LLM-based GUI agents.

    11/8/2024

    A Survey on Large Language Model-Based Game Agents

    Sihao Hu, Tiansheng Huang, Fatih Ilhan, Selim Tekin, Gaowen Liu, Ramana Kompella, Ling Liu

    The development of game agents holds a critical role in advancing towards Artificial General Intelligence (AGI). The progress of LLMs and their multimodal counterparts (MLLMs) offers an unprecedented opportunity to evolve and empower game agents with human-like decision-making capabilities in complex computer game environments. This paper provides a comprehensive overview of LLM-based game agents from a holistic viewpoint. First, we introduce the conceptual architecture of LLM-based game agents, centered around six essential functional components: perception, memory, thinking, role-playing, action, and learning. Second, we survey existing representative LLM-based game agents documented in the literature with respect to methodologies and adaptation agility across six genres of games, including adventure, communication, competition, cooperation, simulation, and crafting & exploration games. Finally, we present an outlook of future research and development directions in this burgeoning field. A curated list of relevant papers is maintained and made accessible at: https://github.com/git-disl/awesome-LLM-game-agent-papers.

    4/3/2024

    A Survey on Large Language Model based Autonomous Agents

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, Ji-Rong Wen

    Autonomous agents have long been a prominent research focus in both academic and industry communities. Previous research in this field often focuses on training agents with limited knowledge within isolated environments, which diverges significantly from human learning processes, and thus makes the agents hard to achieve human-like decisions. Recently, through the acquisition of vast amounts of web knowledge, large language models (LLMs) have demonstrated remarkable potential in achieving human-level intelligence. This has sparked an upsurge in studies investigating LLM-based autonomous agents. In this paper, we present a comprehensive survey of these studies, delivering a systematic review of the field of LLM-based autonomous agents from a holistic perspective. More specifically, we first discuss the construction of LLM-based autonomous agents, for which we propose a unified framework that encompasses a majority of the previous work. Then, we present a comprehensive overview of the diverse applications of LLM-based autonomous agents in the fields of social science, natural science, and engineering. Finally, we delve into the evaluation strategies commonly used for LLM-based autonomous agents. Based on the previous studies, we also present several challenges and future directions in this field. To keep track of this field and continuously update our survey, we maintain a repository of relevant references at https://github.com/Paitesanshi/LLM-Agent-Survey.

    4/5/2024

    Human-Centered LLM-Agent User Interface: A Position Paper

    Daniel Chin, Yuxuan Wang, Gus Xia

    Large Language Model (LLM) -in-the-loop applications have been shown to effectively interpret the human user's commands, make plans, and operate external tools/systems accordingly. Still, the operation scope of the LLM agent is limited to passively following the user, requiring the user to frame his/her needs with regard to the underlying tools/systems. We note that the potential of an LLM-Agent User Interface (LAUI) is much greater. A user mostly ignorant to the underlying tools/systems should be able to work with a LAUI to discover an emergent workflow. Contrary to the conventional way of designing an explorable GUI to teach the user a predefined set of ways to use the system, in the ideal LAUI, the LLM agent is initialized to be proficient with the system, proactively studies the user and his/her needs, and proposes new interaction schemes to the user. To illustrate LAUI, we present Flute X GPT, a concrete example using an LLM agent, a prompt manager, and a flute-tutoring multi-modal software-hardware system to facilitate the complex, real-time user experience of learning to play the flute.

    9/24/2024