Large Language Model-Brained GUI Agents: A Survey
Overview
- Survey examining Large Language Models (LLMs) controlling graphical user interfaces
- Focuses on agents that can autonomously operate desktop and mobile applications
- Reviews challenges in developing LLM-powered GUI automation
- Analyzes current approaches and future research directions
- Evaluates real-world applications and limitations
LLM-powered GUI agent orchestrates actions across applications.
Original caption: Figure 1: Illustration of the high-level concept of an LLM-powered GUI agent. The agent receives a user’s natural language request and orchestrates actions seamlessly across multiple applications. It extracts information from Word documents, observes content in Photos, summarizes web pages in the browser, reads PDFs in Adobe Acrobat, and creates slides in PowerPoint before sending them through Teams.
Original caption: Figure 3: An overview of GUI agents evolution over years.
Original caption: Figure 4: An overview of the architecture and workflow of a basic LLM-powered GUI agent.
Alphabetical list of abbreviations.
Acronym | Explanation |
---|---|
AI | Artificial Intelligence |
AITW | Android in the Wild |
AITZ | Android in The Zoo |
API | Application Programming Interface |
CLI | Command-Line Interface |
CLIP | Contrastive Language-Image Pre-Training |
CoT | Chain-of-Thought |
CSS | Cascading Style Sheets |
CuP | Completion under Policy |
CV | Computer Vision |
DOM | Document Object Model |
DPO | Direct Preference Optimization |
GCC | General Computer Control |
GPT | Generative Pre-trained Transformers |
GUI | Graphical User Interface |
HCI | Human-Computer Interaction |
HTML | Hypertext Markup Language |
ICL | In-Context Learning |
IoU | Intersection over Union |
LAM | Large Action Model |
LLM | Large Language Model |
LSTM | Long Short-Term Memory |
LTM | Long-Term Memory |
MCTS | Monte Carlo Tree Search |
MoE | Mixture of Experts |
MDP | Markov Decision Process |
MLLM | Multimodal Large Language Model |
OCR | Optical Character Recognition |
OS | Operating System |
RAG | Retrieval-Augmented Generation |
ReAct | Reasoning and Acting |
RL | Reinforcement Learning |
RLHF | Reinforcement Learning from Human Feedback |
RNN | Recurrent Neural Network |
RPA | Robotic Process Automation |
UI | User Interface |
VAB | VisualAgentBench |
VLM | Visual Language Models |
ViT | Vision Transformer |
VQA | Visual Question Answering |
SAM | Segment Anything Model |
SoM | Set-of-Mark |
STM | Short-Term Memory |
Original caption: TABLE I: List of abbreviations in alphabetical order.
Survey | One Sentence Summary | GUI | Automation | LLM Agent + GUI Automation |
---|---|---|---|---|
Li et al., [25] | A book on how to develop an automated GUI testing tool. | ✓ | | |
Rodríguez et al., [26] | A survey on automated GUI testing in 30 years. | ✓ | | |
Arnatovich et al., [27] | A survey on automated techniques for mobile functional GUI testing. | ✓ | | |
Ivančić et al., [6] | A literature review on RPA. | ✓ | | |
Original caption: TABLE II: Summary of representative surveys and books on GUI automation and LLM agents. A ✓ symbol indicates that a publication explicitly addresses a given domain, while a ○ symbol signifies that the publication does not focus on the area but offers relevant insights. Publications covering both GUI automation and LLM agents are highlighted for emphasis.
Plain English Explanation
Large language models are becoming capable of controlling computer interfaces just like humans do. Think of them as virtual assistants that can click buttons, type text, and navigate through apps on their own.
This research examines how these AI agents learn to use computer interfaces. It's similar to teaching someone to use a new app - the AI needs to understand what it sees on screen and figure out the right actions to take.
The technology combines computer vision to "see" the interface, language understanding to comprehend instructions, and decision-making to determine what actions to take. Like a new employee learning software through a training manual, these GUI agents learn from examples and documentation.
Key Findings
AI automation of graphical interfaces has made significant progress:
- LLMs can successfully navigate common desktop and mobile applications
- Current systems achieve 70-90% success rates on basic GUI tasks
- Agents work best with clear visual interfaces and explicit instructions
- Real-world applications include customer service, software testing, and workflow automation
Technical Explanation
The systems use a perception-action loop where the LLM:
- Receives visual information about the interface
- Processes natural language instructions
- Plans appropriate actions
- Executes commands through an interface controller
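The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the survey's implementation: `FakeGUI`, `scripted_planner`, and `run_agent` are hypothetical names, the observation is a plain string rather than a screenshot, and the planner is a hard-coded stand-in for the LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Action:
    kind: str          # "type", "click", or "done"
    target: str = ""   # name of the UI element acted on
    text: str = ""     # text to enter, for "type" actions


@dataclass
class FakeGUI:
    """Toy stand-in for a real interface controller (hypothetical)."""
    fields: Dict[str, str] = field(default_factory=dict)
    submitted: bool = False

    def observe(self) -> str:
        # A real agent would return a screenshot or a UI tree here.
        return f"fields={self.fields} submitted={self.submitted}"

    def execute(self, action: Action) -> None:
        if action.kind == "type":
            self.fields[action.target] = action.text
        elif action.kind == "click" and action.target == "submit":
            self.submitted = True


def scripted_planner(instruction: str, observation: str) -> Action:
    """Placeholder for the LLM: chooses the next action from the observation."""
    if "'name'" not in observation:
        return Action("type", target="name", text="Ada")
    if "submitted=False" in observation:
        return Action("click", target="submit")
    return Action("done")


def run_agent(
    instruction: str,
    gui: FakeGUI,
    planner: Callable[[str, str], Action],
    max_steps: int = 10,
) -> List[Action]:
    """Perceive, plan, act; stop when the planner says the task is done."""
    history: List[Action] = []
    for _ in range(max_steps):
        observation = gui.observe()                  # receive interface state
        action = planner(instruction, observation)   # plan the next action
        if action.kind == "done":
            break
        gui.execute(action)                          # execute via the controller
        history.append(action)
    return history


gui = FakeGUI()
history = run_agent("Fill in the name and submit the form", gui, scripted_planner)
print([a.kind for a in history], gui.submitted)
```

The `max_steps` cap is a design choice of this sketch: it keeps a confused planner from looping forever.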
Key technical approaches include:
- Screenshot analysis using computer vision
- Semantic understanding of UI elements
- Action planning through language models
- Execution monitoring and error recovery
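Execution monitoring and error recovery, the last item above, can be illustrated with a simple verify-and-retry wrapper. This is only one possible strategy, assumed here for illustration. `execute_with_recovery` and `FlakyButton` are hypothetical names, and a real agent might re-plan or ask the user instead of simply retrying.

```python
from typing import Callable


def execute_with_recovery(
    execute: Callable[[], None],
    verify: Callable[[], bool],
    max_attempts: int = 3,
) -> int:
    """Run an action, check its effect, and retry if the check fails.

    Returns the number of attempts used. Raises RuntimeError if the
    expected effect is never observed within max_attempts.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            execute()
        except Exception:
            continue  # count a crashed action as a failed attempt
        if verify():  # monitoring step: did the UI reach the expected state?
            return attempt
    raise RuntimeError(f"action not confirmed after {max_attempts} attempts")


class FlakyButton:
    """Toy UI element that silently drops the first few clicks (hypothetical)."""

    def __init__(self, dropped_clicks: int = 1) -> None:
        self.dropped_clicks = dropped_clicks
        self.clicked = False

    def click(self) -> None:
        if self.dropped_clicks > 0:
            self.dropped_clicks -= 1
            return  # the click is lost, as can happen on a busy interface
        self.clicked = True


button = FlakyButton(dropped_clicks=1)
attempts = execute_with_recovery(button.click, lambda: button.clicked)
print(attempts)
```

Separating `execute` from `verify` means the agent confirms an action by observing the interface afterwards, not by assuming the command took effect.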
Critical Analysis
Several limitations exist in current approaches:
- Difficulty handling dynamic or complex interfaces
- Limited ability to recover from errors
- Challenges with long-term planning and memory
- Privacy and security concerns with automated interface control
The field needs better ways to:
- Handle edge cases and unexpected situations
- Maintain context over extended interactions
- Ensure safe and reliable operation
- Protect user data and system security
Conclusion
LLM-based interface agents represent a significant step toward automated computer operation. While showing promise in controlled environments, challenges remain for widespread deployment. Future research must address reliability, safety, and ethical concerns while expanding capabilities.
The technology could transform how humans interact with computers, potentially automating routine tasks and making software more accessible to non-technical users.
Related Papers
GUI Agents with Foundation Models: A Comprehensive Survey
Shuai Wang, Weiwen Liu, Jingxuan Chen, Weinan Gan, Xingshan Zeng, Shuai Yu, Xinlong Hao, Kun Shao, Yasheng Wang, Ruiming Tang
Recent advances in foundation models, particularly Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), facilitate intelligent agents being capable of performing complex tasks. By leveraging the ability of (M)LLMs to process and interpret Graphical User Interfaces (GUIs), these agents can autonomously execute user instructions by simulating human-like interactions such as clicking and typing. This survey consolidates recent research on (M)LLM-based GUI agents, highlighting key innovations in data, frameworks, and applications. We begin by discussing representative datasets and benchmarks. Next, we summarize a unified framework that captures the essential components used in prior research, accompanied by a taxonomy. Additionally, we explore commercial applications of (M)LLM-based GUI agents. Drawing from existing work, we identify several key challenges and propose future research directions. We hope this paper will inspire further developments in the field of (M)LLM-based GUI agents.
11/8/2024
A Survey on Large Language Model-Based Game Agents
Sihao Hu, Tiansheng Huang, Fatih Ilhan, Selim Tekin, Gaowen Liu, Ramana Kompella, Ling Liu
The development of game agents holds a critical role in advancing towards Artificial General Intelligence (AGI). The progress of LLMs and their multimodal counterparts (MLLMs) offers an unprecedented opportunity to evolve and empower game agents with human-like decision-making capabilities in complex computer game environments. This paper provides a comprehensive overview of LLM-based game agents from a holistic viewpoint. First, we introduce the conceptual architecture of LLM-based game agents, centered around six essential functional components: perception, memory, thinking, role-playing, action, and learning. Second, we survey existing representative LLM-based game agents documented in the literature with respect to methodologies and adaptation agility across six genres of games, including adventure, communication, competition, cooperation, simulation, and crafting & exploration games. Finally, we present an outlook of future research and development directions in this burgeoning field. A curated list of relevant papers is maintained and made accessible at: https://github.com/git-disl/awesome-LLM-game-agent-papers.
4/3/2024
A Survey on Large Language Model based Autonomous Agents
Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, Ji-Rong Wen
Autonomous agents have long been a prominent research focus in both academic and industry communities. Previous research in this field often focuses on training agents with limited knowledge within isolated environments, which diverges significantly from human learning processes, and thus makes the agents hard to achieve human-like decisions. Recently, through the acquisition of vast amounts of web knowledge, large language models (LLMs) have demonstrated remarkable potential in achieving human-level intelligence. This has sparked an upsurge in studies investigating LLM-based autonomous agents. In this paper, we present a comprehensive survey of these studies, delivering a systematic review of the field of LLM-based autonomous agents from a holistic perspective. More specifically, we first discuss the construction of LLM-based autonomous agents, for which we propose a unified framework that encompasses a majority of the previous work. Then, we present a comprehensive overview of the diverse applications of LLM-based autonomous agents in the fields of social science, natural science, and engineering. Finally, we delve into the evaluation strategies commonly used for LLM-based autonomous agents. Based on the previous studies, we also present several challenges and future directions in this field. To keep track of this field and continuously update our survey, we maintain a repository of relevant references at https://github.com/Paitesanshi/LLM-Agent-Survey.
4/5/2024
Human-Centered LLM-Agent User Interface: A Position Paper
Daniel Chin, Yuxuan Wang, Gus Xia
Large Language Model (LLM) -in-the-loop applications have been shown to effectively interpret the human user's commands, make plans, and operate external tools/systems accordingly. Still, the operation scope of the LLM agent is limited to passively following the user, requiring the user to frame his/her needs with regard to the underlying tools/systems. We note that the potential of an LLM-Agent User Interface (LAUI) is much greater. A user mostly ignorant to the underlying tools/systems should be able to work with a LAUI to discover an emergent workflow. Contrary to the conventional way of designing an explorable GUI to teach the user a predefined set of ways to use the system, in the ideal LAUI, the LLM agent is initialized to be proficient with the system, proactively studies the user and his/her needs, and proposes new interaction schemes to the user. To illustrate LAUI, we present Flute X GPT, a concrete example using an LLM agent, a prompt manager, and a flute-tutoring multi-modal software-hardware system to facilitate the complex, real-time user experience of learning to play the flute.
9/24/2024