GUing: A Mobile GUI Search Engine using a Vision-Language Model

Read original: arXiv:2405.00145 - Published 9/4/2024 by Jialiang Wei, Anne-Lise Courbis, Thomas Lambolais, Binbin Xu, Pierre Louis Bernard, G'erard Dray, Walid Maalej
Total Score

0

GUing: A Mobile GUI Search Engine using a Vision-Language Model

Sign in to get full access

or

If you already have an account, we'll log you in

Overview

• This paper introduces GUing, a mobile GUI search engine that uses a vision-language model to enable users to search for and interact with GUI elements on their mobile devices.

• The key contributions of the paper include a dataset of mobile GUI screenshots and annotations, as well as a novel vision-language model architecture that can understand and reason about GUI layouts.

Plain English Explanation

GUing is a tool that allows you to search for and interact with the different parts of the graphical user interface (GUI) on your mobile device.

• To build GUing, the researchers created a dataset of mobile app screenshots and annotated the various GUI elements in those images, like buttons, menus, and icons.

• They then developed a machine learning model that can "understand" the layout and structure of these GUI elements by looking at the images and learning the relationships between them.

• This allows users to search for specific GUI elements by describing them in natural language, like "the button to share this article", and the model can then find and highlight that element on the screen.

• The researchers believe this technology could be useful for helping users navigate and interact with complex mobile app interfaces more effectively.

Technical Explanation

• The researchers created a dataset of over 100,000 mobile app screenshots, each annotated with bounding boxes and textual descriptions for the various GUI elements present.

• They then used this dataset to train a vision-language model that can understand the composition and semantics of GUI layouts.

• This model is based on a graph neural network architecture that can capture the relationships between different GUI components.

• To enable natural language interaction, the researchers integrated this vision model with a language model trained on a large corpus of text.

• They demonstrated the capabilities of GUing through experiments on image retrieval and GUI element localization tasks.

Critical Analysis

• The researchers acknowledge that their dataset, while large, may not be fully representative of the diversity of mobile app interfaces in the real world.

• They also note that the performance of the vision-language model could be further improved by incorporating additional modalities, such as interaction logs or screen recordings.

• While the experiments demonstrate the potential of GUing, the researchers do not address potential privacy and security concerns that may arise from a system that can deeply analyze and interact with a user's mobile interface.

• Additional research is needed to understand the long-term implications of such a system and how it could be deployed responsibly to benefit users without compromising their privacy or autonomy.

Conclusion

GUing introduces a novel approach to mobile GUI search and interaction using advanced vision-language models, addressing the challenge of navigating complex app interfaces.

• The creation of a large, annotated dataset of mobile GUI screenshots and the development of a graph-based vision-language model are significant contributions to the field of human-computer interaction.

• While further research is needed to address the limitations and potential concerns, GUing showcases the potential for AI-powered tools to enhance the user experience and accessibility of mobile devices.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

GUing: A Mobile GUI Search Engine using a Vision-Language Model
Total Score

0

GUing: A Mobile GUI Search Engine using a Vision-Language Model

Jialiang Wei, Anne-Lise Courbis, Thomas Lambolais, Binbin Xu, Pierre Louis Bernard, G'erard Dray, Walid Maalej

App developers use the Graphical User Interface (GUI) of other apps as a source of inspiration for designing and improving their own apps. Recent research has thus suggested retrieving relevant GUI designs that match a certain text query from screenshot datasets acquired through crowdsourced or automated exploration of GUIs. However, such text-to-GUI retrieval approaches only leverage the textual information of the GUI elements, neglecting visual information such as icons or background images. In addition, retrieved screenshots are not steered by app developers and often lack important app features that require particular input data. To overcome these limitations, this paper proposes GUing, a GUI search engine based on a vision-language model called GUIClip, which we trained specifically for the problem of designing app GUIs. For this, we first collected from Google Play app introduction images which usually display the most representative screenshots and are often captioned (i.e.~labeled) by app vendors. Then, we developed an automated pipeline to classify, crop, and extract the captions from these images. This resulted in a large dataset which we share with this paper: including 303k app screenshots, out of which 135k have captions. We used this dataset to train a novel vision-language model, which is, to the best of our knowledge, the first of its kind in GUI retrieval. We evaluated our approach on various datasets from related work and in manual experiment. The results demonstrate that our model outperforms previous approaches in text-to-GUI retrieval achieving a Recall@10 of up to 0.69 and a HIT@10 of 0.91. We also explored the performance of GUIClip for other GUI tasks including GUI classification and sketch-to-GUI retrieval with encouraging results.

Read more

9/4/2024

GUICourse: From General Vision Language Models to Versatile GUI Agents
Total Score

0

GUICourse: From General Vision Language Models to Versatile GUI Agents

Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, Yuan Yao, Yankai Lin, Zhiyuan Liu, Maosong Sun

Utilizing Graphic User Interface (GUI) for human-computer interaction is essential for accessing a wide range of digital tools. Recent advancements in Vision Language Models (VLMs) highlight the compelling potential to develop versatile agents to help humans finish GUI navigation tasks. However, current VLMs are challenged in terms of fundamental abilities (OCR and grounding) and GUI knowledge (the functions and control methods of GUI elements), preventing them from becoming practical GUI agents. To solve these challenges, we contribute GUICourse, a suite of datasets to train visual-based GUI agents from general VLMs. First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs. Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions. Experiments demonstrate that our GUI agents have better performance on common GUI tasks than their baseline VLMs. Even the small-size GUI agent (with 3.1B parameters) can still work well on single-step and multi-step GUI tasks. Finally, we analyze the different varieties in the training stage of this agent by ablation study. Our source codes and datasets are released at https://github.com/yiye3/GUICourse.

Read more

6/18/2024

UIClip: A Data-driven Model for Assessing User Interface Design
Total Score

0

UIClip: A Data-driven Model for Assessing User Interface Design

Jason Wu, Yi-Hao Peng, Amanda Li, Amanda Swearngin, Jeffrey P. Bigham, Jeffrey Nichols

User interface (UI) design is a difficult yet important task for ensuring the usability, accessibility, and aesthetic qualities of applications. In our paper, we develop a machine-learned model, UIClip, for assessing the design quality and visual relevance of a UI given its screenshot and natural language description. To train UIClip, we used a combination of automated crawling, synthetic augmentation, and human ratings to construct a large-scale dataset of UIs, collated by description and ranked by design quality. Through training on the dataset, UIClip implicitly learns properties of good and bad designs by i) assigning a numerical score that represents a UI design's relevance and quality and ii) providing design suggestions. In an evaluation that compared the outputs of UIClip and other baselines to UIs rated by 12 human designers, we found that UIClip achieved the highest agreement with ground-truth rankings. Finally, we present three example applications that demonstrate how UIClip can facilitate downstream applications that rely on instantaneous assessment of UI design quality: i) UI code generation, ii) UI design tips generation, and iii) quality-aware UI example search.

Read more

4/24/2024

On AI-Inspired UI-Design
Total Score

0

On AI-Inspired UI-Design

Jialiang Wei, Anne-Lise Courbis, Thomas Lambolais, G'erard Dray, Walid Maalej

Graphical User Interface (or simply UI) is a primary mean of interaction between users and their device. In this paper, we discuss three major complementary approaches on how to use Artificial Intelligence (AI) to support app designers create better, more diverse, and creative UI of mobile apps. First, designers can prompt a Large Language Model (LLM) like GPT to directly generate and adjust one or multiple UIs. Second, a Vision-Language Model (VLM) enables designers to effectively search a large screenshot dataset, e.g. from apps published in app stores. The third approach is to train a Diffusion Model (DM) specifically designed to generate app UIs as inspirational images. We discuss how AI should be used, in general, to inspire and assist creative app design rather than automating it.

Read more

6/21/2024