Reference resolution is an important problem, one that is essential to understand and successfully handle context of different kinds. This context includes both previous turns and context that pertains to non-conversational entities, such as entities on the user's screen or those running in the background. While LLMs have been shown to be extremely powerful for a variety of tasks, their use in reference resolution, particularly for non-conversational entities, remains underutilized. This paper demonstrates how LLMs can be used to create an extremely effective system to resolve references of various types, by showing how reference resolution can be converted into a language modeling problem, despite involving forms of entities like those on screen that are not traditionally conducive to being reduced to a text-only modality. We demonstrate large improvements over an existing system with similar functionality across different types of references, with our smallest model obtaining absolute gains of over 5% for on-screen references. We also benchmark against GPT-3.5 and GPT-4, with our smallest model achieving performance comparable to that of GPT-4, and our larger models substantially outperforming it.

## Introduction

The paper discusses the importance of understanding context, including ambiguous references, for conversational assistants to communicate effectively with users. While recent large language models (LLMs) can handle some contextual understanding, the authors argue for the continued value of traditional NLP pipelines in certain scenarios.

Specifically, they highlight four cases where a pipeline approach may be preferable: 1) on low-power devices with limited computing resources, large end-to-end models may be infeasible; 2) where a system must integrate with existing APIs or components, overhauling it around a single LLM may be cumbersome; 3) a modular design allows an improved reference resolution module to be swapped in transparently; and 4) reference resolution needs to handle not just conversational context but also on-screen context.

The authors propose using smaller, fine-tuned language models specifically for reference resolution. To handle on-screen context, they suggest reconstructing the screen as a textual representation with entities tagged, allowing the language model to "see" and resolve references to on-screen elements. This is presented as a novel approach for enabling LLMs to understand on-screen context.

## Related Work and Motivation

The paper discusses the need for conversational agents to understand on-screen references, which differ from visual and deictic references. On-screen references tend to be more structured and textual, enabling a text-only approach without visual components. They are also often action-oriented rather than question-answering based, and they involve synthetic screens rather than natural images.

The paper notes that jointly handling conversational and on-screen references has been relatively unexplored. While vision transformers and pre-trained models have gained prominence for visual understanding tasks, they are trained on natural images rather than screenshots, have different distributions, and can be computationally expensive.

The paper identifies limitations in existing approaches, such as relying on manually onboarding new entity types, treating each type distinctly without leveraging similarities, using hand-crafted rules and heuristics that lack robustness and semantic understanding, and classifying entity relevance independently without considering the whole screen context.

## Task

The task involves identifying relevant entities from 3 types: on-screen entities currently displayed, conversational entities mentioned previously, and background entities from processes not directly visible. The goal is to extract the entities pertinent to the user's current query. The task is framed as a multiple choice problem, where the model outputs the relevant entities from the options shown on the user's screen, or "None of these" if applicable. During evaluation, any permutation of the correct entities is accepted as a valid answer.
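The order-insensitive evaluation described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name `is_correct` and the use of integer entity indices (with an empty list standing for "None of these") are assumptions made for the example.

```python
from collections import Counter


def is_correct(predicted: list[int], gold: list[int]) -> bool:
    """Return True when the prediction names exactly the gold entities.

    Assumption: entities are identified by integer indices, and an empty
    list stands for the "None of these" option. Order is ignored, so any
    permutation of the gold entities counts as correct.
    """
    return Counter(predicted) == Counter(gold)


print(is_correct([2, 1], [1, 2]))  # same entities, different order
print(is_correct([1], [1, 2]))     # one gold entity is missing
```

Comparing multisets rather than lists is what makes a permuted answer acceptable while still rejecting partial or extra selections.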

## Datasets

The datasets used contain user queries and a list of entities, along with the ground-truth relevant entity(ies) for each query. Each entity has information like type, name, and other details. If on-screen context exists, the entity's bounding box and surrounding objects/properties are included.

For conversational data, annotators were shown synthetic entity lists and asked to provide queries that unambiguously reference a chosen entity. For example, referring to a specific business by saying "Take me to the one second from the bottom."

Synthetic data was generated from templates - one with mentions, entities, slot values, and another with query variations for the entity references defined in the first template. Queries were created by substituting the references.
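A rough sketch of this two-template substitution is shown below. The template contents, the `{mention}` placeholder, and the field names are all hypothetical; the paper's actual templates are not reproduced here.

```python
from itertools import product

# Hypothetical first template: an entity type with its mentions and slot values.
ENTITY_TEMPLATE = {
    "type": "phone_number",
    "mentions": ["the number", "that phone number"],
    "slot_values": ["555-0100", "555-0199"],
}

# Hypothetical second template: query variations with a reference placeholder.
QUERY_TEMPLATES = ["Call {mention}", "Save {mention} to my contacts"]


def generate_examples(entity_template: dict, query_templates: list[str]) -> list[dict]:
    """Create one example per (query variation, mention, slot value) combination."""
    examples = []
    for query, mention, value in product(
        query_templates, entity_template["mentions"], entity_template["slot_values"]
    ):
        examples.append(
            {
                # Substitute the reference into the query variation.
                "query": query.format(mention=mention),
                # The entity carrying this slot value is the ground truth.
                "gold_entity": {"type": entity_template["type"], "value": value},
            }
        )
    return examples


examples = generate_examples(ENTITY_TEMPLATE, QUERY_TEMPLATES)
print(len(examples), examples[0]["query"])
```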

For on-screen data, annotators first classified displayed information into entity types like phone numbers and emails, then provided queries for that information. In a second phase, other annotators identified which entity in the list was referenced by each query.

## Models

The paper compares the proposed model, ReALM, with two baseline approaches: a re-implementation of the reference resolver from a previous paper (MARRS), and ChatGPT using GPT-3.5 and GPT-4.

For the MARRS baseline, they trained a re-implementation of the system proposed in a previous paper, which is not based on large language models. This baseline was specifically designed for reference resolution.

For the ChatGPT baseline, they used GPT-3.5 and GPT-4. GPT-3.5 received only the text prompt, while GPT-4, which accepts images, received the prompt together with a screenshot for on-screen reference resolution.

Their approach, ReALM, involves fine-tuning a FLAN-T5 language model. They convert the data into a sentence format to feed to the model, shuffling the entities to prevent overfitting to positions. For conversational references, they assume two types: type-based (relying on entity types) and descriptive (relying on properties). Their encoding captures both types and properties.
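The conversion to a sentence-style input with shuffled entities might look like the sketch below. The prompt wording, the `type`/`name` fields, and the function `build_prompt` are assumptions for illustration only; the paper's exact input format is not shown in this summary.

```python
import random


def build_prompt(query: str, entities: list[dict], gold_ids: set[int], rng: random.Random):
    """Serialize entities as numbered text lines in a shuffled order.

    gold_ids holds 0-based positions in `entities`. The returned gold list
    holds the 1-based numbers the entities receive after shuffling, so the
    training target never depends on the original ordering.
    """
    order = list(range(len(entities)))
    rng.shuffle(order)  # shuffle to discourage overfitting to positions

    lines, new_gold = [], []
    for new_idx, old_idx in enumerate(order, start=1):
        ent = entities[old_idx]
        # Hypothetical encoding that exposes both the type and the properties.
        lines.append(f"{new_idx}. type: {ent['type']} | name: {ent['name']}")
        if old_idx in gold_ids:
            new_gold.append(new_idx)

    prompt = "Entities:\n" + "\n".join(lines) + f"\nUser request: {query}"
    return prompt, sorted(new_gold)


entities = [
    {"type": "business", "name": "Cafe Aurora"},
    {"type": "business", "name": "Bistro Nine"},
    {"type": "phone_number", "name": "555-0100"},
]
prompt, gold = build_prompt("call the bistro", entities, {1}, random.Random(0))
print(prompt)
print(gold)
```

Including the type and the name in each line lets the same encoding serve both type-based references ("call the number") and descriptive ones ("the bistro").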

For on-screen references, they assume upstream detectors can parse screen text and extract entities with types, bounding boxes, and surrounding text. They use a novel algorithm to encode the screen layout as text based on sorting bounding box centers top-to-bottom and left-to-right with line breaks.
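A minimal sketch of such a layout encoding follows, assuming each detected element carries its text and a bounding box. The `ScreenElement` class, the `margin` default, and the rule of comparing each element with the previous one on the same line are illustrative assumptions rather than the paper's exact algorithm.

```python
from dataclasses import dataclass


@dataclass
class ScreenElement:
    text: str
    left: float
    top: float
    width: float
    height: float

    @property
    def center(self) -> tuple[float, float]:
        """Return the (x, y) center of the bounding box."""
        return (self.left + self.width / 2, self.top + self.height / 2)


def encode_screen(elements: list[ScreenElement], margin: float = 10.0) -> str:
    """Render elements as text, top to bottom and left to right.

    Elements whose vertical centers lie within `margin` of the previous
    element are placed on the same text line (an assumed grouping rule).
    Elements on one line are joined by tabs; lines are joined by newlines.
    """
    ordered = sorted(elements, key=lambda e: (e.center[1], e.center[0]))
    lines: list[list[ScreenElement]] = []
    for el in ordered:
        if lines and abs(el.center[1] - lines[-1][-1].center[1]) <= margin:
            lines[-1].append(el)
        else:
            lines.append([el])
    return "\n".join(
        "\t".join(e.text for e in sorted(line, key=lambda e: e.center[0]))
        for line in lines
    )


screen = [
    ScreenElement("555-0100", left=0, top=40, width=100, height=20),
    ScreenElement("Call", left=200, top=2, width=50, height=20),
    ScreenElement("Contact", left=0, top=0, width=100, height=20),
]
print(encode_screen(screen))
```

In this example "Contact" and "Call" have nearly the same vertical center, so they share a line, while the phone number lands on the next line.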

## Results

The experimental results indicate that the proposed approach outperforms the MARRS model across all types of datasets. It also surpasses GPT-3.5, which has a significantly larger number of parameters, and performs comparably to GPT-4 despite being a much lighter and faster model. The gains are particularly notable on the on-screen datasets, where the model with textual encoding performs almost as well as GPT-4, even though the latter is provided with screenshots.

The authors also experiment with models of different sizes, observing that performance improves across all datasets as model size increases, but the difference is most pronounced for the complex on-screen datasets.

In an analysis section, the paper explores the zero-shot performance of the model on an unseen domain, Alarms. The proposed approach and GPT-4 perform similarly well on this unseen test set, outperforming a fine-tuned model.

The paper also highlights that the model, ReALM, demonstrates superior understanding of domain-specific queries compared to GPT-4 due to fine-tuning on user requests. An example illustrates GPT-4 incorrectly assuming a reference is only about a setting, while the ground truth includes a home automation device, which ReALM can recognize due to its domain-specific training.

## Conclusion and Future Work

The paper demonstrates how large language models can be used for reference resolution by encoding entities as natural text. A novel approach represents on-screen entities and their relative positions in a textual format, which is then passed to the language model. This method, called ReALM, outperforms previous approaches and performs comparably to GPT-4, despite having fewer parameters. ReALM even surpasses GPT-4 for domain-specific user utterances, making it an ideal choice for practical reference resolution systems that can run on-device without compromising performance.

However, while ReALM effectively encodes the position of entities, it may lose nuanced positional information required for complex user queries. The authors suggest exploring more complex approaches, such as dividing the screen into a grid and encoding relative spatial positions into text, as a promising avenue for future research.

## Ethics Statement

The system allows for constraining the language model's output or applying post-processing to prevent unexpected generations. However, the authors state that in practice, they encounter very little hallucination or fabricated content from the language model. As a result, they do not constrain the model's decoding or generation process.
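As an illustration of the kind of post-processing mentioned above, the sketch below keeps only entity numbers that actually exist in the candidate list. The output format (numbers in free text, or the literal "None of these") and the function `postprocess` are assumptions, not the authors' implementation.

```python
import re


def postprocess(output: str, num_entities: int) -> list[int]:
    """Keep only valid, de-duplicated entity numbers from a model output.

    Assumption: the model answers with 1-based entity numbers, or with
    "None of these" when no entity is relevant.
    """
    if output.strip().lower() == "none of these":
        return []
    kept: list[int] = []
    for token in re.findall(r"\d+", output):
        idx = int(token)
        # Drop numbers that do not correspond to any candidate entity.
        if 1 <= idx <= num_entities and idx not in kept:
            kept.append(idx)
    return kept


print(postprocess("2, 7, 2", num_entities=3))
```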

## Acknowledgements

The authors express gratitude to Stephen Pulman, Leon Liyang Zhang, Jiarui Lu, Jeff Nichols, Shruti Bhargava, Dhivya Piraviperumal, and Junhan Chen for their assistance and feedback throughout the research process.

## Appendix A Encoding onscreen entities

The paper presents visual examples of how screen grabs might appear when parsed and processed by the system. These sample representations are displayed in Figure 2 of the paper.

![(a) Onscreen Capture 1](https://arxiv.org/html/2403.20329v1/x1.png)

*(a) Onscreen Capture 1*

The paper explores different strategies for encoding on-screen elements. 

1. Clustering: Objects on the screen are spatially clustered into semantic groups. Users can refer to nearby bounding boxes by a descriptive title. However, as the number of entities in a cluster increases, the prompt length explodes since each object lists all other objects as surrounding entities.

2. Onscreen Grab: The screen is parsed, but turn objects are provided as a separate list instead of being annotated within the parse.

3. Onscreen Grab with Injected Turn Objects: This is the final approach used. The screen is parsed, and turn objects are annotated within the parse itself.

The paper presents an algorithm for the final approach and provides sample encodings for each strategy. It also includes an ablation study comparing the performance of the different encoding approaches.
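The prompt-length problem of the clustering strategy can be made concrete with a small count. This is only back-of-the-envelope arithmetic under the assumption that every object in a cluster lists every other object in that cluster exactly once; the function name is hypothetical.

```python
def surrounding_mentions(cluster_size: int) -> int:
    """Count surrounding-entity mentions emitted for one cluster.

    Each of the n objects lists the other n - 1 objects, giving n * (n - 1)
    mentions, so the prompt grows quadratically with cluster size.
    """
    return cluster_size * (cluster_size - 1)


for n in (2, 5, 10, 20):
    print(n, surrounding_mentions(n))
```

By contrast, a parse that writes each element once grows linearly with the number of elements, which is consistent with the motivation for the screen-grab strategies.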

![Figure 3: Performance improvements with each experiment](https://arxiv.org/html/2403.20329v1/extracted/5502621/figures/onscreen_experiments.png)

*Figure 3: Performance improvements with each experiment: (a) Baseline Finetuned LLM, (b) Obtaining screen elements through OCR, (c) Obtaining screen elements through UI elements and Clustering, (d) Adding an extra newline between the instruction and user request, (e) Onscreen Grab, (f) Onscreen Grab with injected turn objects, (g) Onscreen Grab with injected turn objects + requiring lines to be separated by at least Margin, (h) Separating elements in the same line by a tab*

The algorithm encodes the elements displayed on the screen as text. Elements are ordered by the centers of their bounding boxes, top to bottom and then left to right; elements that fall on the same line are separated by a tab, and lines that are at least a margin apart are separated by a newline.

## Appendix B Entity Representations

In Table 8, the paper presents examples of different domains and their corresponding representations used as input for the large language model (LLM). These examples illustrate how entities from each domain are encoded into a textual format suitable for processing by the LLM.

## Appendix C Sample Inputs

This appendix provides sample inputs that illustrate how the data is encoded before being passed to the model.