ReALM: Reference Resolution As Language Modeling

2403.20329

Published 4/1/2024 by Joel Ruben Antony Moniz, Soundarya Krishnan, Melis Ozyildirim, Prathamesh Saraf, Halim Cagri Ates, Yuan Zhang, Hong Yu, Nidhi Rajshree
Abstract

Reference resolution is an important problem, one that is essential to understand and successfully handle context of different kinds. This context includes both previous turns and context that pertains to non-conversational entities, such as entities on the user's screen or those running in the background. While LLMs have been shown to be extremely powerful for a variety of tasks, their use in reference resolution, particularly for non-conversational entities, remains underutilized. This paper demonstrates how LLMs can be used to create an extremely effective system to resolve references of various types, by showing how reference resolution can be converted into a language modeling problem, despite involving forms of entities like those on screen that are not traditionally conducive to being reduced to a text-only modality. We demonstrate large improvements over an existing system with similar functionality across different types of references, with our smallest model obtaining absolute gains of over 5% for on-screen references. We also benchmark against GPT-3.5 and GPT-4, with our smallest model achieving performance comparable to that of GPT-4, and our larger models substantially outperforming it.

Introduction

The paper discusses the importance of understanding context, including ambiguous references, for conversational assistants to communicate effectively with users. While recent large language models (LLMs) can handle some contextual understanding, the authors argue for the continued value of traditional NLP pipelines in certain scenarios.

Specifically, they highlight four cases where a pipeline approach may be preferable: 1) on low-power devices with limited computing resources, where large end-to-end models are infeasible; 2) when integrating with existing APIs or components, where overhauling the system around a single LLM may be cumbersome; 3) when a modular design is wanted, so that improved reference resolution modules can be swapped in transparently; and 4) when reference resolution must handle not just conversational context but also on-screen context.

The authors propose using smaller, fine-tuned language models specifically for reference resolution. To handle on-screen context, they suggest reconstructing the screen as a textual representation with entities tagged, allowing the language model to "see" and resolve references to on-screen elements. This is presented as a novel approach for enabling LLMs to understand on-screen context.

Related Work and Motivation

The paper discusses the need for conversational agents to understand on-screen references, which differ from visual and deictic references. On-screen references tend to be more structured and textual, enabling a text-only approach without visual components. Tasks involving them are also often action-oriented rather than question-answering based, and the screens involved are synthetic rather than natural images.

The paper notes that jointly handling conversational and on-screen references has been relatively unexplored. While vision transformers and pre-trained models have gained prominence for visual understanding tasks, they are trained on natural images rather than screenshots, have different distributions, and can be computationally expensive.

The paper identifies limitations in existing approaches, such as relying on manually onboarding new entity types, treating each type distinctly without leveraging similarities, using hand-crafted rules and heuristics that lack robustness and semantic understanding, and classifying entity relevance independently without considering the whole screen context.

Task

The task involves identifying relevant entities from 3 types: on-screen entities currently displayed, conversational entities mentioned previously, and background entities from processes not directly visible. The goal is to extract the entities pertinent to the user's current query. The task is framed as a multiple choice problem, where the model outputs the relevant entities from the options shown on the user's screen, or "None of these" if applicable. During evaluation, any permutation of the correct entities is accepted as a valid answer.
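The permutation-insensitive scoring described above can be sketched as a set comparison. This is a minimal illustration, not the authors' evaluation code: the comma-separated label format and the function names `parse_prediction` and `is_correct` are assumptions made for the example.

```python
def parse_prediction(text: str) -> frozenset:
    """Turn a model output such as '2, 5' into a set of entity labels.

    Assumption: labels are comma-separated and 'None of these' means
    that no entity is relevant.
    """
    cleaned = text.strip()
    if cleaned.lower() == "none of these":
        return frozenset()
    return frozenset(part.strip() for part in cleaned.split(",") if part.strip())


def is_correct(prediction: str, gold: str) -> bool:
    """Order-insensitive exact match between predicted and gold entity sets."""
    return parse_prediction(prediction) == parse_prediction(gold)
```

Comparing sets rather than strings means "2, 5" and "5, 2" count as the same answer, while a partial answer such as "2" is still marked wrong.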

Datasets

The datasets used contain user queries and a list of entities, along with the ground-truth relevant entity(ies) for each query. Each entity has information like type, name, and other details. If on-screen context exists, the entity's bounding box and surrounding objects/properties are included.

For conversational data, annotators were shown synthetic entity lists and asked to provide queries that unambiguously reference a chosen entity. For example, a query might refer to a specific business with "Take me to the one second from the bottom."

Synthetic data was generated from two templates: one containing mentions, entities, and slot values, and another containing query variations for the entity references defined in the first. Queries were created by substituting the references into these variations.
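A rough sketch of this kind of template substitution is shown below. The template contents, the `{mention}` placeholder, and the names `BASE`, `QUERY_TEMPLATES`, and `generate_queries` are all hypothetical; the summary does not give the actual template format.

```python
from itertools import product

# Hypothetical first template: an entity type, its slot values, and the
# mentions (referring phrases) that can point at such an entity.
BASE = {
    "entity_type": "phone_number",
    "slot_values": ["555-0100", "555-0199"],
    "mentions": ["that number", "the second one"],
}

# Hypothetical second template: query variations with a {mention} slot.
QUERY_TEMPLATES = ["call {mention}", "save {mention} to my contacts"]


def generate_queries(base: dict, templates: list) -> list:
    """Substitute every mention into every query variation."""
    return [
        {"query": template.format(mention=mention), "entity_type": base["entity_type"]}
        for template, mention in product(templates, base["mentions"])
    ]
```

With two query variations and two mentions, `generate_queries(BASE, QUERY_TEMPLATES)` yields four synthetic queries.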

For on-screen data, annotators first classified displayed information into entity types like phone numbers and emails, then provided queries for that information. In a second phase, other annotators identified which entity in the list was referenced by each query.

Models

The paper compares the proposed model ReALM with two baseline approaches: a re-implementation of the reference resolver from a previous paper (MARRS), and ChatGPT using GPT-3.5 and GPT-4.

For the MARRS baseline, they trained a re-implementation of the system proposed in a previous paper, which is not based on large language models. This baseline was specifically designed for reference resolution.

For the ChatGPT baseline, they used GPT-3.5 and GPT-4. GPT-3.5 received only the text prompt, while GPT-4, which can also take images as input, received the prompt along with a screenshot for on-screen reference resolution.

Their approach, ReALM, involves fine-tuning a FLAN-T5 language model. They convert the data into a sentence format to feed to the model, shuffling the entities to prevent overfitting to positions. For conversational references, they assume two types: type-based (relying on entity types) and descriptive (relying on properties). Their encoding captures both types and properties.
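The idea of serializing entities into text and shuffling them can be sketched as follows. The prompt wording, the `type` and `description` fields, and the function name `build_prompt` are assumptions for illustration only; the paper's exact input format is not given in this summary.

```python
import random
from typing import List, Optional, Tuple


def build_prompt(query: str, entities: List[dict],
                 seed: Optional[int] = None) -> Tuple[str, List[dict]]:
    """Serialize entities as numbered lines in a random order.

    Shuffling keeps the model from learning that, say, the first listed
    entity is usually the answer. The shuffled list is returned as well so
    that predicted numbers can be mapped back to entities.
    """
    rng = random.Random(seed)
    shuffled = list(entities)  # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    lines = [
        f"{i}. type: {e['type']} | {e['description']}"
        for i, e in enumerate(shuffled, start=1)
    ]
    prompt = (
        "Entities:\n" + "\n".join(lines)
        + f"\nUser request: {query}\nRelevant entities:"
    )
    return prompt, shuffled
```

Each line carries both the entity type and a property description, so the same encoding can serve type-based references ("call that number") and descriptive ones ("the one on Main Street").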

For on-screen references, they assume upstream detectors can parse screen text and extract entities with types, bounding boxes, and surrounding text. They use a novel algorithm to encode the screen layout as text based on sorting bounding box centers top-to-bottom and left-to-right with line breaks.
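A minimal sketch of such a layout encoding is given below, assuming each element comes with a pixel bounding box. The `Element` class, the `margin` default, and the choice to compare each element against the first element of the current line are illustrative assumptions, not the paper's exact algorithm.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Element:
    text: str
    left: float
    top: float
    width: float
    height: float

    @property
    def center(self) -> Tuple[float, float]:
        """(x, y) centre of the bounding box."""
        return (self.left + self.width / 2, self.top + self.height / 2)


def encode_screen(elements: List[Element], margin: float = 10.0) -> str:
    """Lay elements out as text: one row per screen line, tabs within a row."""
    # Sort by vertical centre first, then horizontal centre.
    ordered = sorted(elements, key=lambda e: (e.center[1], e.center[0]))
    rows: List[List[Element]] = []
    for element in ordered:
        # Elements whose vertical centres are within `margin` of the first
        # element in the current row are treated as being on the same line.
        if rows and abs(element.center[1] - rows[-1][0].center[1]) <= margin:
            rows[-1].append(element)
        else:
            rows.append([element])
    return "\n".join(
        "\t".join(e.text for e in sorted(row, key=lambda e: e.center[0]))
        for row in rows
    )
```

For a business name and a "Call" button sitting side by side above a phone number, this produces the name and button on one tab-separated line and the number on the next.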

Results

The paper presents their experimental results, indicating that their proposed approach outperforms the MARRS model across all types of datasets. It also surpasses the performance of GPT-3.5, which has a significantly larger number of parameters. Their approach performs comparably to the latest GPT-4, despite being a much lighter and faster model. The gains are particularly notable on onscreen datasets, where their model with textual encoding performs almost as well as GPT-4, even though the latter is provided with screenshots.

The authors also experiment with models of different sizes, observing that performance improves across all datasets as model size increases, but the difference is most pronounced for the complex onscreen datasets.

In an analysis section, the paper explores the zero-shot performance of their model on an unseen domain, Alarms. Their approach and GPT-4 perform similarly well on this unseen test set, outperforming a finetuned model.

The paper also highlights that their model, ReALM, demonstrates superior understanding of domain-specific queries compared to GPT-4 due to finetuning on user requests. An example illustrates GPT-4 incorrectly assuming a reference is only about a setting, while the ground truth includes a home automation device, which ReALM can recognize due to its domain-specific training.

Conclusion and Future Work

The paper demonstrates how large language models can be used for reference resolution by encoding entities as natural text. A novel approach represents on-screen entities and their relative positions in a textual format, which is then passed to the language model. This method, called ReALM, outperforms previous approaches and performs comparably to GPT-4, despite having fewer parameters. ReALM even surpasses GPT-4 for domain-specific user utterances, making it an ideal choice for practical reference resolution systems that can run on-device without compromising performance.

However, while ReALM effectively encodes the position of entities, it may lose nuanced positional information required for complex user queries. The authors suggest exploring more complex approaches, such as dividing the screen into a grid and encoding relative spatial positions into text, as a promising avenue for future research.
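One way the grid idea could look is sketched below. This is purely speculative: the paper only names the direction as future work, and the fixed 3x3 grid, the wording of the position labels, and the function names are assumptions.

```python
from typing import Tuple

VERTICAL = ["top", "middle", "bottom"]
HORIZONTAL = ["left", "center", "right"]


def grid_cell(cx: float, cy: float, screen_w: float, screen_h: float) -> Tuple[int, int]:
    """Map a bounding-box centre to a (row, col) cell of a 3x3 grid."""
    col = min(int(cx / screen_w * 3), 2)  # clamp points on the right edge
    row = min(int(cy / screen_h * 3), 2)  # clamp points on the bottom edge
    return row, col


def describe_position(cx: float, cy: float, screen_w: float, screen_h: float) -> str:
    """Render the grid cell as a short phrase that can be placed in a prompt."""
    row, col = grid_cell(cx, cy, screen_w, screen_h)
    return f"{VERTICAL[row]} {HORIZONTAL[col]}"
```

A phrase such as "bottom right" could then be attached to each entity's text so that queries like "the button in the corner" have something to match against.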

Ethics Statement

The system allows for constraining the language model's output or applying post-processing to prevent unexpected generations. However, the authors state that in practice, they encounter very little hallucination or fabricated content from the language model. As a result, they do not constrain the model's decoding or generation process.
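The kind of post-processing mentioned here could be as simple as discarding any predicted label that is not among the candidates. The sketch below is hypothetical (the authors report not needing such a step), and the comma-separated output format and the name `filter_to_candidates` are assumptions.

```python
from typing import List, Set


def filter_to_candidates(output: str, valid_labels: Set[str]) -> List[str]:
    """Keep only predicted labels that name a real candidate entity.

    Labels outside `valid_labels` (for example a hallucinated entity number)
    are dropped, and repeated labels are kept once, in first-seen order.
    """
    seen: Set[str] = set()
    kept: List[str] = []
    for part in output.split(","):
        label = part.strip()
        if label in valid_labels and label not in seen:
            seen.add(label)
            kept.append(label)
    return kept
```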

Acknowledgements

The authors express gratitude to Stephen Pulman, Leon Liyang Zhang, Jiarui Lu, Jeff Nichols, Shruti Bhargava, Dhivya Piraviperumal, and Junhan Chen for their assistance and feedback throughout the research process.

Appendix A Encoding onscreen entities

The paper presents visual examples of how screen grabs might appear when parsed and processed by the system. These sample representations are displayed in Figure 2 of the paper.

(a) Onscreen Capture 1

The paper explores different strategies for encoding on-screen elements.

  1. Clustering: Objects on the screen are spatially clustered into semantic groups. Users can refer to nearby bounding boxes by a descriptive title. However, as the number of entities in a cluster increases, the prompt length explodes since each object lists all other objects as surrounding entities.

  2. Onscreen Grab: The screen is parsed, but turn objects are provided as a separate list instead of being annotated within the parse.

  3. Onscreen Grab with Injected Turn Objects: This is the final approach used. The screen is parsed, and turn objects are annotated within the parse itself.

The paper presents an algorithm for the final approach and provides sample encodings for each strategy. It also includes an ablation study comparing the performance of the different encoding approaches.
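The difference between the second and third strategies can be illustrated with a small sketch. The `[[id|text]]` marker syntax, the "Entities:" heading, and the function names are made up for the example; the paper's actual annotation format is not given in this summary.

```python
from typing import Dict, List


def render(rows: List[List[str]]) -> str:
    """Plain screen parse: one line per row, tabs between elements."""
    return "\n".join("\t".join(row) for row in rows)


def separate_list(rows: List[List[str]], turn_ids: Dict[str, int]) -> str:
    """Onscreen Grab: the parse, followed by turn objects as their own list."""
    listing = "\n".join(f"{i}. {text}" for text, i in turn_ids.items())
    return f"{render(rows)}\n\nEntities:\n{listing}"


def injected(rows: List[List[str]], turn_ids: Dict[str, int]) -> str:
    """Injected turn objects: tag each turn object where it appears in the parse."""
    def tag(text: str) -> str:
        return f"[[{turn_ids[text]}|{text}]]" if text in turn_ids else text

    return render([[tag(text) for text in row] for row in rows])
```

In the injected form each candidate keeps its place in the layout, so the model sees the entity together with its neighbouring text rather than having to match a separate list back to the screen.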

Figure 3: Performance improvements with each experiment – (a) Baseline Finetuned LLM, (b) Obtaining screen elements through OCR, (c) Obtaining screen elements through UI elements and Clustering (d) Adding an extra newline between the instruction and user request, (e) Onscreen Grab, (f) Onscreen Grab with injected turn objects, (g) Onscreen Grab with injected turn object + needing lines to be separated by at least Margin, (h) Separating elements in the same line by a tab

The algorithm encodes the elements displayed on the screen as text, using the parsed elements and their bounding boxes to lay them out in reading order.

Appendix B Entity Representations

In Table 8, the paper presents examples of different domains and their corresponding representations used as input to the large language model (LLM). These examples illustrate how entities from each domain are encoded as text suitable for processing by the LLM.

Appendix C Sample Inputs

This appendix presents sample inputs that illustrate how data is encoded before being passed to the model.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Transcrib3D: 3D Referring Expression Resolution through Large Language Models

Jiading Fang, Xiangshan Tan, Shengjie Lin, Igor Vasiljevic, Vitor Guizilini, Hongyuan Mei, Rares Ambrus, Gregory Shakhnarovich, Matthew R Walter

If robots are to work effectively alongside people, they must be able to interpret natural language references to objects in their 3D environment. Understanding 3D referring expressions is challenging -- it requires the ability to both parse the 3D structure of the scene and correctly ground free-form language in the presence of distraction and clutter. We introduce Transcrib3D, an approach that brings together 3D detection methods and the emergent reasoning capabilities of large language models (LLMs). Transcrib3D uses text as the unifying medium, which allows us to sidestep the need to learn shared representations connecting multi-modal inputs, which would require massive amounts of annotated 3D data. As a demonstration of its effectiveness, Transcrib3D achieves state-of-the-art results on 3D reference resolution benchmarks, with a great leap in performance from previous multi-modality baselines. To improve upon zero-shot performance and facilitate local deployment on edge computers and robots, we propose self-correction for fine-tuning that trains smaller models, resulting in performance close to that of large models. We show that our method enables a real robot to perform pick-and-place tasks given queries that contain challenging referring expressions. Project site is at https://ripl.github.io/Transcrib3D.

5/1/2024

Bespoke Large Language Models for Digital Triage Assistance in Mental Health Care

Niall Taylor, Andrey Kormilitzin, Isabelle Lorge, Alejo Nevado-Holgado, Dan W Joyce

Contemporary large language models (LLMs) may have utility for processing unstructured, narrative free-text clinical data contained in electronic health records (EHRs) -- a particularly important use-case for mental health where a majority of routinely-collected patient data lacks structured, machine-readable content. A significant problem for the United Kingdom's National Health Service (NHS) is the long waiting lists for specialist mental healthcare. According to NHS data, in each month of 2023, there were between 370,000 and 470,000 individual new referrals into secondary mental healthcare services. Referrals must be triaged by clinicians, using clinical information contained in the patient's EHR to arrive at a decision about the most appropriate mental healthcare team to assess and potentially treat these patients. The ability to efficiently recommend a relevant team by ingesting potentially voluminous clinical notes could help services both reduce referral waiting times and, with the right technology, improve the evidence available to justify triage decisions. We present and evaluate three different approaches for LLM-based, end-to-end ingestion of variable-length clinical EHR data to assist clinicians when triaging referrals. Our model is able to deliver triage recommendations consistent with existing clinical practices and its architecture was implemented on a single GPU, making it practical for implementation in resource-limited NHS environments where private implementations of LLM technology will be necessary to ensure confidential clinical data is appropriately controlled and governed.

4/1/2024

Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models

Hongyi Zhu, Jia-Hong Huang, Stevan Rudinac, Evangelos Kanoulas

Image search stands as a pivotal task in multimedia and computer vision, finding applications across diverse domains, ranging from internet search to medical diagnostics. Conventional image search systems operate by accepting textual or visual queries, retrieving the top-relevant candidate results from the database. However, prevalent methods often rely on single-turn procedures, introducing potential inaccuracies and limited recall. These methods also face challenges, such as vocabulary mismatch and the semantic gap, constraining their overall effectiveness. To address these issues, we propose an interactive image retrieval system capable of refining queries based on user relevance feedback in a multi-turn setting. This system incorporates a vision language model (VLM) based image captioner to enhance the quality of text-based queries, resulting in more informative queries with each iteration. Moreover, we introduce a large language model (LLM) based denoiser to refine text-based query expansions, mitigating inaccuracies in image descriptions generated by captioning models. To evaluate our system, we curate a new dataset by adapting the MSR-VTT video retrieval dataset to the image retrieval task, offering multiple relevant ground truth images for each query. Through comprehensive experiments, we validate the effectiveness of our proposed system against baseline methods, achieving state-of-the-art performance with a notable 10% improvement in terms of recall. Our contributions encompass the development of an innovative interactive image retrieval system, the integration of an LLM-based denoiser, the curation of a meticulously designed evaluation dataset, and thorough experimental validation.

4/30/2024

Recommender Systems in the Era of Large Language Models (LLMs)

Zihuai Zhao, Wenqi Fan, Jiatong Li, Yunqing Liu, Xiaowei Mei, Yiqi Wang, Zhen Wen, Fei Wang, Xiangyu Zhao, Jiliang Tang, Qing Li

With the prosperity of e-commerce and web applications, Recommender Systems (RecSys) have become an important component of our daily life, providing personalized suggestions that cater to user preferences. While Deep Neural Networks (DNNs) have made significant advancements in enhancing recommender systems by modeling user-item interactions and incorporating textual side information, DNN-based methods still face limitations, such as difficulties in understanding users' interests and capturing textual side information, inabilities in generalizing to various recommendation scenarios and reasoning on their predictions, etc. Meanwhile, the emergence of Large Language Models (LLMs), such as ChatGPT and GPT4, has revolutionized the fields of Natural Language Processing (NLP) and Artificial Intelligence (AI), due to their remarkable abilities in fundamental responsibilities of language understanding and generation, as well as impressive generalization and reasoning capabilities. As a result, recent studies have attempted to harness the power of LLMs to enhance recommender systems. Given the rapid evolution of this research direction in recommender systems, there is a pressing need for a systematic overview that summarizes existing LLM-empowered recommender systems, to provide researchers in relevant fields with an in-depth understanding. Therefore, in this paper, we conduct a comprehensive review of LLM-empowered recommender systems from various aspects including Pre-training, Fine-tuning, and Prompting. More specifically, we first introduce representative methods to harness the power of LLMs (as a feature encoder) for learning representations of users and items. Then, we review recent techniques of LLMs for enhancing recommender systems from three paradigms, namely pre-training, fine-tuning, and prompting. Finally, we comprehensively discuss future directions in this emerging field.

4/23/2024