Long-form factuality in large language models

arXiv: 2403.18802

Published 4/5/2024 by Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Jie Huang, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang and 2 others

Abstract

Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics. To benchmark a model's long-form factuality in open domains, we first use GPT-4 to generate LongFact, a prompt set comprising thousands of questions spanning 38 topics. We then propose that LLM agents can be used as automated evaluators for long-form factuality through a method which we call Search-Augmented Factuality Evaluator (SAFE). SAFE utilizes an LLM to break down a long-form response into a set of individual facts and to evaluate the accuracy of each fact using a multi-step reasoning process comprising sending search queries to Google Search and determining whether a fact is supported by the search results. Furthermore, we propose extending F1 score as an aggregated metric for long-form factuality. To do so, we balance the percentage of supported facts in a response (precision) with the percentage of provided facts relative to a hyperparameter representing a user's preferred response length (recall). Empirically, we demonstrate that LLM agents can outperform crowdsourced human annotators - on a set of ~16k individual facts, SAFE agrees with crowdsourced human annotators 72% of the time, and on a random subset of 100 disagreement cases, SAFE wins 76% of the time. At the same time, SAFE is more than 20 times cheaper than human annotators. We also benchmark thirteen language models on LongFact across four model families (Gemini, GPT, Claude, and PaLM-2), finding that larger language models generally achieve better long-form factuality. LongFact, SAFE, and all experimental code are available at https://github.com/google-deepmind/long-form-factuality.

Overview

  • Large language models (LLMs) can make factual errors when responding to open-ended questions.
  • Researchers developed a benchmark called LongFact to evaluate the long-form factuality of LLMs across many topics.
  • They also proposed a method called SAFE to automatically evaluate the factuality of LLM responses using search results.
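  • SAFE agreed with crowdsourced human annotators on 72% of roughly 16,000 facts, was judged correct in 76% of 100 sampled disagreement cases, and was more than 20 times cheaper.
  • The researchers benchmarked thirteen models from four families (Gemini, GPT, Claude, and PaLM-2) on LongFact, finding that larger models generally perform better on long-form factuality.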

Plain English Explanation

Large language models are powerful AI systems that can generate human-like text on a wide range of topics. However, these models don't always get the facts right, especially when asked open-ended questions that require generating lengthy, detailed responses.

To address this, the researchers created a new benchmark called LongFact, which consists of thousands of questions spanning 38 different topics. They then developed a method called SAFE that uses a language model as an automated evaluator of long-form responses. SAFE works by breaking down the response into individual facts, searching the web for supporting evidence, and deciding whether each fact is supported. The per-fact results are then combined into a single factuality score.

The researchers found that SAFE agreed with crowdsourced human annotators 72% of the time. In a random sample of 100 cases where the two disagreed, SAFE was the one judged correct 76% of the time, and it was more than 20 times cheaper. They also benchmarked thirteen models from four major language model families on LongFact, revealing that larger models tend to be more factually accurate when generating lengthy responses.

This research is significant because it provides a new way to rigorously evaluate the factuality of language models, which is crucial as these models become more powerful and prevalent. By identifying areas for improvement, this work can help drive the development of more reliable and trustworthy language AI systems.

Technical Explanation

The key elements of this paper are:

LongFact Benchmark: To evaluate long-form factuality, the researchers used GPT-4 to generate a prompt set called LongFact, which contains thousands of questions across 38 topics. The questions are fact-seeking and open-ended, so they elicit long, detailed responses from language models.

Search-Augmented Factuality Evaluator (SAFE): The researchers proposed a method called SAFE to automatically evaluate the factual accuracy of LLM responses. SAFE works by:

  1. Breaking down the long-form response into individual facts
  2. Generating search queries for each fact and checking the web search results
  3. Determining whether each fact is supported by the search results
  4. Aggregating the per-fact verdicts into an overall factuality score, an extension of F1 that balances precision (percentage of supported facts) against recall (number of provided facts relative to a hyperparameter representing the user's preferred response length)
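
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `evaluate_response`, `f1_at_k`, `FactVerdict`, and the `toy_*` helpers are hypothetical, the LLM and search calls are passed in as plain functions, and the sketch assumes every fact is rated either supported or not supported. Recall is assumed to be the number of supported facts divided by the target count K, capped at 1.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class FactVerdict:
    fact: str
    supported: bool


def f1_at_k(num_supported: int, num_not_supported: int, k: int) -> float:
    """F1-style score (assumed form): precision over rated facts,
    recall as supported facts relative to a target count k, capped at 1."""
    if k <= 0:
        raise ValueError("k must be positive")
    if num_supported == 0:
        return 0.0
    precision = num_supported / (num_supported + num_not_supported)
    recall = min(num_supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)


def evaluate_response(
    response: str,
    split_into_facts: Callable[[str], List[str]],
    make_queries: Callable[[str], List[str]],
    search: Callable[[str], List[str]],
    is_supported: Callable[[str, List[str]], bool],
    k: int,
) -> Tuple[List[FactVerdict], float]:
    """Run the four steps with caller-supplied (hypothetical) components."""
    verdicts = []
    for fact in split_into_facts(response):      # step 1: individual facts
        evidence: List[str] = []
        for query in make_queries(fact):         # step 2: search queries
            evidence.extend(search(query))
        # step 3: decide whether the evidence supports the fact
        verdicts.append(FactVerdict(fact, is_supported(fact, evidence)))
    supported = sum(v.supported for v in verdicts)
    # step 4: aggregate into a single score
    return verdicts, f1_at_k(supported, len(verdicts) - supported, k)


# Toy stand-ins for the LLM and search engine, for demonstration only.
TOY_KNOWN = {"Paris is the capital of France."}


def toy_split(text: str) -> List[str]:
    return [s.strip() + "." for s in text.split(".") if s.strip()]


def toy_queries(fact: str) -> List[str]:
    return [fact]


def toy_search(query: str) -> List[str]:
    return [query] if query in TOY_KNOWN else []


def toy_is_supported(fact: str, evidence: List[str]) -> bool:
    return fact in evidence


if __name__ == "__main__":
    text = "Paris is the capital of France. Paris has 90 million residents."
    verdicts, score = evaluate_response(
        text, toy_split, toy_queries, toy_search, toy_is_supported, k=2
    )
    for v in verdicts:
        print(v)
    print(score)  # 0.5: one of two facts supported, one of two target facts
```

In a real evaluator the splitting, query generation, and support judgment would each be an LLM call, and `search` would query a search engine. Passing them in as functions keeps the scoring logic testable without network access.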

Benchmarking Experiments: The researchers benchmarked 13 language models from 4 different families (Gemini, GPT, Claude, and PaLM-2) on the LongFact dataset. They found that larger models generally achieved better long-form factuality scores.

Comparison to Human Annotators: The researchers compared SAFE's factuality evaluations to those of crowdsourced human annotators on a set of roughly 16,000 individual facts. SAFE agreed with the human annotations 72% of the time, and in a random sample of 100 disagreement cases, SAFE's rating was the correct one 76% of the time. SAFE was also more than 20 times cheaper than human annotators.
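
To put these percentages in absolute terms, the short calculation below works through the arithmetic. It is a rough sketch: the total of 16,000 is an assumption standing in for the "~16k" figure in the abstract, and the function name `disagreement_breakdown` is hypothetical.

```python
def disagreement_breakdown(total_facts: int, agreement_rate: float,
                           sample_size: int, win_rate: float) -> dict:
    """Convert the reported rates into approximate fact counts."""
    agreed = round(total_facts * agreement_rate)
    disagreed = total_facts - agreed
    return {
        "agreed": agreed,
        "disagreed": disagreed,
        "safe_wins_in_sample": round(sample_size * win_rate),
        # Share of all disagreement cases that the sample covers.
        "sample_share": sample_size / disagreed,
    }


if __name__ == "__main__":
    # 16_000 approximates "~16k"; the exact count is not stated here.
    result = disagreement_breakdown(16_000, 0.72, 100, 0.76)
    print(result["agreed"], result["disagreed"])  # 11520 4480
    print(result["safe_wins_in_sample"])          # 76
    print(f"{result['sample_share']:.1%}")        # 2.2%
```

Under these assumptions, the 76% figure comes from about 2% of the roughly 4,500 facts on which SAFE and the annotators disagreed, so it is an estimate from a sample and not a measurement over every disagreement.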

Critical Analysis

The researchers acknowledge several limitations and areas for future work:

  • The LongFact dataset, while extensive, may not cover all possible topics and question types. Expanding the dataset could further test the capabilities of language models.
  • The SAFE method relies on web search results, which may not always provide definitive evidence for the accuracy of a fact. Incorporating additional information sources could improve SAFE's factuality assessments.
  • The researchers only benchmarked a limited set of language models. Expanding the evaluation to a wider range of models, including multilingual and specialized models, could provide additional insights.
  • While SAFE outperformed human annotators, it still made mistakes in a significant number of cases. Improving the underlying language model and search quality could further enhance SAFE's factuality evaluation abilities.

One concern the paper does not address is that language models could "game" SAFE by generating responses crafted to pass the search-based evaluation without being faithful to the underlying facts.

Conclusion

This research provides a valuable new benchmark and evaluation method for assessing the long-form factuality of large language models. By demonstrating that an automated system can outperform human annotators, the researchers have opened the door to more scalable and cost-effective ways of ensuring the reliability of language AI systems.

As these models become more capable and widely used, maintaining their factual accuracy will be crucial. The LongFact dataset and SAFE evaluation method represent important steps towards developing more trustworthy and transparent language AI that can be reliably used for a variety of real-world applications.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers

Yuxia Wang, Revanth Gangi Reddy, Zain Muhammad Mujahid, Arnav Arora, Aleksandr Rubashevskii, Jiahui Geng, Osama Mohammed Afzal, Liangming Pan, Nadav Borenstein, Aditya Pillai, Isabelle Augenstein, Iryna Gurevych, Preslav Nakov

The increased use of large language models (LLMs) across a variety of real-world applications calls for mechanisms to verify the factual accuracy of their outputs. In this work, we present a holistic end-to-end solution for annotating the factuality of LLM-generated responses, which encompasses a multi-stage annotation scheme designed to yield detailed labels concerning the verifiability and factual inconsistencies found in LLM outputs. We further construct an open-domain document-level factuality benchmark in three-level granularity: claim, sentence and document, aiming to facilitate the evaluation of automatic fact-checking systems. Preliminary experiments show that FacTool, FactScore and Perplexity.ai are struggling to identify false claims, with the best F1=0.63 by this annotation solution based on GPT-4. Annotation tool, benchmark and code are available at https://github.com/yuxiaw/Factcheck-GPT.

4/17/2024

Multimodal Large Language Models to Support Real-World Fact-Checking

Jiahui Geng, Yova Kementchedjhieva, Preslav Nakov, Iryna Gurevych

Multimodal large language models (MLLMs) carry the potential to support humans in processing vast amounts of information. While MLLMs are already being used as a fact-checking tool, their abilities and limitations in this regard are understudied. Here we aim to bridge this gap. In particular, we propose a framework for systematically assessing the capacity of current multimodal models to facilitate real-world fact-checking. Our methodology is evidence-free, leveraging only these models' intrinsic knowledge and reasoning capabilities. By designing prompts that extract models' predictions, explanations, and confidence levels, we delve into research questions concerning model accuracy, robustness, and reasons for failure. We empirically find that (1) GPT-4V exhibits superior performance in identifying malicious and misleading multimodal claims, with the ability to explain the unreasonable aspects and underlying motives, and (2) existing open-source models exhibit strong biases and are highly sensitive to the prompt. Our study offers insights into combating false multimodal information and building secure, trustworthy multimodal models. To the best of our knowledge, we are the first to evaluate MLLMs for real-world fact-checking.

4/29/2024

Large Language Models Help Humans Verify Truthfulness -- Except When They Are Convincingly Wrong

Chenglei Si, Navita Goyal, Sherry Tongshuang Wu, Chen Zhao, Shi Feng, Hal Daumé III, Jordan Boyd-Graber

Large Language Models (LLMs) are increasingly used for accessing information on the web. Their truthfulness and factuality are thus of great interest. To help users make the right decisions about the information they get, LLMs should not only provide information but also help users fact-check it. Our experiments with 80 crowdworkers compare language models with search engines (information retrieval systems) at facilitating fact-checking. We prompt LLMs to validate a given claim and provide corresponding explanations. Users reading LLM explanations are significantly more efficient than those using search engines while achieving similar accuracy. However, they over-rely on the LLMs when the explanation is wrong. To reduce over-reliance on LLMs, we ask LLMs to provide contrastive information - explain both why the claim is true and false, and then we present both sides of the explanation to users. This contrastive explanation mitigates users' over-reliance on LLMs, but cannot significantly outperform search engines. Further, showing both search engine results and LLM explanations offers no complementary benefits compared to search engines alone. Taken together, our study highlights that natural language explanations by LLMs may not be a reliable replacement for reading the retrieved passages, especially in high-stakes settings where over-relying on wrong AI explanations could lead to critical consequences.

4/3/2024

Towards a Holistic Evaluation of LLMs on Factual Knowledge Recall

Jiaqing Yuan, Lin Pan, Chung-Wei Hang, Jiang Guo, Jiarong Jiang, Bonan Min, Patrick Ng, Zhiguo Wang

Large language models (LLMs) have shown remarkable performance on a variety of NLP tasks, and are being rapidly adopted in a wide range of use cases. It is therefore of vital importance to holistically evaluate the factuality of their generated outputs, as hallucinations remain a challenging issue. In this work, we focus on assessing LLMs' ability to recall factual knowledge learned from pretraining, and the factors that affect this ability. To that end, we construct FACT-BENCH, a representative benchmark covering 20 domains, 134 property types, 3 answer types, and different knowledge popularity levels. We benchmark 31 models from 10 model families and provide a holistic assessment of their strengths and weaknesses. We observe that instruction-tuning hurts knowledge recall, as pretraining-only models consistently outperform their instruction-tuned counterparts, and positive effects of model scaling, as larger models outperform smaller ones for all model families. However, the best performance from GPT-4 still represents a large gap with the upper-bound. We additionally study the role of in-context exemplars using counterfactual demonstrations, which lead to significant degradation of factual knowledge recall for large models. By further decoupling model known and unknown knowledge, we find the degradation is attributed to exemplars that contradict a model's known knowledge, as well as the number of such exemplars. Lastly, we fine-tune LLaMA-7B in different settings of known and unknown knowledge. In particular, fine-tuning on a model's known knowledge is beneficial, and consistently outperforms fine-tuning on unknown and mixed knowledge. We will make our benchmark publicly available.

4/26/2024