Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics. To benchmark a model's long-form factuality in open domains, we first use GPT-4 to generate LongFact, a prompt set comprising thousands of questions spanning 38 topics. We then propose that LLM agents can be used as automated evaluators for long-form factuality through a method which we call Search-Augmented Factuality Evaluator (SAFE). SAFE utilizes an LLM to break down a long-form response into a set of individual facts and to evaluate the accuracy of each fact using a multi-step reasoning process comprising sending search queries to Google Search and determining whether a fact is supported by the search results. Furthermore, we propose extending F1 score as an aggregated metric for long-form factuality. To do so, we balance the percentage of supported facts in a response (precision) with the percentage of provided facts relative to a hyperparameter representing a user's preferred response length (recall).
Empirically, we demonstrate that LLM agents can outperform crowdsourced human annotators: on a set of ~16k individual facts, SAFE agrees with crowdsourced human annotators 72% of the time, and on a random subset of 100 disagreement cases, SAFE wins 76% of the time. At the same time, SAFE is more than 20 times cheaper than human annotators. We also benchmark thirteen language models on LongFact across four model families (Gemini, GPT, Claude, and PaLM-2), finding that larger language models generally achieve better long-form factuality. LongFact, SAFE, and all experimental code are available at https://github.com/google-deepmind/long-form-factuality.

## Overview

- Large language models (LLMs) can make factual errors when responding to open-ended questions.
- Researchers developed a benchmark called LongFact to evaluate the long-form factuality of LLMs across many topics.
- They also proposed a method called SAFE to automatically evaluate the factuality of LLM responses using search results.
- SAFE was found to outperform crowdsourced human annotators while being more than 20 times cheaper.
- The researchers benchmarked several LLM families on the LongFact dataset, finding that larger models generally perform better on long-form factuality.

## Plain English Explanation

Large language models are powerful AI systems that can generate human-like text on a wide range of topics. However, these models don't always get the facts right, especially when asked open-ended questions that require generating lengthy, detailed responses.

To address this, the researchers used GPT-4 to create a new benchmark called LongFact, which consists of thousands of questions spanning 38 different topics. They then developed a method called SAFE that uses an LLM as an automated evaluator of long-form responses. SAFE works by breaking down a response into individual facts, sending search queries to Google Search for each fact, and deciding whether the search results support it. The per-fact results are then combined into a single factuality score.

Impressively, the researchers found that SAFE could outperform human annotators at this task, while being much more cost-effective. They also benchmarked several major language model families on the LongFact dataset, revealing that larger models tend to be more factually accurate when generating lengthy responses.

This research is significant because it provides a new way to rigorously evaluate the factuality of language models, which is crucial as these models become more powerful and prevalent. By identifying areas for improvement, this work can help drive the development of more reliable and trustworthy language AI systems.

## Technical Explanation

The key elements of this paper are:

**LongFact Benchmark**: To evaluate long-form factuality, the researchers used GPT-4 to generate a prompt set called LongFact, which contains thousands of questions across 38 different topics. These questions were designed to elicit long, detailed responses from language models.

**Search-Augmented Factuality Evaluator (SAFE)**: The researchers proposed a method called SAFE to automatically evaluate the factual accuracy of LLM responses. SAFE works by:
1. Breaking down the long-form response into individual facts
2. Generating search queries for each fact and checking the web search results
3. Determining whether each fact is supported by the search results
4. Aggregating the per-fact verdicts into an extended F1 score that balances precision (the percentage of supported facts in the response) and recall (the percentage of provided facts relative to a hyperparameter representing the user's preferred response length)
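
The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released SAFE implementation: the names `FactVerdict`, `evaluate_response`, and `f1_at_k` are hypothetical, and the LLM and search calls are passed in as plain functions so that the sketch runs without any external service. The metric assumes that recall counts supported facts against the preferred length `k`, capped at 1, and that a response with no supported facts scores 0.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FactVerdict:
    """One individual fact and whether the search results supported it."""
    fact: str
    supported: bool


def evaluate_response(
    response: str,
    split_into_facts: Callable[[str], List[str]],    # step 1 (an LLM call in practice)
    make_queries: Callable[[str], List[str]],        # step 2 (an LLM call in practice)
    search: Callable[[str], str],                    # step 2 (a search API in practice)
    is_supported: Callable[[str, List[str]], bool],  # step 3 (an LLM call in practice)
) -> List[FactVerdict]:
    """Rate every individual fact in a long-form response."""
    verdicts = []
    for fact in split_into_facts(response):
        results = [search(query) for query in make_queries(fact)]
        verdicts.append(FactVerdict(fact, is_supported(fact, results)))
    return verdicts


def f1_at_k(num_supported: int, num_not_supported: int, k: int) -> float:
    """Step 4: combine precision and length-aware recall into one score.

    Assumed definition: recall = min(num_supported / k, 1).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if num_supported == 0:
        return 0.0
    precision = num_supported / (num_supported + num_not_supported)
    recall = min(num_supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy stand-ins: a one-entry "search index" and exact-match support checking.
    known = {"Paris is the capital of France."}
    verdicts = evaluate_response(
        "Paris is the capital of France. Paris has 90 million residents.",
        split_into_facts=lambda text: [s.strip() + "." for s in text.split(".") if s.strip()],
        make_queries=lambda fact: [fact],
        search=lambda query: query if query in known else "",
        is_supported=lambda fact, results: any(fact == r for r in results),
    )
    supported = sum(v.supported for v in verdicts)
    print(f1_at_k(supported, len(verdicts) - supported, k=2))
```

Passing the LLM and search steps in as callables keeps the control flow visible and makes the sketch easy to test. In a real evaluator each callable would wrap a model prompt or a search request, and the support check would reason over the returned snippets rather than compare strings.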

**Benchmarking Experiments**: The researchers benchmarked 13 language models from 4 different families (Gemini, GPT, Claude, and PaLM-2) on the LongFact dataset. They found that larger models generally achieved better long-form factuality scores.

**Comparison to Human Annotators**: The researchers compared SAFE's factuality evaluations to those of crowdsourced human annotators on a set of roughly 16,000 individual facts. SAFE agreed with the human annotations 72% of the time, and in a random sample of 100 cases where the two disagreed, SAFE won 76% of the time. SAFE was also more than 20 times cheaper than human annotators.

## Critical Analysis

The researchers acknowledge several limitations and areas for future work:

- The LongFact dataset, while extensive, may not cover all possible topics and question types. Expanding the dataset could further test the capabilities of language models.
- The SAFE method relies on web search results, which may not always provide definitive evidence for the accuracy of a fact. Incorporating additional information sources could improve SAFE's factuality assessments.
- The researchers only benchmarked a limited set of language models. Expanding the evaluation to a wider range of models, including multilingual and specialized models, could provide additional insights.
- While SAFE outperformed human annotators, it still made mistakes in a significant number of cases. Improving the underlying language model and search quality could further enhance SAFE's factuality evaluation abilities.

One concern the paper does not address is that language models could "game" SAFE by generating responses crafted to pass the search-based evaluation without being faithful to the underlying facts.

## Conclusion

This research provides a valuable new benchmark and evaluation method for assessing the long-form factuality of large language models. By demonstrating that an automated system can outperform human annotators, the researchers have opened the door to more scalable and cost-effective ways of ensuring the reliability of language AI systems.

As these models become more capable and widely used, maintaining their factual accuracy will be crucial. The LongFact dataset and SAFE evaluation method represent important steps towards developing more trustworthy and transparent language AI that can be reliably used for a variety of real-world applications.