Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?

arXiv: 2405.05904

Published 5/14/2024 by Zorik Gekhman, Gal Yona, Roee Aharoni, Matan Eyal, Amir Feder, Roi Reichart, Jonathan Herzig

Abstract

When large language models are aligned via supervised fine-tuning, they may encounter new factual information that was not acquired through pre-training. It is often conjectured that this can teach the model the behavior of hallucinating factually incorrect responses, as the model is trained to generate facts that are not grounded in its pre-existing knowledge. In this work, we study the impact of such exposure to new knowledge on the capability of the fine-tuned model to utilize its pre-existing knowledge. To this end, we design a controlled setup, focused on closed-book QA, where we vary the proportion of the fine-tuning examples that introduce new knowledge. We demonstrate that large language models struggle to acquire new factual knowledge through fine-tuning, as fine-tuning examples that introduce new knowledge are learned significantly slower than those consistent with the model's knowledge. However, we also find that as the examples with new knowledge are eventually learned, they linearly increase the model's tendency to hallucinate. Taken together, our results highlight the risk in introducing new factual knowledge through fine-tuning, and support the view that large language models mostly acquire factual knowledge through pre-training, whereas fine-tuning teaches them to use it more efficiently.

Overview

  • Fine-tuning a large language model on information it did not see during pre-training may teach it to hallucinate factually incorrect responses
  • Researchers designed a controlled experiment to study the impact of new knowledge on a model's ability to utilize its pre-existing knowledge
  • The study found that large language models have difficulty acquiring new factual knowledge through fine-tuning, but as new information is learned, it can linearly increase the model's tendency to hallucinate

Plain English Explanation

Large language models, like GPT-3 or PaLM, are trained on vast amounts of online text data to become highly capable at tasks like answering questions or generating human-like text. However, when these models are further trained, or "fine-tuned," on specific tasks using supervised learning, they may encounter new factual information that was not part of their original training.

Researchers are concerned that this exposure to new knowledge could cause the model to start "hallucinating" - generating factually incorrect responses that are not grounded in its pre-existing knowledge. The idea is that the model is being trained to produce specific facts, even if those facts don't match what the model already "knows."

To better understand this issue, the researchers in this study designed a controlled experiment focused on closed-book question answering. They varied the proportion of fine-tuning examples that introduced new knowledge the model didn't have before.

The key findings are:

  • Large language models struggle to acquire new factual knowledge through fine-tuning. Examples with new information are learned much slower than those consistent with the model's existing knowledge.
  • However, once the model does eventually learn the new information, its tendency to hallucinate (generate incorrect responses) increases.

Overall, the results suggest that while fine-tuning can help language models use their existing knowledge more efficiently, introducing significant new factual information is risky and may lead to unreliable or inaccurate outputs. The researchers argue that the bulk of a language model's factual knowledge should come from its initial pre-training, not from fine-tuning.

Technical Explanation

The researchers designed a controlled experiment to study the impact of exposing large language models to new factual information during fine-tuning. They focused on the closed-book question answering task, where models must answer questions without access to external information sources.

The experimental setup involved fine-tuning a pre-trained language model on a question answering dataset, but with a twist. The researchers varied the proportion of examples in the fine-tuning dataset that contained new knowledge - facts the model did not have in its original pre-training.
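
The mixing step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `build_finetuning_mix`, its parameters, and the representation of examples as `(question, answer)` tuples are all assumptions made for the example.

```python
import random

def build_finetuning_mix(known, unknown, unknown_fraction, total, seed=0):
    """Sample `total` examples so that about `unknown_fraction` of them
    come from the `unknown` pool and the rest from the `known` pool.

    All names here are hypothetical; examples are (question, answer) tuples.
    """
    if not 0.0 <= unknown_fraction <= 1.0:
        raise ValueError("unknown_fraction must be between 0 and 1")
    n_unknown = round(unknown_fraction * total)
    n_known = total - n_unknown
    if n_unknown > len(unknown) or n_known > len(known):
        raise ValueError("not enough examples in one of the pools")
    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    mix = rng.sample(known, n_known) + rng.sample(unknown, n_unknown)
    rng.shuffle(mix)  # avoid presenting all known examples first
    return mix
```

Calling this with several values of `unknown_fraction` while holding `total` fixed would give a family of fine-tuning sets that differ only in how much new knowledge they contain.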

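By systematically changing the ratio of "new knowledge" to "known knowledge" examples, the researchers could observe how this affected the model's ability to learn and utilize its pre-existing factual information. To decide which examples count as new, the paper introduces a sampling-based categorization it calls SliCK, which labels each question-answer pair by how reliably the pre-trained model already produces the correct answer. Hallucination is then assessed through the fine-tuned model's closed-book accuracy on held-out questions.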
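
One simple way to separate "known" from "new" examples is to sample several answers from the pre-trained model and check whether any of them matches the reference answer. The sketch below illustrates that idea only; `sample_answers` is a hypothetical callable standing in for whatever model interface is available, and the exact-match rule and `n_samples` default are assumptions rather than details taken from the paper.

```python
from typing import Callable, List, Tuple

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().strip().split())

def is_known(question: str, gold_answer: str,
             sample_answers: Callable[[str, int], List[str]],
             n_samples: int = 8) -> bool:
    """Treat a fact as known if any sampled answer exactly matches the gold answer.

    `sample_answers(question, n)` is a hypothetical function that returns
    n answers sampled from the pre-trained model.
    """
    gold = normalize(gold_answer)
    return any(normalize(a) == gold for a in sample_answers(question, n_samples))

def split_by_knowledge(examples: List[Tuple[str, str]],
                       sample_answers: Callable[[str, int], List[str]],
                       n_samples: int = 8):
    """Partition (question, answer) pairs into known and unknown lists."""
    known, unknown = [], []
    for question, answer in examples:
        bucket = known if is_known(question, answer, sample_answers, n_samples) else unknown
        bucket.append((question, answer))
    return known, unknown
```

A binary known/unknown split is the coarsest possible choice; a finer scheme could grade examples by the fraction of samples that are correct.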

The key findings were:

  1. Large language models struggle to acquire new factual knowledge through fine-tuning. Examples containing new information were learned significantly slower than those consistent with the model's pre-existing knowledge.
  2. However, as the new knowledge examples were eventually learned, they linearly increased the model's tendency to hallucinate - generating factually incorrect responses.
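
To make "linearly" concrete: if an error rate on held-out questions were measured for several fine-tuning runs, a least-squares line could be fit against the fraction of new-knowledge examples the model has learned. The sketch below shows only the fitting step; the function name `fit_line` and the choice of variables are assumptions for illustration, and no numbers from the paper are reproduced here.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept.

    xs: e.g. fraction of new-knowledge examples learned (hypothetical input)
    ys: e.g. held-out error rate for the same runs (hypothetical input)
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

Under this reading, a positive slope would indicate that each additional learned new-knowledge example adds a roughly constant amount of error on held-out questions.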

These results suggest that the bulk of a language model's factual knowledge should come from its initial pre-training, rather than relying on fine-tuning to inject significant new information. The researchers argue this is because fine-tuning primarily teaches the model to use its existing knowledge more efficiently, rather than fundamentally expanding its knowledge base.

Critical Analysis

The researchers provide a nuanced and well-designed study that sheds light on an important issue in large language model development. By carefully controlling the exposure to new information during fine-tuning, they were able to isolate and quantify the effects on the model's behavior.

One key strength of the study is that it classifies each fine-tuning example by whether the model already knows the answer (the paper's SliCK categories) and then tracks when examples of each kind are fit during training. This ties hallucination to specific training examples, beyond just looking at raw task performance.

However, the study is limited to the closed-book question answering task. It would be interesting to see if similar dynamics play out in other fine-tuning scenarios, such as open-ended text generation or multi-task learning. The researchers acknowledge this as an area for future work.

Additionally, the study focuses on the high-level behaviors of the models, but does not delve into the underlying mechanisms that lead to hallucination. Further research could explore the model internals and architectural choices that contribute to this phenomenon.

Overall, this paper provides valuable insights into the challenges of expanding the knowledge of large language models through fine-tuning. The findings underscore the importance of robust pre-training as the foundation for factual knowledge, rather than relying on fine-tuning alone to build reliable and trustworthy AI systems.

Conclusion

This study highlights the risks of exposing large language models to significant new factual information during fine-tuning. While fine-tuning can help models use their existing knowledge more efficiently, the researchers found that models struggle to acquire new factual knowledge this way.

Crucially, as new information is gradually learned through fine-tuning, it can actually increase the model's tendency to hallucinate - generating factually incorrect responses that are not grounded in its pre-existing knowledge. This suggests that the bulk of a language model's factual knowledge should come from its initial pre-training, rather than fine-tuning.

The findings of this paper have important implications for the development of reliable and trustworthy AI systems. They underscore the need for robust pre-training strategies and careful fine-tuning procedures to ensure language models can utilize their knowledge accurately and avoid hallucinating. As the field of large language models continues to evolve, studies like this will be crucial for guiding best practices and addressing the fundamental challenges in this area.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Unfamiliar Finetuning Examples Control How Language Models Hallucinate

Katie Kang, Eric Wallace, Claire Tomlin, Aviral Kumar, Sergey Levine

Large language models are known to hallucinate when faced with unfamiliar queries, but the underlying mechanisms that govern how models hallucinate are not yet fully understood. In this work, we find that unfamiliar examples in the models' finetuning data -- those that introduce concepts beyond the base model's scope of knowledge -- are crucial in shaping these errors. In particular, we find that an LLM's hallucinated predictions tend to mirror the responses associated with its unfamiliar finetuning examples. This suggests that by modifying how unfamiliar finetuning examples are supervised, we can influence a model's responses to unfamiliar queries (e.g., say "I don't know"). We empirically validate this observation in a series of controlled experiments involving SFT, RL, and reward model finetuning on TriviaQA and MMLU. Our work further investigates RL finetuning strategies for improving the factuality of long-form model generations. We find that, while hallucinations from the reward model can significantly undermine the effectiveness of RL factuality finetuning, strategically controlling how reward models hallucinate can minimize these negative effects. Leveraging our previous observations on controlling hallucinations, we propose an approach for learning more reliable reward models, and show that they improve the efficacy of RL factuality finetuning in long-form biography and book/movie plot generation tasks.

5/30/2024

On Large Language Models' Hallucination with Regard to Known Facts

Che Jiang, Biqing Qi, Xiangyu Hong, Dayuan Fu, Yang Cheng, Fandong Meng, Mo Yu, Bowen Zhou, Jie Zhou

Large language models are successful in answering factoid questions but are also prone to hallucination. We investigate the phenomenon of LLMs possessing correct answer knowledge yet still hallucinating from the perspective of inference dynamics, an area not previously covered in studies on hallucinations. We are able to conduct this analysis via two key ideas. First, we identify the factual questions that query the same triplet knowledge but result in different answers. The difference between the model behaviors on the correct and incorrect outputs hence suggests the patterns when hallucinations happen. Second, to measure the pattern, we utilize mappings from the residual streams to vocabulary space. We reveal the different dynamics of the output token probabilities along the depths of layers between the correct and hallucinated cases. In hallucinated cases, the output token's information rarely demonstrates abrupt increases and consistent superiority in the later stages of the model. Leveraging the dynamic curve as a feature, we build a classifier capable of accurately detecting hallucinatory predictions with an 88% success rate. Our study sheds light on understanding the reasons for LLMs' hallucinations on their known facts, and more importantly, on accurately predicting when they are hallucinating.

4/1/2024

Towards a Holistic Evaluation of LLMs on Factual Knowledge Recall

Jiaqing Yuan, Lin Pan, Chung-Wei Hang, Jiang Guo, Jiarong Jiang, Bonan Min, Patrick Ng, Zhiguo Wang

Large language models (LLMs) have shown remarkable performance on a variety of NLP tasks, and are being rapidly adopted in a wide range of use cases. It is therefore of vital importance to holistically evaluate the factuality of their generated outputs, as hallucinations remain a challenging issue. In this work, we focus on assessing LLMs' ability to recall factual knowledge learned from pretraining, and the factors that affect this ability. To that end, we construct FACT-BENCH, a representative benchmark covering 20 domains, 134 property types, 3 answer types, and different knowledge popularity levels. We benchmark 31 models from 10 model families and provide a holistic assessment of their strengths and weaknesses. We observe that instruction-tuning hurts knowledge recall, as pretraining-only models consistently outperform their instruction-tuned counterparts, and positive effects of model scaling, as larger models outperform smaller ones for all model families. However, the best performance from GPT-4 still represents a large gap with the upper-bound. We additionally study the role of in-context exemplars using counterfactual demonstrations, which lead to significant degradation of factual knowledge recall for large models. By further decoupling model known and unknown knowledge, we find the degradation is attributed to exemplars that contradict a model's known knowledge, as well as the number of such exemplars. Lastly, we fine-tune LLaMA-7B in different settings of known and unknown knowledge. In particular, fine-tuning on a model's known knowledge is beneficial, and consistently outperforms fine-tuning on unknown and mixed knowledge. We will make our benchmark publicly available.

4/26/2024

Self-Alignment for Factuality: Mitigating Hallucinations in LLMs via Self-Evaluation

Xiaoying Zhang, Baolin Peng, Ye Tian, Jingyan Zhou, Lifeng Jin, Linfeng Song, Haitao Mi, Helen Meng

Despite showing increasingly human-like abilities, large language models (LLMs) often struggle with factual inaccuracies, i.e. hallucinations, even when they hold relevant knowledge. To address these hallucinations, current approaches typically necessitate high-quality human factuality annotations. In this work, we explore Self-Alignment for Factuality, where we leverage the self-evaluation capability of an LLM to provide training signals that steer the model towards factuality. Specifically, we incorporate Self-Eval, a self-evaluation component, to prompt an LLM to validate the factuality of its own generated responses solely based on its internal knowledge. Additionally, we design Self-Knowledge Tuning (SK-Tuning) to augment the LLM's self-evaluation ability by improving the model's confidence estimation and calibration. We then utilize these self-annotated responses to fine-tune the model via Direct Preference Optimization algorithm. We show that the proposed self-alignment approach substantially enhances factual accuracy over Llama family models across three key knowledge-intensive tasks on TruthfulQA and BioGEN.

6/12/2024