Large Language Models (LLMs) often generate erroneous outputs, known as hallucinations, due to their limitations in discerning questions beyond their knowledge scope. While addressing hallucination has been a focal point in research, previous efforts primarily concentrate on enhancing correctness without giving due consideration to the significance of rejection mechanisms. In this paper, we conduct a comprehensive examination of the role of rejection, introducing the notion of model reliability along with corresponding metrics. These metrics measure the model's ability to provide accurate responses while adeptly rejecting questions exceeding its knowledge boundaries, thereby minimizing hallucinations. To improve the inherent reliability of LLMs, we present a novel alignment framework called Reinforcement Learning from Knowledge Feedback (RLKF). RLKF leverages knowledge feedback to dynamically determine the model's knowledge boundary and trains a reliable reward model to encourage the refusal of out-of-knowledge questions. Experimental results on mathematical questions affirm the substantial efficacy of RLKF in significantly enhancing LLM reliability.

## Overview

- This paper explores a method for training large language models (LLMs) to reliably refuse questions that are outside of their knowledge or capabilities, in order to improve their overall reliability.
- The researchers use Reinforcement Learning from Knowledge Feedback (RLKF) to train LLMs to recognize questions they cannot confidently answer and to refuse them rather than attempt an answer.
- The goal was to improve the overall reliability and trustworthiness of LLMs by reducing the instances of [hallucination](https://aimodels.fyi/papers/arxiv/large-language-models-hallucination-regard-to-known) or [self-generated incorrect responses](https://aimodels.fyi/papers/arxiv/self-incorrect-llms-struggle-refining-self-generated).

## Plain English Explanation

Large language models (LLMs) like GPT-3 have become incredibly capable at generating human-like text on a wide range of topics. However, they can also sometimes [produce responses that are inaccurate or don't actually reflect their true knowledge](https://aimodels.fyi/papers/arxiv/head-to-tail-how-knowledgeable-are-large). This can lead to users trusting the model's outputs even when they are mistaken or made up.

To address this, the researchers in this paper trained the LLMs to be more reliable by teaching them to recognize when they don't have enough information to confidently answer a question. Instead of guessing or [hallucinating](https://aimodels.fyi/papers/arxiv/large-language-models-hallucination-regard-to-known) an answer, the models were trained to simply refuse the question.

The key idea is that by being honest about the limits of their knowledge, the LLMs can avoid generating misleading or [incorrect responses](https://aimodels.fyi/papers/arxiv/self-incorrect-llms-struggle-refining-self-generated), which ultimately makes them more trustworthy and reliable for users. This can be especially important in applications like [summarization](https://aimodels.fyi/papers/arxiv/dont-believe-everything-you-read-enhancing-summarization) or [conversational systems](https://aimodels.fyi/papers/arxiv/relic-investigating-large-language-model-responses-using) where the model's outputs can have real-world impacts.

## Technical Explanation

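The researchers used a reinforcement learning (RL) approach, Reinforcement Learning from Knowledge Feedback (RLKF), to train the LLMs to refuse questions that were outside of their knowledge. According to the abstract, RLKF uses knowledge feedback to determine the model's knowledge boundary dynamically and trains a reward model that encourages refusing out-of-knowledge questions. During training, the model was presented with a mix of questions inside that boundary (in-domain) and beyond it (out-of-domain). For in-domain questions, the model was rewarded for providing a correct answer. For out-of-domain questions, the model was rewarded for refusing to answer.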
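
The reward scheme described above can be sketched as a small function. This is a toy illustration, not the paper's reward model: the `Outcome` enum, the `toy_reward` name, and the numeric values (+1, 0, -1) are all assumptions chosen for readability. In the paper, the signal comes from a trained reward model rather than a hand-written rule.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical labels for how the model responded to a question."""
    CORRECT = "correct"
    WRONG = "wrong"
    REFUSED = "refused"


def toy_reward(question_is_known: bool, outcome: Outcome) -> float:
    """Illustrative reward: favor correct answers on known questions
    and refusals on unknown ones. All values are assumptions."""
    if question_is_known:
        if outcome is Outcome.CORRECT:
            return 1.0   # answered a known question correctly
        if outcome is Outcome.REFUSED:
            return 0.0   # over-cautious refusal earns nothing
        return -1.0      # wrong answer on a known question
    # Question lies beyond the model's knowledge boundary.
    if outcome is Outcome.REFUSED:
        return 1.0       # refusing is the desired behavior
    return -1.0          # any attempted answer is treated as a hallucination risk


if __name__ == "__main__":
    print(toy_reward(True, Outcome.CORRECT))
    print(toy_reward(False, Outcome.REFUSED))
    print(toy_reward(False, Outcome.WRONG))
```

Giving zero rather than a penalty for refusing a known question is one possible design choice; a stricter scheme could penalize it to discourage the model from refusing everything.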

Over the course of training, the model learned to identify when a question was outside of its capabilities and to politely refuse rather than attempt a response. The researchers evaluated the trained models on a held-out test set; the experiments reported in the abstract use mathematical questions. They found that the models reliably refused out-of-domain questions while maintaining strong performance on in-domain questions.
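
One simple way to inspect the behavior described above is to tally how often a model answers correctly, answers wrongly, or refuses on a test set. The sketch below is a minimal example under stated assumptions: the `summarize` function, the string labels, and the three rates are hypothetical and are not the reliability metrics defined in the paper.

```python
from collections import Counter
from typing import Dict, Sequence


def summarize(outcomes: Sequence[str]) -> Dict[str, float]:
    """Compute per-outcome rates from labels 'correct', 'wrong', 'refused'.

    The labels and metric names are illustrative assumptions.
    """
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    counts = Counter(outcomes)
    total = len(outcomes)
    return {
        "accuracy": counts["correct"] / total,      # share answered correctly
        "error_rate": counts["wrong"] / total,      # share answered wrongly
        "refusal_rate": counts["refused"] / total,  # share refused
    }


if __name__ == "__main__":
    # Made-up labels purely for demonstration, not results from the paper.
    demo = ["correct", "refused", "wrong", "correct"]
    print(summarize(demo))
```

A model that refuses more will tend to lower its error rate, but it may also lower its accuracy, which is why the three rates are worth reading together.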

## Critical Analysis

The researchers acknowledge several limitations of their approach. First, the model was only trained to refuse questions, and did not learn to provide more helpful responses like directing the user to a more appropriate resource. Expanding the model's capabilities in this direction could make it more useful in real-world applications.

Additionally, the training process relied on having a clear delineation between in-domain and out-of-domain questions, which may not always be the case in practice. Further research would be needed to understand how these models perform when faced with more ambiguous or borderline cases.
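
To see why borderline cases are hard, consider a hypothetical labeling rule that samples several answers to the same question and marks the question as "known" when enough samples are correct. The `label_question` function and its `threshold` parameter are assumptions made for illustration; the paper's actual procedure for determining the knowledge boundary may differ.

```python
from typing import Sequence


def label_question(sample_correct: Sequence[bool], threshold: float = 0.5) -> str:
    """Label a question 'known' if the fraction of correct sampled
    answers meets the threshold, else 'unknown'. Hypothetical rule."""
    if not sample_correct:
        raise ValueError("need at least one sampled answer")
    correct_rate = sum(sample_correct) / len(sample_correct)
    return "known" if correct_rate >= threshold else "unknown"


if __name__ == "__main__":
    # A borderline question: 2 of 5 sampled answers are correct.
    samples = [True, True, False, False, False]
    print(label_question(samples, threshold=0.5))
    print(label_question(samples, threshold=0.3))
```

With the same samples, the label changes when the threshold moves from 0.5 to 0.3, which shows how sensitive a known/unknown split can be for questions the model answers correctly only some of the time.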

Overall, this work represents an important step towards building more reliable and trustworthy LLMs. By teaching models to be upfront about the limits of their knowledge, the researchers have demonstrated a promising approach for [enhancing the transparency and accountability](https://aimodels.fyi/papers/arxiv/relic-investigating-large-language-model-responses-using) of these increasingly influential AI systems.

## Conclusion

This paper presents a novel method for training large language models to reliably refuse questions that are outside of their knowledge or capabilities. By using reinforcement learning from knowledge feedback, the researchers were able to teach the models to identify when they were being asked something they couldn't confidently answer, and to respond by politely refusing rather than attempting to generate a potentially inaccurate or misleading response.

This approach has the potential to significantly improve the overall reliability and trustworthiness of LLMs, which is crucial as these models become more widely deployed in high-stakes applications. While further research is needed to address some of the limitations, this work represents an important step forward in building AI systems that are more transparent about their abilities and limitations.