LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations

    Published 10/29/2024 by Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, Yonatan Belinkov

    Overview

    • The paper explores the intrinsic representation of hallucinations in large language models (LLMs).
    • Hallucinations refer to the generation of plausible-sounding but factually incorrect text by LLMs.
    • The research aims to understand the internal mechanisms underlying these hallucinations and their potential implications.

    Original caption: Figure 1: Example for the input and LLM output from the TriviaQA dataset, and the names of the tokens that can be probed.

    | Method (AUC) | Mistral-7b-Instruct, TriviaQA | Mistral-7b-Instruct, Winobias | Mistral-7b-Instruct, Math | Llama 3-8b-Instruct, TriviaQA | Llama 3-8b-Instruct, Winobias | Llama 3-8b-Instruct, Math |
    |---|---|---|---|---|---|---|
    | Logits-mean | 0.60 ± 0.009 | 0.56 ± 0.017 | 0.55 ± 0.029 | 0.66 ± 0.005 | 0.60 ± 0.026 | 0.75 ± 0.018 |
    | Logits-mean-exact | 0.68 ± 0.007 | 0.54 ± 0.012 | 0.51 ± 0.005 | 0.71 ± 0.006 | 0.55 ± 0.019 | 0.80 ± 0.021 |
    | Logits-min | 0.63 ± 0.008 | 0.59 ± 0.012 | 0.51 ± 0.017 | 0.74 ± 0.007 | 0.61 ± 0.024 | 0.75 ± 0.016 |
    | Logits-min-exact | 0.75 ± 0.006 | 0.53 ± 0.013 | 0.71 ± 0.009 | 0.79 ± 0.006 | 0.61 ± 0.019 | 0.89 ± 0.018 |
    | p(True) | 0.66 ± 0.006 | 0.45 ± 0.021 | 0.48 ± 0.022 | 0.73 ± 0.008 | 0.59 ± 0.020 | 0.62 ± 0.017 |
    | p(True)-exact | 0.74 ± 0.003 | 0.40 ± 0.021 | 0.60 ± 0.025 | 0.73 ± 0.005 | 0.63 ± 0.014 | 0.59 ± 0.018 |
    | Probe @ token: | | | | | | |
    | Last generated [-1] | 0.71 ± 0.006 | 0.82 ± 0.004 | 0.74 ± 0.008 | 0.81 ± 0.005 | 0.86 ± 0.007 | 0.82 ± 0.016 |
    | Before last generated [-2] | 0.73 ± 0.004 | 0.85 ± 0.004 | 0.74 ± 0.007 | 0.75 ± 0.005 | 0.88 ± 0.005 | 0.79 ± 0.020 |
    | End of question | 0.76 ± 0.008 | 0.82 ± 0.011 | 0.72 ± 0.007 | 0.77 ± 0.007 | 0.80 ± 0.018 | 0.72 ± 0.023 |
    | Exact | 0.85 ± 0.004 | 0.92 ± 0.005 | 0.92 ± 0.008 | 0.83 ± 0.002 | 0.93 ± 0.004 | 0.95 ± 0.027 |

    Original caption: Table 1: Comparison of error detection techniques using AUC metric, across different models and datasets. The best-performing method is bolded. Using exact answer tokens is useful for many cases, especially probing.
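    To make the AUC numbers in Table 1 easier to read, here is a minimal sketch of how a confidence score can be turned into an error-detection AUC. It assumes that "Logits-mean" and "Logits-min" aggregate per-token log-probabilities by mean and minimum; that reading, the function names, and the toy data are illustrative assumptions, not the paper's code.

```python
# Sketch only: toy data and function names are hypothetical.

def logits_mean(token_logprobs):
    """Confidence score: mean log-probability over the generated tokens."""
    return sum(token_logprobs) / len(token_logprobs)

def logits_min(token_logprobs):
    """Confidence score: log-probability of the least likely generated token."""
    return min(token_logprobs)

def auc(scores, labels):
    """Area under the ROC curve, computed by pairwise comparison.

    labels: 1 = the answer was correct, 0 = the answer was an error.
    AUC is the probability that a random correct answer receives a higher
    score than a random incorrect one (ties count as half a win).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up (token log-probabilities, correctness label) pairs.
examples = [
    ([-0.1, -0.2, -0.1], 1),
    ([-0.3, -2.5, -0.2], 0),
    ([-0.4, -0.5, -0.6], 1),
    ([-0.2, -0.1, -3.0], 0),
    ([-0.2, -0.2, -0.9], 0),
    ([-0.6, -0.4, -0.2], 1),
]
labels = [y for _, y in examples]
mean_auc = auc([logits_mean(lp) for lp, _ in examples], labels)
min_auc = auc([logits_min(lp) for lp, _ in examples], labels)
print(f"Logits-mean AUC: {mean_auc:.2f}, Logits-min AUC: {min_auc:.2f}")
```

    On this toy data the minimum separates correct from incorrect answers perfectly (AUC 1.0), while the mean gets 8 of 9 pairs right (about 0.89). An AUC of 0.5 corresponds to chance, which is why values such as 0.45 or 0.51 in the table indicate a detector that carries essentially no signal.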

    Plain English Explanation

    Large language models (LLMs) like GPT-3 have become incredibly powerful at generating human-like text. However, they can sometimes produce information that seems convincing but is actually false or inaccurate - a phenomenon known as "hallucination".

    This paper delves into the inner workings of LLMs to try to understand why and how these hallucinations occur. The researchers found that the internal states of LLMs carry information about whether their own answers are truthful, yet the models often fail to act on it and output inaccurate text anyway. This suggests that LLMs "know more than they show": a hallucination can reflect a gap between what the model represents internally and what it actually generates.

    By understanding the mechanisms behind LLM hallucinations, the researchers hope to find ways to make these models more reliable and truthful in the future. This is an important step as LLMs become increasingly prevalent in applications like text generation, question answering, and decision support.

    Technical Explanation

    The paper investigates the intrinsic representation of hallucinations within large language models (LLMs). Hallucinations refer to the generation of plausible-sounding but factually incorrect text by LLMs.

    Through a series of experiments, the researchers found that the internal representations of LLMs encode information about the truthfulness of their outputs, but the models often fail to act on it and generate inaccurate text anyway. This suggests that LLM hallucinations are not simply the result of missing knowledge; they can also reflect a mismatch between what the model encodes internally and what it outputs.

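    Specifically, the researchers trained probing classifiers on the internal representations of LLMs (Mistral-7b-Instruct and Llama 3-8b-Instruct) as they answered questions from datasets with known ground truth, such as TriviaQA, Winobias, and Math, and used the probes to predict whether each output was correct. They found that truthfulness information is present in the models' internal states, even when the generated text is wrong, and that it is strongest at the exact answer tokens: in Table 1, the probe on those tokens beats the logit-based and p(True) baselines for every model and dataset shown.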
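    Analyses of internal representations like this one are commonly carried out with a probing classifier: a small model trained on hidden-state vectors to predict whether the corresponding answer was correct. The sketch below shows the idea with a logistic-regression probe. Everything in it is an assumption for illustration: the vectors are synthetic stand-ins rather than real LLM activations, and `make_synthetic`, `train_probe`, and `probe_score` are hypothetical helper names.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def probe_score(w, b, x):
    """Probe output: estimated probability that the answer is correct."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_probe(X, y, lr=0.1, epochs=200):
    """Fit a logistic-regression probe with batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, target in zip(X, y):
            err = probe_score(w, b, x) - target
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def make_synthetic(n, dim, rng):
    """Synthetic stand-in for hidden states at one token position.

    Assumption: vectors for correct answers (label 1) are shifted along a
    single direction relative to vectors for errors (label 0).
    """
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += 2.0 if label else -2.0
        X.append(x)
        y.append(label)
    return X, y

rng = random.Random(0)
X_train, y_train = make_synthetic(200, 8, rng)
X_test, y_test = make_synthetic(100, 8, rng)

w, b = train_probe(X_train, y_train)
preds = [1 if probe_score(w, b, x) >= 0.5 else 0 for x in X_test]
accuracy = sum(p == t for p, t in zip(preds, y_test)) / len(y_test)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

    In a real setting, each row of `X_train` would be the hidden state taken at a chosen layer and token position (for instance one of the positions named in Figure 1), and the label would come from comparing the generated answer with the ground truth. The probe is kept linear and small so that a high score reflects information already present in the representation rather than computation done by the probe itself.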

    The researchers also show that this internal signal can be put to use: when several answers are sampled for the same question, choosing among them with an error-detection probe can recover correct answers that the model does not produce by default. This suggests that there may be ways to mitigate the hallucination problem in LLMs by drawing directly on their internal representations.
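    A simple way to act on an internal truthfulness signal, sketched here under stated assumptions, is to sample several answers to the same question and keep the one that an error-detection probe scores as most likely correct. The candidate answers and probe scores below are made up for illustration, and `select_by_majority` and `select_by_probe` are hypothetical helper names.

```python
from collections import Counter

def select_by_majority(candidates):
    """Baseline: return the answer string that was sampled most often."""
    answers = [answer for answer, _ in candidates]
    return Counter(answers).most_common(1)[0][0]

def select_by_probe(candidates):
    """Return the answer with the highest probe score (estimated P(correct))."""
    return max(candidates, key=lambda c: c[1])[0]

# Hypothetical samples for "What is the capital of Australia?".
# Each pair is (sampled answer, made-up probe score for that sample).
candidates = [
    ("Sydney", 0.22),
    ("Canberra", 0.91),
    ("Sydney", 0.18),
    ("Sydney", 0.25),
    ("Canberra", 0.87),
]

print("majority vote:", select_by_majority(candidates))
print("probe choice: ", select_by_probe(candidates))
```

    The example is constructed so that the model produces the wrong answer more often than the right one, which is the situation where frequency-based selection fails and an internal signal has the most to offer. Whether a probe is reliable enough for this in practice depends on how well it was trained for the task at hand.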

    Critical Analysis

    The paper provides important insights into the nature of hallucinations in large language models. By demonstrating that LLMs have the capacity to represent truthful information, the researchers challenge the assumption that hallucinations are simply the result of missing knowledge or training data.

    However, the paper does not fully explain the underlying mechanisms that lead LLMs to prioritize inaccurate information over truthful information during text generation. Additional research is needed to understand the specific factors and biases that contribute to this phenomenon.

    Furthermore, while using the internal states of LLMs to reduce hallucinations is promising, the practicality and scalability of such approaches remain to be seen. More work is needed to develop robust and generalizable techniques for ensuring the truthfulness of LLM outputs.

    Conclusion

    This paper offers a thought-provoking perspective on the nature of hallucinations in large language models. By revealing that LLMs possess the intrinsic capacity to represent truthful information, the researchers challenge the assumption that hallucinations are simply the result of missing knowledge.

    The findings suggest that the hallucination problem may be a more fundamental aspect of how LLMs operate, with important implications for the development of reliable and trustworthy AI systems. While further research is needed to fully understand and address this issue, this paper represents a valuable contribution to the ongoing efforts to improve the safety and robustness of large language models.

    Full paper

    Read original: arXiv:2410.02707



    This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
