Vectara

Models by this creator

hallucination_evaluation_model

vectara

The hallucination_evaluation_model is an open-source model created by Vectara to detect hallucinations in large language models (LLMs). It is particularly useful when building retrieval-augmented generation (RAG) applications, where an LLM summarizes a set of facts, but it can also be used in other contexts. The model is based on the SentenceTransformers Cross-Encoder class and is trained on datasets such as FEVER, Vitamin C, and PAWS to determine textual entailment and factual consistency.

Model inputs and outputs

Inputs

Text data, such as summaries or other outputs from large language models

Outputs

A probability score from 0 to 1, where 0 indicates a hallucination and 1 indicates a factually consistent output

Capabilities

The hallucination_evaluation_model assesses the factual consistency of text generated by large language models. This is particularly useful for applications like retrieval-augmented generation, where generated text must stay faithful to its source information. The model has been evaluated on several benchmarks, including the TRUE Dataset, SummaC, and the AnyScale Ranking Test for Hallucinations, achieving strong performance.

What can I use it for?

You can use the model to score the outputs of an LLM summarization or RAG pipeline and flag responses that drift from their source documents. If you are interested in learning more about RAG or experimenting with Vectara, you can sign up for a free Vectara account.

Things to try

One interesting experiment is to use the hallucination_evaluation_model to compare the factual consistency of outputs from different large language models. This could help identify models that are better at maintaining fidelity to source information, which is useful for a variety of applications.
Additionally, you could experiment with using the model in the context of a retrieval-augmented generation system, to see how it performs in that specific use case.
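Since the model is exposed through the SentenceTransformers Cross-Encoder interface, scoring (source, summary) pairs might look like the sketch below. The Hugging Face model id and the 0.5 decision threshold are assumptions, not confirmed by this page; check Vectara's model card for the exact checkpoint name.

```python
def classify(score: float, threshold: float = 0.5) -> str:
    """Map a factual-consistency score in [0, 1] to a label.

    Scores near 1 mean the text is supported by its source; scores
    near 0 suggest a hallucination. The 0.5 cutoff is an assumption;
    tune it on your own data.
    """
    return "consistent" if score >= threshold else "hallucination"


def score_pairs(pairs):
    """Score (source, summary) pairs with the Cross-Encoder model.

    Requires the sentence-transformers package and downloads the
    (assumed) Hugging Face checkpoint on first use.
    """
    from sentence_transformers import CrossEncoder

    model = CrossEncoder("vectara/hallucination_evaluation_model")
    return model.predict(pairs)  # one score in [0, 1] per pair
```

Usage might look like `labels = [classify(s) for s in score_pairs([(source_doc, llm_summary)])]`, letting you flag individual RAG responses that fall below the chosen threshold.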

Updated 5/19/2024