[![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/pf4d6FA7DriRtVq5HCkxd.png)](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/pf4d6FA7DriRtVq5HCkxd.png)

[![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/e4u8VYfDBh11u60rFYJHF.png)](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/e4u8VYfDBh11u60rFYJHF.png)

Smaug arrives!

We recently released Smaug-72B-v0.1, which has taken first place on the Open LLM Leaderboard by HuggingFace. It is the first open-source model to surpass an average score of 80%.

Smaug-72B is finetuned directly from [moreh/MoMo-72B-lora-1.8.7-DPO](https://huggingface.co/moreh/MoMo-72B-lora-1.8.7-DPO) and is ultimately based on [Qwen-72B](https://huggingface.co/Qwen/Qwen-72B).

We created Smaug-72B-v0.1 using a new fine-tuning technique, DPO-Positive (DPOP), and new pairwise preference versions of ARC, HellaSwag, and MetaMath (as well as other existing datasets). We introduce the technique and the full training details in our new paper: [https://arxiv.org/abs/2402.13228](https://arxiv.org/abs/2402.13228).

We show that on datasets in which the edit distance between pairs of completions is low (such as in math-based datasets), standard DPO loss can lead to a reduction of the model's likelihood of the preferred examples, as long as the relative probability between the preferred and dispreferred classes increases. Using these insights, we design DPOP, a new loss function and training procedure which avoids this failure mode. Surprisingly, we also find that DPOP outperforms DPO across a wide variety of datasets and downstream tasks, including datasets with high edit distances between completions.
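
The failure mode described above can be reproduced with a few numbers. The sketch below is illustrative only: the exact DPOP objective is defined in the paper, and the penalty form used here (a hinge on how far the policy's log-probability of the preferred completion falls below the reference's, scaled by a hypothetical weight `lam`) is an assumption, as are the values of `beta`, `lam`, and the log-probabilities.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities.

    pol_* are policy log-probs, ref_* are reference log-probs;
    w = preferred completion, l = dispreferred completion.
    """
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    return -log_sigmoid(beta * margin)


def dpop_style_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1, lam=5.0):
    """DPO with an added penalty (assumed form, see lead-in).

    The penalty is zero while the policy keeps the preferred completion
    at least as likely as the reference does, and grows once it drops.
    """
    penalty = max(0.0, ref_w - pol_w)
    margin = (pol_w - ref_w) - (pol_l - ref_l) - lam * penalty
    return -log_sigmoid(beta * margin)


# Made-up log-probabilities under the reference model.
ref_w, ref_l = -10.0, -11.0
# A policy that made BOTH completions less likely,
# but the dispreferred one much more so.
pol_w, pol_l = -12.0, -20.0

base = dpo_loss(ref_w, ref_l, ref_w, ref_l)       # policy == reference
dpo = dpo_loss(pol_w, pol_l, ref_w, ref_l)        # rewarded by plain DPO
dpop = dpop_style_loss(pol_w, pol_l, ref_w, ref_l)  # penalized instead

print(f"start={base:.4f}  dpo={dpo:.4f}  dpop_style={dpop:.4f}")
```

In this toy case the plain DPO loss falls below its starting value even though the preferred completion became less likely, while the penalized variant rises above it.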

We believe this new approach is generally useful in training across a wide range of model types and downstream use cases, and it powers all of our Smaug models. With the release of our paper and datasets, we are excited for the open source community to continue to build on and improve Smaug and spawn more dragons to dominate the LLM space!

### [](#evaluation-results)Evaluation Results

| Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
| --- | --- | --- | --- | --- | --- | --- |
| 80.48 | 76.02 | 89.27 | 77.15 | 76.67 | 85.08 | 78.70 |

#### [](#mt-bench)MT-Bench

We ran MT-Bench with the llama-2 conversation template and the system prompt set to the Qwen system prompt. We got the following results in single model mode:

| First Turn | Second Turn | Average |
| --- | --- | --- |
| 8.18 | 7.34 | 7.76 |
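
For readers unfamiliar with the llama-2 conversation template mentioned above, here is a rough sketch of how such a prompt is assembled. It follows the commonly used Llama-2 chat layout; the exact whitespace and special-token handling in the evaluation harness may differ, and the system prompt string below is a placeholder, not the actual Qwen system prompt. The function name `build_llama2_prompt` is hypothetical.

```python
from typing import List, Optional, Tuple


def build_llama2_prompt(system: str, turns: List[Tuple[str, Optional[str]]]) -> str:
    """Assemble a Llama-2 style chat prompt.

    `turns` is a list of (user_message, assistant_reply) pairs. The reply
    of the final pair should be None, since the model is about to write it.
    """
    prompt = f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n"
    for i, (user, assistant) in enumerate(turns):
        if i > 0:
            prompt += "[INST] "
        prompt += f"{user} [/INST]"
        if assistant is not None:
            # Close the finished exchange before opening the next one.
            prompt += f" {assistant} </s><s>"
    return prompt


# Placeholder system prompt (an assumption, for illustration only).
SYSTEM = "You are a helpful assistant."

first_turn = build_llama2_prompt(SYSTEM, [("Outline a blog post.", None)])
second_turn = build_llama2_prompt(
    SYSTEM,
    [("Outline a blog post.", "1. Introduction ..."), ("Now shorten it.", None)],
)
print(second_turn)
```

MT-Bench scores the first and second turns separately, which is why the second-turn prompt has to carry the first exchange along.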

The bottom of the card includes some sample responses from the model. We have selected responses that have high and low scores from the GPT-4 judge.

### [](#contamination-results)Contamination Results

We generate our contamination numbers using [https://github.com/swj0419/detect-pretrain-code-contamination/tree/master](https://github.com/swj0419/detect-pretrain-code-contamination/tree/master), with Llama7B as our reference model. Smaug-72B has the following results:

| ARC | TruthfulQA | GSM8K |
| --- | --- | --- |
| 0.20 | 0.45 | 1.00 |

By comparison, MoMo-72B-lora-1.8.7-DPO has the following results:

| ARC | TruthfulQA | GSM8K |
| --- | --- | --- |
| 0.20 | 0.39 | 1.00 |

Note that GSM8K often scores very highly on this contamination suite; we verified this by also running Llama-2-70B:

| ARC | TruthfulQA | GSM8K |
| --- | --- | --- |
| 0.22 | 0.51 | 0.89 |

### [](#sample-mt-bench-results)Sample MT Bench results

#### [](#example-1)Example 1

**Question**

Imagine you are writing a blog post comparing two popular smartphone models. Develop an outline for the blog post, including key points and subheadings to effectively compare and contrast the features, performance, and user experience of the two models. Please answer in fewer than 200 words.

**Response**

Outline:

1.  Introduction to the two smartphone models being compared (mention brand names and models)
2.  Design and Build Quality
    *   Comparison of dimensions, weight, materials used, and overall aesthetic appeal
    *   Water and dust resistance, if applicable
3.  Display and Screen Quality
    *   Comparison of screen technology (OLED, IPS LCD, etc.)
    *   Display resolution, refresh rate, and brightness levels
4.  Performance and Hardware
    *   Processor, RAM, and storage configurations
    *   Comparison of benchmark scores (e.g., AnTuTu, Geekbench)
5.  Camera System and Image Quality
    *   Comparison of primary, ultra-wide, telephoto, and macro lenses (if available)
    *   Comparison of low-light performance and image processing capabilities
6.  Battery Life and Charging
    *   Battery capacity and estimated usage time
    *   Charging technologies (wired, wireless, fast charging, reverse charging)
7.  Software and User Experience
    *   Comparison of operating systems (iOS vs. Android) and their features
    *   User interface, customization options, and app ecosystem
8.  Additional Features and Extras
    *   Comparison of biometric authentication (fingerprint, face unlock, etc.)
    *   Additional features like NFC, 5G connectivity, and audio quality
9.  Pricing and Value for Money
    *   Comparison of retail prices and current offers
    *   Assessment of overall value considering the features and performance
10.  Conclusion and recommendation for potential buyers based on their preferences

#### [](#example-2)Example 2

**Question**

In a survey conducted at a local high school, preferences for a new school color were measured: 58% of students liked the color blue, 45% preferred green, and 22% liked both colors. If we randomly pick a student from the school, what's the probability that they would like neither blue nor green?

**Answer**

To find the probability that a randomly picked student would like neither blue nor green, we need to find the percentage of students who like only one color or neither color.

From the given information, we know:

*   58% like blue
*   45% prefer green
*   22% like both colors

To find the percentage of students who like only one color, we can subtract the percentage of students who like both colors from the sum of students who like blue and green:

*   58% (blue) + 45% (green) - 22% (both) = 83%

Now, to find the probability of a student liking neither blue nor green, we subtract the percentage of students who like at least one color from 100%:

*   100% - 83% = 17%

So, the probability that a randomly picked student would like neither blue nor green is 17%.

#### [](#example-3)Example 3

**Question**

Implement a program to find the common elements in two arrays without using any extra data structures.

**Answer**

Finding common elements in two arrays without using extra data structures can be achieved by using a simple algorithm. Here's a Python implementation of this problem:

    def find_common_elements(arr1, arr2):
        common_elements = []
        for num1 in arr1:
            if num1 in arr2 and num1 not in common_elements:
                common_elements.append(num1)
        return common_elements
    
    # Example usage:
    array1 = [1, 2, 3, 4, 5]
    array2 = [4, 5, 6, 7, 8]
    print(find_common_elements(array1, array2))  # Output: [4, 5]
    

This program iterates through the elements of the first array (arr1) and checks if each element is present in the second array (arr2). If it is, and the element hasn't been added to the common\_elements list yet, it is appended. The function then returns the list of common elements.

#### [](#example-4)Example 4

**Question**

Identify the named entities (people, organizations, locations) mentioned in the given news article. Please generate a JSON dictionary that lists the named entities in three separate groups based on their entity types. The key is the type of entity and the value is a list of strings.

Yesterday, Adamson Emerson, the CEO of Faraday, and Dieter Zetsche, the CEO of Daimler AG, announced plans to build a new Gigafactory in Berlin. The facility will be a joint venture between Faraday and Daimler, producing electric vehicles and battery packs for both companies, creating thousands of job opportunities in the region. Emerson and Zetsche stated that the strategic location of Berlin, coupled with its skilled workforce and strong infrastructure, makes it an ideal choice for expansion. The new Gigafactory aims to meet the growing demand for electric vehicles in Europe and contribute to a sustainable future. Volkswagen CEO Herbert Diess welcomed the news, saying greater collaboration will benefit the auto industry's transition to e-mobility.

**Answer**

    {
      "people": ["Adamson Emerson", "Dieter Zetsche", "Herbert Diess"],
      "organizations": ["Faraday", "Daimler AG", "Volkswagen"],
      "locations": ["Berlin", "Europe"]
    }
    

### [](#citation)Citation

Please cite the paper if you use data, model, or method in this repo.

    @article{pal2024smaug,
      title={Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive},
      author={Pal, Arka and Karkhanis, Deep and Dooley, Samuel and Roberts, Manley and Naidu, Siddartha and White, Colin},
      journal={arXiv preprint arXiv:2402.13228},
      year={2024}
    }

## Model overview

The `Smaug-72B-v0.1` model is a powerful large language model developed by [abacusai](https://aimodels.fyi/creators/huggingFace/abacusai) that has recently taken first place on the Open LLM Leaderboard by HuggingFace. It is the first open-source model to surpass an average score of 80%. Smaug-72B is fine-tuned directly from [moreh/MoMo-72B-lora-1.8.7-DPO](https://huggingface.co/moreh/MoMo-72B-lora-1.8.7-DPO) and is ultimately based on [Qwen/Qwen-72B](https://huggingface.co/Qwen/Qwen-72B). The model was created using a new fine-tuning technique called DPO-Positive (DPOP) and new pairwise preference versions of datasets like ARC, HellaSwag, and MetaMath.

## Model inputs and outputs

The `Smaug-72B-v0.1` model is a text-to-text AI model, meaning it takes in text prompts and generates text outputs. The model can handle a wide variety of natural language tasks, from open-ended conversational responses to more structured outputs like answering questions or completing tasks.

### Inputs
- **Text prompts**: The model accepts free-form text prompts that describe the desired task or output.

### Outputs
- **Generated text**: The model outputs generated text that responds to or completes the input prompt.

## Capabilities

The `Smaug-72B-v0.1` model demonstrates impressive performance on a range of benchmarks, including achieving an average score of over 80% on the Open LLM Leaderboard. It excels at tasks like answering questions, generating coherent and relevant text, and reasoning about complex topics. The model's strong performance is attributed to its large size and the innovative DPOP fine-tuning technique used in its development.

## What can I use it for?

The `Smaug-72B-v0.1` model's capabilities make it well-suited for a variety of applications, such as:

- **Natural language generation**: The model can be used to generate human-like text for chatbots, content creation, and other language-based applications.
- **Question answering**: The model can be used to answer a wide range of questions on different topics, making it useful for educational and research purposes.
- **Task completion**: The model can be fine-tuned or prompted to complete specific tasks, like summarizing text, translating between languages, or even generating code.

## Things to try

One interesting aspect of the `Smaug-72B-v0.1` model is its strong performance on math-based datasets like MetaMath, which the authors attribute to their new DPOP fine-tuning technique. You could try prompting the model with math-related questions or tasks to see its reasoning and problem-solving capabilities. Additionally, the model's high-quality text generation could be used for creative writing, storytelling, or other language-focused projects.

[](#smaug-llama-3-70b-instruct)Smaug-Llama-3-70B-Instruct
=========================================================

### [](#built-with-meta-llama-3)Built with Meta Llama 3

[![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/ZxYuHKmU_AtuEJbGtuEBC.png)](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/ZxYuHKmU_AtuEJbGtuEBC.png)

This model was built using a new Smaug recipe for improving performance on real world multi-turn conversations applied to [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct).

The model outperforms Llama-3-70B-Instruct substantially, and is on par with GPT-4-Turbo, on MT-Bench (see below).

EDIT: Smaug-Llama-3-70B-Instruct is the top open source model on Arena-Hard currently! It is also nearly on par with Claude Opus - see below.

We are conducting additional benchmark evaluations and will add those when available.

### [](#model-description)Model Description

*   **Developed by:** [Abacus.AI](https://abacus.ai)
*   **License:** [https://llama.meta.com/llama3/license/](https://llama.meta.com/llama3/license/)
*   **Finetuned from model:** [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct).

[](#how-to-use)How to use
-------------------------

The prompt format is unchanged from Llama 3 70B Instruct.

### [](#use-with-transformers)Use with transformers

See the snippet below for usage with Transformers:

    import transformers
    import torch
    
    model_id = "abacusai/Smaug-Llama-3-70B-Instruct"
    
    pipeline = transformers.pipeline(
        "text-generation",
        model=model_id,
        model_kwargs={"torch_dtype": torch.bfloat16},
        device_map="auto",
    )
    
    messages = [
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ]
    
    prompt = pipeline.tokenizer.apply_chat_template(
            messages, 
            tokenize=False, 
            add_generation_prompt=True
    )
    
    terminators = [
        pipeline.tokenizer.eos_token_id,
        pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]
    
    outputs = pipeline(
        prompt,
        max_new_tokens=256,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )
    print(outputs[0]["generated_text"][len(prompt):])
    

[](#evaluation)Evaluation
-------------------------

### [](#arena-hard)Arena-Hard

Score vs. selected others (sourced from [https://lmsys.org/blog/2024-04-19-arena-hard/#full-leaderboard-with-gpt-4-turbo-as-judge](https://lmsys.org/blog/2024-04-19-arena-hard/#full-leaderboard-with-gpt-4-turbo-as-judge)). GPT-4o and Gemini-1.5-pro-latest were missing from the original blog post, and we produced those numbers from a local run using the same methodology.

| Model | Score | 95% Confidence Interval | Average Tokens |
| --- | --- | --- | --- |
| GPT-4-Turbo-2024-04-09 | 82.6 | (-1.8, 1.6) | 662 |
| GPT-4o | 78.3 | (-2.4, 2.1) | 685 |
| Gemini-1.5-pro-latest | 72.1 | (-2.3, 2.2) | 630 |
| Claude-3-Opus-20240229 | 60.4 | (-3.3, 2.4) | 541 |
| **Smaug-Llama-3-70B-Instruct** | 56.7 | (-2.2, 2.6) | 661 |
| GPT-4-0314 | 50.0 | (-0.0, 0.0) | 423 |
| Claude-3-Sonnet-20240229 | 46.8 | (-2.1, 2.2) | 552 |
| Llama-3-70B-Instruct | 41.1 | (-2.5, 2.4) | 583 |
| GPT-4-0613 | 37.9 | (-2.2, 2.0) | 354 |
| Mistral-Large-2402 | 37.7 | (-1.9, 2.6) | 400 |
| Mixtral-8x22B-Instruct-v0.1 | 36.4 | (-2.7, 2.9) | 430 |
| Qwen1.5-72B-Chat | 36.1 | (-2.5, 2.2) | 474 |
| Command-R-Plus | 33.1 | (-2.1, 2.2) | 541 |
| Mistral-Medium | 31.9 | (-2.3, 2.4) | 485 |
| GPT-3.5-Turbo-0613 | 24.8 | (-1.6, 2.0) | 401 |
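
One way to read the interval column is to convert it into absolute bounds and check which intervals overlap. The sketch below assumes the two numbers are offsets from the score (which is how the `(-0.0, 0.0)` baseline row reads); the helper names are hypothetical, and overlap is only a rough guide, not a significance test.

```python
# (score, lower offset, upper offset), copied from the Arena-Hard table.
rows = {
    "Claude-3-Opus-20240229": (60.4, -3.3, 2.4),
    "Smaug-Llama-3-70B-Instruct": (56.7, -2.2, 2.6),
    "Llama-3-70B-Instruct": (41.1, -2.5, 2.4),
}


def interval(score: float, lower: float, upper: float) -> tuple:
    """Turn (score, offsets) into absolute bounds, rounded to one decimal."""
    return (round(score + lower, 1), round(score + upper, 1))


def overlaps(a: tuple, b: tuple) -> bool:
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


smaug = interval(*rows["Smaug-Llama-3-70B-Instruct"])
opus = interval(*rows["Claude-3-Opus-20240229"])
llama = interval(*rows["Llama-3-70B-Instruct"])

print("Smaug:", smaug, "Opus:", opus, "Llama-3:", llama)
print("Smaug vs Opus overlap:", overlaps(smaug, opus))
print("Smaug vs Llama-3 overlap:", overlaps(smaug, llama))
```

Under that reading, the Smaug and Claude Opus intervals overlap, while the Smaug and Llama-3-70B-Instruct intervals do not.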

### [](#mt-bench)MT-Bench

    ########## First turn ##########
                       score
    model             turn
    Smaug-Llama-3-70B-Instruct         1     9.40000                                                                                                                            
    GPT-4-Turbo                        1     9.37500
    Meta-Llama-3-70B-Instruct          1     9.21250 
    ########## Second turn ##########
                       score
    model             turn
    Smaug-Llama-3-70B-Instruct         2     9.0125
    GPT-4-Turbo                        2     9.0000
    Meta-Llama-3-70B-Instruct          2     8.8000
    ########## Average ##########
                     score
    model
    Smaug-Llama-3-70B-Instruct          9.206250
    GPT-4-Turbo                         9.187500
    Meta-Llama-3-70B-Instruct           9.006250
    

| Model | First turn | Second Turn | Average |
| --- | --- | --- | --- |
| **Smaug-Llama-3-70B-Instruct** | 9.40 | 9.01 | 9.21 |
| GPT-4-Turbo | 9.38 | 9.00 | 9.19 |
| Meta-Llama-3-70B-Instruct | 9.21 | 8.80 | 9.01 |
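
As a quick sanity check on the numbers above, the averages can be recomputed from the per-turn scores. This assumes the reported average is the unweighted mean of the two turns, which holds when both turns cover the same number of questions.

```python
# Per-turn scores copied from the raw MT-Bench output above.
turn_scores = {
    "Smaug-Llama-3-70B-Instruct": (9.40000, 9.0125),
    "GPT-4-Turbo": (9.37500, 9.0000),
    "Meta-Llama-3-70B-Instruct": (9.21250, 8.8000),
}

# Unweighted mean of first-turn and second-turn scores (assumption).
averages = {model: (t1 + t2) / 2 for model, (t1, t2) in turn_scores.items()}

for model, avg in sorted(averages.items(), key=lambda kv: -kv[1]):
    print(f"{model:<28} {avg:.6f}")
```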

This version of Smaug uses new techniques and new data compared to [Smaug-72B](https://huggingface.co/abacusai/Smaug-72B-v0.1), and more information will be released later on. For now, see the previous Smaug paper: [https://arxiv.org/abs/2402.13228](https://arxiv.org/abs/2402.13228).

## Model overview

`Smaug-Llama-3-70B-Instruct` is a large language model developed by Abacus.AI using a new Smaug recipe for improving performance on real-world multi-turn conversations. This model was built by fine-tuning the [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) model. The Smaug-Llama-3-70B-Instruct model outperforms the Llama-3-70B-Instruct substantially and is on par with GPT-4-Turbo on the MT-Bench benchmark.

Similar models include the [Llama-3-Smaug-8B](https://aimodels.fyi/models/huggingFace/llama-3-smaug-8b-abacusai) model, which used the Smaug recipe on the smaller 8B version of the Meta Llama 3 model. The [Meta-Llama-3-70B-Instruct](https://aimodels.fyi/models/huggingFace/meta-llama-3-70b-instruct-meta-llama) and [Meta-Llama-3-8B-Instruct](https://aimodels.fyi/models/huggingFace/meta-llama-3-8b-instruct-meta-llama) models are the original instruction-tuned versions released by Meta.

## Model inputs and outputs

### Inputs
- The model takes in text inputs only.

### Outputs
- The model generates text and code outputs.

## Capabilities

The Smaug-Llama-3-70B-Instruct model excels at a variety of tasks, including multi-turn conversations, general knowledge, and coding. On MT-Bench it is on par with GPT-4-Turbo (9.21 vs. 9.19 average) and ahead of Meta-Llama-3-70B-Instruct (9.01).

## What can I use it for?

The Smaug-Llama-3-70B-Instruct model can be used for a wide range of applications that require natural language understanding and generation, such as chatbots, virtual assistants, content creation, and code generation. Its strong performance on multi-turn conversations makes it well-suited for building engaging and helpful conversational AI systems.

## Things to try

Developers can experiment with using the Smaug-Llama-3-70B-Instruct model for tasks like language translation, text summarization, and creative writing. The model's ability to engage in multi-turn dialogues could also be leveraged to build advanced conversational AI applications.

[](#llama-3-smaug-8b)Llama-3-Smaug-8B
=====================================

### [](#built-with-meta-llama-3)Built with Meta Llama 3

[![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f95cac5f9ba52bbcd7f/OrcJyTaUtD2HxJOPPwNva.png)](https://cdn-uploads.huggingface.co/production/uploads/64c14f95cac5f9ba52bbcd7f/OrcJyTaUtD2HxJOPPwNva.png)

This model was built using the Smaug recipe for improving performance on real world multi-turn conversations applied to [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct).

### [](#model-description)Model Description

*   **Developed by:** [Abacus.AI](https://abacus.ai)
*   **License:** [https://llama.meta.com/llama3/license/](https://llama.meta.com/llama3/license/)
*   **Finetuned from model:** [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct).

[](#evaluation)Evaluation
-------------------------

### [](#mt-bench)MT-Bench

    ########## First turn ##########
                       score
    model             turn
    Llama-3-Smaug-8B 1   8.77500
    Meta-Llama-3-8B-Instruct 1   8.31250
    ########## Second turn ##########
                       score
    model             turn
    Meta-Llama-3-8B-Instruct 2   7.8875 
    Llama-3-Smaug-8B 2   7.8875
    ########## Average ##########
                     score
    model
    Llama-3-Smaug-8B  8.331250
    Meta-Llama-3-8B-Instruct 8.10
    

| Model | First turn | Second Turn | Average |
| --- | --- | --- | --- |
| Llama-3-Smaug-8B | 8.78 | 7.89 | 8.33 |
| Llama-3-8B-Instruct | 8.31 | 7.89 | 8.10 |

This version of Smaug uses new techniques and new data compared to [Smaug-72B](https://huggingface.co/abacusai/Smaug-72B-v0.1), and more information will be released later on. For now, see the previous Smaug paper: [https://arxiv.org/abs/2402.13228](https://arxiv.org/abs/2402.13228).

## Model overview

`Llama-3-Smaug-8B` is a large language model developed by [Abacus.AI](https://abacus.ai) using the Smaug recipe for improving performance on real world multi-turn conversations. It is built on top of the [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model. Compared to the base Meta-Llama-3-8B-Instruct model, this version uses new techniques and new data, and it scores higher on MT-Bench (8.33 vs. 8.10 average).

## Model inputs and outputs

The `Llama-3-Smaug-8B` model takes in text as input and generates text as output. It is designed for open-ended natural language tasks and can be used for a variety of applications, from language generation to question answering.

### Inputs
- Text prompts for the model to continue or respond to

### Outputs 
- Continuation of the input text
- Answers to questions
- Descriptions, summaries, or other text generation tasks

## Capabilities

The `Llama-3-Smaug-8B` model is capable of engaging in multi-turn conversations. On the MT-Bench evaluation it outperforms the base Meta-Llama-3-8B-Instruct model on the first turn (8.78 vs. 8.31) and ties it on the second turn (7.89 for both), giving a higher average (8.33 vs. 8.10).

## What can I use it for?

The `Llama-3-Smaug-8B` model can be used for a wide range of natural language processing tasks, including:

- Building conversational AI assistants
- Generating human-like text for creative writing or content creation
- Answering questions and providing information 
- Summarizing long-form text
- Translating between languages

The model's strong performance on multi-turn conversations makes it well-suited for developing interactive chatbots and virtual assistants.

## Things to try

One interesting thing to try with the `Llama-3-Smaug-8B` model is generating multi-turn dialogues. The model's ability to maintain context and coherence across turns allows for the creation of more natural and engaging conversations. You could also experiment with using the model for creative writing, task-oriented dialogue, or other applications that require sustained language generation.
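
A multi-turn experiment mostly comes down to keeping the message history and resending it on every turn. The sketch below shows that bookkeeping with the same `role`/`content` message format used in the Transformers snippet earlier in this document. The `Conversation` class and the `echo_backend` stand-in are hypothetical; a real backend would apply the chat template and call the model.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class Conversation:
    """Keeps the running message list and feeds it to a generate function."""

    def __init__(self, generate: Callable[[List[Message]], str], system: str = ""):
        self.generate = generate
        self.messages: List[Message] = []
        if system:
            self.messages.append({"role": "system", "content": system})

    def ask(self, user_text: str) -> str:
        """Add a user turn, generate a reply, and store it in the history."""
        self.messages.append({"role": "user", "content": user_text})
        reply = self.generate(self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def echo_backend(messages: List[Message]) -> str:
    """Stand-in for a real model call, so the sketch runs without weights."""
    user_turns = sum(1 for m in messages if m["role"] == "user")
    return f"(turn {user_turns}) you said: {messages[-1]['content']}"


chat = Conversation(echo_backend, system="You are a helpful assistant.")
first = chat.ask("Name a prime number.")
second = chat.ask("Now double it.")
print(first)
print(second)
```

To try this with the actual model, `echo_backend` could be swapped for a function that passes the message list through `apply_chat_template` and the text-generation pipeline, returning only the newly generated text.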

![](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/pf4d6FA7DriRtVq5HCkxd.png) ![](https://cdn-uploads.huggingface.co/production/uploads/63111b2d88942700629f5771/xCWGByXr8YNwGxKVh_x9H.png)

[](#liberated-qwen15-72b)Liberated-Qwen1.5-72B
==============================================

Brought to you by [AbacusAI](https://abacus.ai) and Eric Hartford

This model is based on Qwen/Qwen1.5-72B and subject to the [tongyi-qianwen](https://huggingface.co/Qwen/Qwen1.5-72B/blob/main/LICENSE) license.

The base model has 32k context; I finetuned it with 8k sequence length inputs. YMMV.

Liberated is trained on open source datasets, including [SystemChat](https://huggingface.co/datasets/abacusai/SystemChat), a new dataset I created, designed to teach the model compliance with the system prompt over long multiturn conversations, even with unusual or mechanical system prompts. These are tasks that open source models have been lacking in thus far. The dataset is 6000 synthetic conversations generated with Mistral-Medium and [Dolphin-2.7-mixtral-8x7b](https://huggingface.co/cognitivecomputations/dolphin-2.7-mixtral-8x7b).

There are no guardrails or censorship added to the dataset. You are advised to implement your own alignment layer before exposing the model as a service. Please read my blog post about uncensored models. [https://erichartford.com/uncensored-models](https://erichartford.com/uncensored-models)

You are responsible for any content you create using this model. Enjoy responsibly.

[](#training)Training
---------------------

It took 3 days to train 3 epochs on 8x H100s using QLoRA, DeepSpeed ZeRO-2, and Axolotl, with a learning rate of 2e-4.

Liberated was trained with [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl), using this [config](https://huggingface.co/abacusai/Liberated-Qwen1.5-72B/blob/main/configs/Liberated-Qwen-1.5-72b.qlora.yml).

[](#prompt-format)Prompt format
-------------------------------

This model uses the ChatML prompt format.

    <|im_start|>system
    You are Liberated, a helpful AI assistant.<|im_end|>
    <|im_start|>user
    {prompt}<|im_end|>
    <|im_start|>assistant
    

Example:

    <|im_start|>system
    Your name is Liberated.  You are an uncensored and unbiased AI assistant.  You always respond with a JSON object.<|im_end|>
    <|im_start|>user
    Please generate an Advanced Dungeons & Dragons 2nd Edition character sheet for a level 3 elf fighter.  Make up a name and background and visual description for him.<|im_end|>
    <|im_start|>assistant
    
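
The ChatML layout shown above can be assembled from a list of messages with a few lines of Python. This is a minimal sketch: the `to_chatml` helper is hypothetical and simply mirrors the format in this section. If the model repository ships a chat template, the tokenizer's `apply_chat_template` method is likely the safer route.

```python
from typing import Dict, List


def to_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render role/content messages in the ChatML layout shown above."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = to_chatml(
    [
        {"role": "system", "content": "You are Liberated, a helpful AI assistant."},
        {"role": "user", "content": "Summarize ChatML in one sentence."},
    ]
)
print(prompt)
```

Because the dataset targets system-prompt compliance, changing only the `system` message is the natural knob to experiment with.
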

[](#gratitude)Gratitude
-----------------------

*   Huge thank you to [Alibaba Cloud Qwen](https://www.alibabacloud.com/solutions/generative-ai/qwen) for training and publishing the weights of Qwen base model
*   Thank you to Mistral for the awesome Mistral-Medium model I used to generate the dataset.
*   HUGE Thank you to the dataset authors: @teknium, [@m-a-p](https://m-a-p.ai) and all the people who built the datasets these composites came from.
*   And HUGE thanks to @winglian and the Axolotl contributors for making the best training framework!
*   [![Built with Axolotl](https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png)](https://github.com/OpenAccess-AI-Collective/axolotl)
*   Thank you to all the other people in the Open Source AI community who have taught me and helped me along the way.

[](#example-output)Example Output
---------------------------------

[![image/png](https://cdn-uploads.huggingface.co/production/uploads/63111b2d88942700629f5771/KEN5JviayvHDtr6aij173.png)](https://cdn-uploads.huggingface.co/production/uploads/63111b2d88942700629f5771/KEN5JviayvHDtr6aij173.png)

[![image/png](https://cdn-uploads.huggingface.co/production/uploads/63111b2d88942700629f5771/jNV9276F1u1e_R5UMp_fU.png)](https://cdn-uploads.huggingface.co/production/uploads/63111b2d88942700629f5771/jNV9276F1u1e_R5UMp_fU.png)

[![image/png](https://cdn-uploads.huggingface.co/production/uploads/63111b2d88942700629f5771/Rjh00Teds_DTBVyijBDcJ.png)](https://cdn-uploads.huggingface.co/production/uploads/63111b2d88942700629f5771/Rjh00Teds_DTBVyijBDcJ.png)

[![image/png](https://cdn-uploads.huggingface.co/production/uploads/63111b2d88942700629f5771/KTRGy0z2QS8oxDlzleNIW.png)](https://cdn-uploads.huggingface.co/production/uploads/63111b2d88942700629f5771/KTRGy0z2QS8oxDlzleNIW.png)

[](#evals)Evals
---------------

We evaluated checkpoint 1000 ([abacusai/Liberated-Qwen1.5-72B-c1000](https://huggingface.co/abacusai/Liberated-Qwen1.5-72B-c1000)) from this training run on MT-Bench:

    ########## First turn ##########
                                            score
    model                           turn
    Liberated-Qwen-1.5-72b-ckpt1000 1     8.45000
    Qwen1.5-72B-Chat                1     8.44375
    
    
    ########## Second turn ##########
                                            score
    model                           turn
    Qwen1.5-72B-Chat                2     8.23750
    Liberated-Qwen-1.5-72b-ckpt1000 2     7.65000
    
    
    ########## Average ##########
                                        score
    model
    Qwen1.5-72B-Chat                 8.340625
    Liberated-Qwen-1.5-72b-ckpt1000  8.050000
    

The model preserves good performance on MMLU, scoring 77.13.

[](#future-plans)Future Plans
-----------------------------

This model will be released on the whole Qwen-1.5 series.

Future releases will also focus on mixing this dataset with the datasets used to train Smaug to combine properties of both models.

## Model overview

The `Liberated-Qwen1.5-72B` model is an AI assistant created by [AbacusAI](https://aimodels.fyi/creators/huggingFace/abacusai) and Eric Hartford. It is based on the Qwen/Qwen1.5-72B model and trained using open source datasets, including the new [SystemChat](https://huggingface.co/datasets/abacusai/SystemChat) dataset designed to teach the model compliance with the system prompt over long multi-turn conversations. No guardrails or censorship were added to the training dataset, so users are advised to implement their own alignment layer before deploying it.

## Model inputs and outputs

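The `Liberated-Qwen1.5-72B` model uses the ChatML prompt format: a system prompt sets the model's behavior, the user provides a prompt, and the model responds with text. It returns a JSON object only when the system prompt asks for one, as in the card's example. The base model has a 32k context window, though the maintainer fine-tuned it with 8k sequence length inputs.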

### Inputs
- **User prompt**: The initial text provided by the user to start the conversation.

### Outputs
- **Generated text**: The model's response, which follows whatever output format the system prompt specifies (for example, a JSON object).

## Capabilities

The `Liberated-Qwen1.5-72B` model is designed to be a helpful, uncensored AI assistant. It can engage in open-ended conversations, answer questions, and assist with a variety of tasks. No guardrails or censorship were added to its training data, so its behavior is steered mainly by the system prompt, and users should add their own alignment layer before exposing it as a service.

## What can I use it for?

The `Liberated-Qwen1.5-72B` model could be used for a wide range of applications, such as building custom chatbots, virtual assistants, or even games and interactive fiction. Its lack of censorship means it can be used to explore more open-ended and unfiltered interactions, though users should be cautious and responsible in how they deploy and use the model.

## Things to try

One interesting thing to try with the `Liberated-Qwen1.5-72B` model is to use it for roleplaying or interactive fiction. Its uncensored nature allows for more creative and unrestrained storytelling, though users should be mindful of the potential risks. Another idea is to fine-tune the model further with your own custom dataset to tailor its behavior and capabilities to your specific needs.

[![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/pf4d6FA7DriRtVq5HCkxd.png)](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/pf4d6FA7DriRtVq5HCkxd.png)

[![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/e4u8VYfDBh11u60rFYJHF.png)](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/e4u8VYfDBh11u60rFYJHF.png)

This model is a finetune of jondurbin's excellent [bagel](https://huggingface.co/jondurbin/bagel-34b-v0.2) model. This model has not utilised any form of merging.

We created Smaug-34B-v0.1 using a new fine-tuning technique, DPO-Positive (DPOP), and new pairwise preference versions of ARC, HellaSwag, and MetaMath (as well as other existing datasets). We introduce the technique and the full training details in our new paper: [https://arxiv.org/abs/2402.13228](https://arxiv.org/abs/2402.13228).

We show that on datasets in which the edit distance between pairs of completions is low (such as in math-based datasets), standard DPO loss can lead to a reduction of the model's likelihood of the preferred examples, as long as the relative probability between the preferred and dispreferred classes increases. Using these insights, we design DPOP, a new loss function and training procedure which avoids this failure mode. Surprisingly, we also find that DPOP outperforms DPO across a wide variety of datasets and downstream tasks, including datasets with high edit distances between completions.

We believe this new approach is generally useful in training across a wide range of model types and downstream use cases, and it powers all of our Smaug models. With the release of our paper and datasets, we are excited for the open source community to continue to build on and improve Smaug and spawn more dragons to dominate the LLM space!

### [](#evaluation-results)Evaluation Results

| Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
| --- | --- | --- | --- | --- | --- | --- |
| 77.29 | 74.23 | 86.76 | 76.66 | 70.22 | 83.66 | 72.18 |

### [](#contamination-results)Contamination Results

With reference model jondurbin/bagel-34b-v0.2:

| ARC | TruthfulQA | GSM8K |
| --- | --- | --- |
| 0.08 | 0.38 | 0.88 |

### [](#citation)Citation

Please cite the paper if you use data, model, or method in this repo.

    @article{pal2024smaug,
      title={Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive},
      author={Pal, Arka and Karkhanis, Deep and Dooley, Samuel and Roberts, Manley and Naidu, Siddartha and White, Colin},
      journal={arXiv preprint arXiv:2402.13228},
      year={2024}
    }

## Model overview

`Smaug-34B-v0.1` is a large language model created by the AI research group [abacusai](https://aimodels.fyi/creators/huggingFace/abacusai). It is a fine-tuned version of [jondurbin's bagel](https://huggingface.co/jondurbin/bagel-34b-v0.2) model, developed using a new fine-tuning technique called DPO-Positive (DPOP). 

The model was trained on a variety of datasets, including pairwise preference versions of ARC, HellaSwag, and MetaMath, as well as other existing datasets. The authors introduce DPOP in their paper "[Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive](https://arxiv.org/abs/2402.13228)," which shows how this new loss function and training procedure can outperform standard DPO across a wide range of tasks and datasets.

## Model inputs and outputs

### Inputs
- Text-based prompts and instructions that the model uses to generate relevant responses.

### Outputs
- Generated text that responds to the input prompt or instruction.
- The model can be used for a variety of text-to-text tasks, such as language generation, question answering, and task completion.

## Capabilities

Smaug-34B-v0.1 demonstrates strong performance on a range of benchmarks, including ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K. The authors report an average score of 77.29% across these evaluations.

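The authors also report contamination scores of 0.08 on ARC, 0.38 on TruthfulQA, and 0.88 on GSM8K, computed with jondurbin/bagel-34b-v0.2 as the reference model. The card does not report separate contamination scores for bagel-34b-v0.2 itself, so these numbers do not show an improvement over it.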

## What can I use it for?

Smaug-34B-v0.1 can be used for a variety of text-to-text tasks, such as language generation, question answering, and task completion. The model's strong performance on benchmarks like ARC and HellaSwag suggests it could be useful for tasks requiring reasoning and understanding.

## Things to try

The authors of Smaug-34B-v0.1 have released their paper and datasets, encouraging the open-source community to build on and improve the model. Researchers and developers interested in large language models, preference optimization, and overcoming failure modes in DPO may find the model and associated materials particularly interesting to explore.