Starling-LM-7B-alpha
====================

*   **Developed by:** Banghua Zhu \* , Evan Frick \* , Tianhao Wu \* , Hanlin Zhu and Jiantao Jiao.
*   **Model type:** Language Model finetuned with RLHF / RLAIF
*   **License:** Apache-2.0 license under the condition that the model is not used to compete with OpenAI
*   **Finetuned from model:** [Openchat 3.5](https://huggingface.co/openchat/openchat_3.5) (based on [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1))

We introduce Starling-7B, an open large language model (LLM) trained by Reinforcement Learning from AI Feedback (RLAIF). The model harnesses the power of our new GPT-4 labeled ranking dataset, [berkeley-nest/Nectar](https://huggingface.co/datasets/berkeley-nest/Nectar), and our new reward training and policy tuning pipeline. Starling-7B-alpha scores 8.09 in MT Bench with GPT-4 as a judge, outperforming every model to date on MT-Bench except for OpenAI's GPT-4 and GPT-4 Turbo. We release the ranking dataset [Nectar](https://huggingface.co/datasets/berkeley-nest/Nectar), the reward model [Starling-RM-7B-alpha](https://huggingface.co/berkeley-nest/Starling-RM-7B-alpha) and the language model [Starling-LM-7B-alpha](https://huggingface.co/berkeley-nest/Starling-LM-7B-alpha) on HuggingFace, and an online demo in LMSYS [Chatbot Arena](https://chat.lmsys.org). Stay tuned for our forthcoming code and paper, which will provide more details on the whole process.

Starling-LM-7B-alpha is a language model trained from [Openchat 3.5](https://huggingface.co/openchat/openchat_3.5) with reward model [berkeley-nest/Starling-RM-7B-alpha](https://huggingface.co/berkeley-nest/Starling-RM-7B-alpha) and policy optimization method [advantage-induced policy alignment (APA)](https://arxiv.org/abs/2306.02231). The evaluation results are listed below.

| Model | Tuning Method | MT Bench | AlpacaEval | MMLU |
|---|---|---|---|---|
| GPT-4-Turbo | ? | 9.32 | 97.70 | |
| GPT-4 | SFT + PPO | 8.99 | 95.28 | 86.4 |
| **Starling-7B** | C-RLFT + APA | 8.09 | 91.99 | 63.9 |
| Claude-2 | ? | 8.06 | 91.36 | 78.5 |
| GPT-3.5-Turbo | ? | 7.94 | 89.37 | 70 |
| Claude-1 | ? | 7.9 | 88.39 | 77 |
| Tulu-2-dpo-70b | SFT + DPO | 7.89 | 95.1 | |
| Openchat-3.5 | C-RLFT | 7.81 | 88.51 | 64.3 |
| Zephyr-7B-beta | SFT + DPO | 7.34 | 90.60 | 61.4 |
| Llama-2-70b-chat-hf | SFT + PPO | 6.86 | 92.66 | 63 |
| Neural-chat-7b-v3-1 | SFT + DPO | 6.84 | 84.53 | 62.4 |
| Tulu-2-dpo-7b | SFT + DPO | 6.29 | 85.1 | |

For more detailed discussions, please check out our [blog post](https://starling.cs.berkeley.edu), and stay tuned for our upcoming code and paper!

*   **Blog:** [https://starling.cs.berkeley.edu/](https://starling.cs.berkeley.edu/)
*   **Paper:** Coming soon!
*   **Code:** Coming soon!

Uses
----

**Important: Please use the exact chat template provided below for the model; otherwise, performance will degrade. The model output can be verbose in rare cases. Consider setting temperature = 0 to make this happen less often.**

Our model follows the exact chat template and usage of [Openchat 3.5](https://huggingface.co/openchat/openchat_3.5). Please refer to their model card for more details. In addition, our model is hosted on LMSYS [Chatbot Arena](https://chat.lmsys.org) for free testing.

The conversation template is the same as Openchat 3.5:

    import transformers
    tokenizer = transformers.AutoTokenizer.from_pretrained("openchat/openchat_3.5")
    
    # Single-turn
    tokens = tokenizer("GPT4 Correct User: Hello<|end_of_turn|>GPT4 Correct Assistant:").input_ids
    assert tokens == [1, 420, 6316, 28781, 3198, 3123, 1247, 28747, 22557, 32000, 420, 6316, 28781, 3198, 3123, 21631, 28747]
    
    # Multi-turn
    tokens = tokenizer("GPT4 Correct User: Hello<|end_of_turn|>GPT4 Correct Assistant: Hi<|end_of_turn|>GPT4 Correct User: How are you today?<|end_of_turn|>GPT4 Correct Assistant:").input_ids
    assert tokens == [1, 420, 6316, 28781, 3198, 3123, 1247, 28747, 22557, 32000, 420, 6316, 28781, 3198, 3123, 21631, 28747, 15359, 32000, 420, 6316, 28781, 3198, 3123, 1247, 28747, 1602, 460, 368, 3154, 28804, 32000, 420, 6316, 28781, 3198, 3123, 21631, 28747]
    
    # Coding Mode
    tokens = tokenizer("Code User: Implement quicksort using C++<|end_of_turn|>Code Assistant:").input_ids
    assert tokens == [1, 7596, 1247, 28747, 26256, 2936, 7653, 1413, 334, 1680, 32000, 7596, 21631, 28747]
    

Code Examples
-------------

    import transformers
    
    tokenizer = transformers.AutoTokenizer.from_pretrained("berkeley-nest/Starling-LM-7B-alpha")
    model = transformers.AutoModelForCausalLM.from_pretrained("berkeley-nest/Starling-LM-7B-alpha")
    
    def generate_response(prompt):
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids
        outputs = model.generate(
            input_ids,
            max_length=256,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
        response_ids = outputs[0]
        response_text = tokenizer.decode(response_ids, skip_special_tokens=True)
        return response_text
    
    # Single-turn conversation
    prompt = "Hello, how are you?"
    single_turn_prompt = f"GPT4 Correct User: {prompt}<|end_of_turn|>GPT4 Correct Assistant:"
    response_text = generate_response(single_turn_prompt)
    print("Response:", response_text)
    
    # Multi-turn conversation
    prompt = "Hello"
    follow_up_question = "How are you today?"
    response = "Hi"  # the assistant's earlier reply; an empty string would leave that turn blank
    multi_turn_prompt = f"GPT4 Correct User: {prompt}<|end_of_turn|>GPT4 Correct Assistant: {response}<|end_of_turn|>GPT4 Correct User: {follow_up_question}<|end_of_turn|>GPT4 Correct Assistant:"
    response_text = generate_response(multi_turn_prompt)
    print("Multi-turn conversation response:", response_text)
    
    # Coding conversation
    prompt = "Implement quicksort using C++"
    coding_prompt = f"Code User: {prompt}<|end_of_turn|>Code Assistant:"
    response_text = generate_response(coding_prompt)
    print("Coding conversation response:", response_text)
    

License
-------

The dataset, model and online demo are a research preview intended for non-commercial use only, subject to the data distillation [License](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md) of LLaMA, [Terms of Use](https://openai.com/policies/terms-of-use) of the data generated by OpenAI, and [Privacy Practices](https://chrome.google.com/webstore/detail/sharegpt-share-your-chatg/daiacboceoaocpibfodeljbdfacokfjb) of ShareGPT. Please contact us if you find any potential violation.

Acknowledgment
--------------

We would like to thank Wei-Lin Chiang from Berkeley for detailed feedback on the blog and the projects. We would like to thank the [LMSYS Organization](https://lmsys.org/) for their support with the [lmsys-chat-1M](https://huggingface.co/datasets/lmsys/lmsys-chat-1m) dataset, evaluation and online demo. We would like to thank the open source community for their efforts in providing the datasets and base models we used to develop the project, including but not limited to Anthropic, Llama, Mistral, Hugging Face H4, LMSYS, OpenChat, OpenBMB, Flan and ShareGPT.

Citation
--------

    @misc{starling2023,
        title = {Starling-7B: Improving LLM Helpfulness & Harmlessness with RLAIF},
        url = {},
        author = {Zhu, Banghua and Frick, Evan and Wu, Tianhao and Zhu, Hanlin and Jiao, Jiantao},
        month = {November},
        year = {2023}
    }

## Model overview

`Starling-LM-7B-alpha` is a large language model developed by the Berkeley NEST team. It is based on the `Openchat 3.5` model, which in turn is based on the `Mistral-7B-v0.1` model. The key innovation of `Starling-LM-7B-alpha` is that it was trained using Reinforcement Learning from AI Feedback (RLAIF), leveraging a new dataset called [Nectar](https://huggingface.co/datasets/berkeley-nest/Nectar) and a new reward training and policy tuning pipeline. This allows the model to achieve state-of-the-art performance on the MT Bench benchmark, scoring 8.09 and outperforming every model to date except for OpenAI's GPT-4 and GPT-4 Turbo.

## Model inputs and outputs

`Starling-LM-7B-alpha` is a text-to-text model, taking natural language inputs and generating text outputs. The model uses the same chat template as the [Openchat 3.5](https://huggingface.co/openchat/openchat_3.5) model: a single-turn input is formatted as `GPT4 Correct User: {input}<|end_of_turn|>GPT4 Correct Assistant:`, and the output is the generated text that follows.
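Because the prompt format matters for output quality, it can help to build prompts programmatically rather than by hand. The following is a minimal sketch that reproduces the strings used in the tokenizer examples in the Uses section. The helper name `build_prompt`, the `mode` parameter, and the `ROLE_PREFIXES` table are illustrative assumptions, not part of any released API.

```python
END_OF_TURN = "<|end_of_turn|>"

# Role prefixes copied from the tokenizer examples in the Uses section.
ROLE_PREFIXES = {
    "chat": ("GPT4 Correct User", "GPT4 Correct Assistant"),
    "code": ("Code User", "Code Assistant"),
}


def build_prompt(turns, mode="chat"):
    """Format alternating user/assistant turns into one prompt string.

    turns: list of strings that starts with a user message and alternates
    user, assistant, user, ... It must end with a user message.
    """
    if len(turns) % 2 == 0:
        raise ValueError("turns must end with a user message")
    user_prefix, assistant_prefix = ROLE_PREFIXES[mode]
    parts = []
    for i, text in enumerate(turns):
        prefix = user_prefix if i % 2 == 0 else assistant_prefix
        parts.append(f"{prefix}: {text}{END_OF_TURN}")
    # Leave the final assistant turn open so the model completes it.
    parts.append(f"{assistant_prefix}:")
    return "".join(parts)


single_turn = build_prompt(["Hello"])
multi_turn = build_prompt(["Hello", "Hi", "How are you today?"])
coding = build_prompt(["Implement quicksort using C++"], mode="code")
```

The resulting strings can be passed to the tokenizer exactly like the hand-written prompts in the Uses section.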

### Inputs
- **Natural language prompts**: The model can accept a wide variety of natural language prompts, from open-ended questions to task-oriented instructions.

### Outputs
- **Generated text**: The model outputs generated text that is relevant to the input prompt. This can include responses to questions, explanations of concepts, and task completions.

## Capabilities

`Starling-LM-7B-alpha` is evaluated on MT Bench, AlpacaEval, and MMLU. On MT Bench, it scores higher than GPT-3.5-Turbo, Claude-2, and the much larger Tulu-2-dpo-70b, although several of these models score higher on AlpacaEval or MMLU. The model is aimed at tasks that require language understanding and generation, such as open-ended conversations, question answering, and summarization.

## What can I use it for?

`Starling-LM-7B-alpha` can be used for a variety of applications that require natural language processing, such as:

- **Chatbots and virtual assistants**: The model's strong performance on conversational tasks makes it well-suited for building chatbots and virtual assistants.
- **Content generation**: The model can be used to generate a wide range of text-based content, from articles and stories to product descriptions and marketing copy.
- **Question answering**: The model's ability to understand and respond to questions makes it useful for building question-answering systems.

## Things to try

One interesting aspect of `Starling-LM-7B-alpha` is its use of Reinforcement Learning from AI Feedback (RLAIF) during training. This approach trains the model on rankings labeled by GPT-4 (the Nectar dataset) rather than on human-generated rankings, with the aim of producing responses that are more helpful and less harmful. Experimenting with different prompts and tasks can help you explore how this training approach affects the model's behavior and outputs.

Starling-RM-7B-alpha
====================

Starling-RM-7B-alpha is a reward model trained from [Llama2-7B-Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). Following the reward model training method in [the instructGPT paper](https://arxiv.org/abs/2203.02155), we remove the last layer of Llama2-7B Chat and attach a linear layer that outputs a scalar for any pair of input prompt and response. We train the reward model on the preference dataset [berkeley-nest/Nectar](https://huggingface.co/datasets/berkeley-nest/Nectar), using the K-wise maximum likelihood estimator proposed in [this paper](https://arxiv.org/abs/2301.11270). A response that is more helpful and less harmful will get a higher reward score. Note that since the preference dataset [berkeley-nest/Nectar](https://huggingface.co/datasets/berkeley-nest/Nectar) is based on GPT-4 preference, the reward model is likely to be biased towards GPT-4's own preferences, including longer responses and certain response formats.
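To make the K-wise objective more concrete, here is a minimal sketch of a ranking loss under a Plackett-Luce model, which is the form we assume the K-wise maximum likelihood estimator takes. The function name `k_wise_nll` is hypothetical and this is not the training code used for the model. It takes the reward scores of K responses to one prompt, ordered from most to least preferred, and returns the negative log-likelihood of that ordering.

```python
import math


def k_wise_nll(rewards_ranked):
    """Negative log-likelihood of one ranking under a Plackett-Luce model.

    rewards_ranked: reward scores for K responses to the same prompt,
    ordered from most preferred to least preferred.
    """
    nll = 0.0
    for j in range(len(rewards_ranked)):
        remaining = rewards_ranked[j:]
        # Numerically stable log-sum-exp over the responses not yet placed.
        m = max(remaining)
        log_norm = m + math.log(sum(math.exp(r - m) for r in remaining))
        # Log-probability that response j is picked first among the remaining ones.
        nll -= rewards_ranked[j] - log_norm
    return nll


# Rewards that agree with the ranking give a lower loss than rewards that contradict it.
loss_consistent = k_wise_nll([2.0, 1.0, 0.0])
loss_reversed = k_wise_nll([0.0, 1.0, 2.0])
```

For K = 2 this reduces to the familiar pairwise loss `-log(sigmoid(r_preferred - r_rejected))`.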

For more detailed discussions, please check out our [blog post](https://starling.cs.berkeley.edu), and stay tuned for our upcoming code and paper!

*   **Developed by:** Banghua Zhu \* , Evan Frick \* , Tianhao Wu \* , Hanlin Zhu and Jiantao Jiao.
*   **Model type:** Reward Model for RLHF
*   **License:** Apache-2.0 license under the condition that the model is not used to compete with OpenAI
*   **Finetuned from model:** [Llama2-7B-Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)

### Model Sources

*   **Blog:** [https://starling.cs.berkeley.edu/](https://starling.cs.berkeley.edu/)
*   **Paper:** Coming soon!
*   **Code:** Coming soon!

Uses
----

Please use the following code for inference with the reward model.

    import os
    import torch
    from torch import nn
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from huggingface_hub import snapshot_download
    
    ## Define the reward model function class
    
    class GPTRewardModel(nn.Module):
        def __init__(self, model_path):
            super().__init__()
            model = AutoModelForCausalLM.from_pretrained(model_path)
            self.config = model.config
            self.config.n_embd = self.config.hidden_size if hasattr(self.config, "hidden_size") else self.config.n_embd
            self.model = model
            self.transformer = model.model
            self.v_head = nn.Linear(self.config.n_embd, 1, bias=False)
            self.tokenizer = AutoTokenizer.from_pretrained(model_path)
            self.tokenizer.pad_token = self.tokenizer.unk_token
            self.PAD_ID = self.tokenizer(self.tokenizer.pad_token)["input_ids"][0]
    
        def get_device(self):
            return self.model.device
    
        def forward(
            self,
            input_ids=None,
            past_key_values=None,
            attention_mask=None,
            position_ids=None,
        ):
            """
            input_ids, attention_mask: torch.Size([bs, seq_len])
            return: scores: List[bs]
            """
            bs = input_ids.shape[0]
            transformer_outputs = self.transformer(
                input_ids,
                past_key_values=past_key_values,
                attention_mask=attention_mask,
                position_ids=position_ids,
            )
            hidden_states = transformer_outputs[0]
            scores = []
            rewards = self.v_head(hidden_states).squeeze(-1)
            for i in range(bs):
                c_inds = (input_ids[i] == self.PAD_ID).nonzero()
                c_ind = c_inds[0].item() if len(c_inds) > 0 else input_ids.shape[1]
                scores.append(rewards[i, c_ind - 1])
            return scores
    
    ## Load the model and tokenizer
    
    reward_model = GPTRewardModel("meta-llama/Llama-2-7b-chat-hf")
    reward_tokenizer = reward_model.tokenizer
    reward_tokenizer.truncation_side = "left"
    
    directory = snapshot_download("berkeley-nest/Starling-RM-7B-alpha")
    for fpath in os.listdir(directory):
        if fpath.endswith(".pt") or fpath.endswith("model.bin"):
            checkpoint = os.path.join(directory, fpath)
            break
       
    reward_model.load_state_dict(torch.load(checkpoint), strict=False)
    reward_model.eval().requires_grad_(False)
    
    
    ## Define the reward function
    
    # These two names are used by get_reward below; the batch size of 1 is a
    # conservative default and can be raised if memory allows.
    reward_device = reward_model.get_device()
    reward_batch_size = 1
    
    def get_reward(samples):
        """samples: List[str]"""
        encodings_dict = reward_tokenizer(
            samples,
            truncation=True,
            max_length=2048,
            padding="max_length",
            return_tensors="pt",
        ).to(reward_device)
        input_ids = encodings_dict["input_ids"]
        attention_masks = encodings_dict["attention_mask"]
        mbs = reward_batch_size
        out = []
        # Score the samples in micro-batches of size mbs
        for start in range(0, len(samples), mbs):
            rewards = reward_model(input_ids=input_ids[start : start + mbs], attention_mask=attention_masks[start : start + mbs])
            out.extend(rewards)
        return torch.hstack(out)
    
    ## Inference over test prompts with llama2 chat template
    
    test_sample = ["<s>[INST] Hello? </s> [/INST] Hi, how can I help you?</s>"] 
    reward_for_test_sample = get_reward(test_sample)
    print(reward_for_test_sample)
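
The `forward` method above computes a value for every token position but keeps only one per sequence: the value at the token just before the first padding token, or at the last token when there is no padding. The sketch below isolates that index logic in plain Python. The helper name `score_position` and the token ids are made up for illustration.

```python
def score_position(input_ids, pad_id):
    """Index of the token whose value is used as the sequence reward.

    Mirrors the loop in GPTRewardModel.forward: take the position just
    before the first pad token, or the last position if nothing is padded.
    If the very first token is a pad, this returns -1, which indexes the
    last position, as in the original code.
    """
    for i, token in enumerate(input_ids):
        if token == pad_id:
            return i - 1
    return len(input_ids) - 1


# Hypothetical token ids; 0 plays the role of the pad id.
padded = [1, 518, 25580, 2, 0, 0, 0]
unpadded = [1, 518, 25580, 2]
padded_index = score_position(padded, pad_id=0)
unpadded_index = score_position(unpadded, pad_id=0)
```

Since the code sets the pad token to the tokenizer's unknown token, an unknown token inside the text would be treated as the start of padding, which is worth keeping in mind for unusual inputs.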
    

License
-------

The dataset, model and online demo are a research preview intended for non-commercial use only, subject to the data distillation [License](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md) of LLaMA, [Terms of Use](https://openai.com/policies/terms-of-use) of the data generated by OpenAI, and [Privacy Practices](https://chrome.google.com/webstore/detail/sharegpt-share-your-chatg/daiacboceoaocpibfodeljbdfacokfjb) of ShareGPT. Please contact us if you find any potential violation.

Acknowledgment
--------------

We would like to thank Wei-Lin Chiang from Berkeley for detailed feedback on the blog and the projects. We would like to thank the [LMSYS Organization](https://lmsys.org/) for their support with the [lmsys-chat-1M](https://huggingface.co/datasets/lmsys/lmsys-chat-1m) dataset, evaluation and online demo. We would like to thank the open source community for their efforts in providing the datasets and base models we used to develop the project, including but not limited to Anthropic, Llama, Mistral, Hugging Face H4, LMSYS, OpenChat, OpenBMB, Flan and ShareGPT.

Citation
--------

    @misc{starling2023,
        title = {Starling-7B: Improving LLM Helpfulness & Harmlessness with RLAIF},
        url = {},
        author = {Zhu, Banghua and Frick, Evan and Wu, Tianhao and Zhu, Hanlin and Jiao, Jiantao},
        month = {November},
        year = {2023}
    }

## Model Overview

`Starling-RM-7B-alpha` is a reward model developed by the berkeley-nest team. It was trained from the [Llama2-7B-Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) model using the reward model training method described in [the instructGPT paper](https://arxiv.org/abs/2203.02155). The training data is the [berkeley-nest/Nectar](https://huggingface.co/datasets/berkeley-nest/Nectar) dataset, a preference dataset based on GPT-4 preferences.

The `Starling-RM-7B-alpha` model outputs a scalar reward score for any given prompt and response pair. Responses that are more helpful and less harmful will receive a higher reward score. This reward model is likely biased towards GPT-4's preferences, including longer responses and certain response formats.

Similar models developed by the berkeley-nest team include the [Starling-RM-34B](https://aimodels.fyi/models/huggingFace/starling-rm-34b-nexusflow) and [Starling-LM-7B-alpha](https://aimodels.fyi/models/huggingFace/starling-lm-7b-alpha-berkeley-nest) models. The Starling-RM-34B model is trained with the same method but uses the larger Yi-34B-Chat as the base model, while Starling-LM-7B-alpha is a language model trained using the berkeley-nest/Starling-RM-7B-alpha reward model.

## Model Inputs and Outputs

### Inputs
- **Prompt**: A piece of text that the model will evaluate and provide a reward score for.
- **Response**: A candidate response to the provided prompt.

### Outputs
- **Reward Score**: A scalar value representing the model's assessment of how helpful and harmless the given response is for the prompt.

## Capabilities

The `Starling-RM-7B-alpha` model is able to assess the helpfulness and harmlessness of text responses based on the training data and methodology used. It can be used to rank and compare different responses to the same prompt, favoring those that are more aligned with the preferences in the training data.
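As a rough illustration of ranking candidate responses, the sketch below formats prompt and response pairs the same way as the test sample in the inference code and sorts candidates by score. The helpers `format_sample` and `rank_by_reward` are hypothetical names, and the scores are made-up numbers standing in for the output of `get_reward`.

```python
def format_sample(prompt, response):
    """Wrap a prompt/response pair like the test sample in the inference code."""
    return f"<s>[INST] {prompt} </s> [/INST] {response}</s>"


def rank_by_reward(responses, scores):
    """Pair each response with its score, sorted from highest to lowest reward."""
    if len(responses) != len(scores):
        raise ValueError("responses and scores must have the same length")
    return sorted(zip(responses, scores), key=lambda pair: pair[1], reverse=True)


responses = ["What?", "Hi, how can I help you?"]
samples = [format_sample("Hello?", r) for r in responses]
# In practice the scores would come from get_reward(samples).tolist();
# the values below are invented for illustration only.
scores = [-0.9, 1.7]
ranking = rank_by_reward(responses, scores)
```

Only the relative order of the scores is used here, which fits a model trained on rankings rather than on absolute ratings.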

The model's performance is benchmarked on datasets like Truthful QA, Chatbot Arena Conversations, and PKU's Safe-RLHF, with the Starling-RM-34B model outperforming the Starling-RM-7B-alpha across all these metrics.

## What Can I Use It For?

The `Starling-RM-7B-alpha` model can be used as part of a reinforcement learning pipeline to train large language models to be more helpful and less harmful. By providing reward scores for model outputs during training, the model can be optimized to generate responses that are aligned with the preferences in the training data.

This type of reward model can also be used to evaluate the outputs of other language models, helping to identify responses that may be problematic or undesirable. The model could potentially be integrated into chatbot or virtual assistant applications to help ensure the system behaves in a way that is beneficial to users.

## Things to Try

One interesting thing to try with the `Starling-RM-7B-alpha` model is to compare its reward scores for different responses to the same prompt. This could help surface nuances in how the model assesses helpfulness and harmlessness. It would also be worth exploring how the model's performance compares to the larger Starling-RM-34B model, and whether the differences in reward scores align with human assessments.

Additionally, it could be insightful to probe the model's biases by crafting prompts or responses that play to the preferences in the [berkeley-nest/Nectar](https://huggingface.co/datasets/berkeley-nest/Nectar) dataset, and see how the reward scores are affected. This could shed light on the model's limitations and areas for improvement.
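One simple way to probe the length bias mentioned in the model card is to check how strongly reward scores correlate with response length. The sketch below computes a Pearson correlation in plain Python. The function name `pearson` and all numbers are illustrative assumptions; real scores would come from running the reward model on your own responses.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical word counts and reward scores for four responses to one prompt.
lengths = [12, 40, 95, 210]
scores = [-0.4, 0.3, 0.9, 1.6]
length_reward_correlation = pearson(lengths, scores)
```

A correlation close to 1 on responses of similar quality would suggest that length alone is driving the score, though a small sample like this one can only hint at the effect.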