GeorgiaTechResearchInstitute

Models by this creator


starcoder-gpteacher-code-instruct

GeorgiaTechResearchInstitute

Total Score

79

The starcoder-gpteacher-code-instruct model is a fine-tuned version of the BigCode StarCoder model that has been trained on the GPTeacher codegen dataset. The model is maintained by the Georgia Tech Research Institute. The base StarCoder models are 15.5B-parameter models trained on over 80 programming languages from The Stack (v1.2) dataset. They use Multi-Query Attention, a context window of 8,192 tokens, and were trained using the Fill-in-the-Middle objective on 1 trillion tokens.

Model inputs and outputs

Inputs

- **Instruction**: A text prompt describing a task for the model to complete, such as "Write a function that computes the square root."
- **Input**: Additional context information that the model can use to generate the requested output.

Outputs

- **Response**: The model's attempt at completing the requested task, generating code or text to fulfill the instruction.

Capabilities

The starcoder-gpteacher-code-instruct model is capable of following code-related instructions and generating relevant responses. For example, given the prompt "Write a function that computes the square root", the model may generate the following Python function:

    import math

    def sqrt(x):
        return math.sqrt(x)

What can I use it for?

The starcoder-gpteacher-code-instruct model could be useful for a variety of applications that require generating code or text based on instructions, such as:

- Automated code generation and assisted programming
- Technical assistance and question answering for developers
- Prototyping and experimentation with new ideas

Things to try

One interesting thing to try with the starcoder-gpteacher-code-instruct model is using the Tech Assistant prompt to prompt it into behaving as a technical assistant. This can help the model better understand and respond to code-related instructions. Another idea is to experiment with the model's ability to generate code in different programming languages by providing instructions that specify the desired language.
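As a starting point, here is a minimal sketch of the instruction/input/response flow described above, assuming the model follows the standard Hugging Face causal-LM interface. The `### Instruction / ### Input / ### Response` template and the example input text are assumptions based on the GPTeacher/Alpaca instruction format, not a confirmed specification; check the model card for the exact prompt template.

```python
# Minimal usage sketch (assumptions: standard transformers causal-LM API,
# Alpaca/GPTeacher-style prompt template). Requires transformers, torch,
# and accelerate for device_map="auto".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GeorgiaTechResearchInstitute/starcoder-gpteacher-code-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Assumed prompt template: instruction, optional input context, then response.
prompt = (
    "### Instruction:\n"
    "Write a function that computes the square root.\n\n"
    "### Input:\n"
    "Use the Python standard library.\n\n"
    "### Response:\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same pattern can be used to probe other languages: changing the instruction to, say, "Write a function in Rust that computes the square root" is one way to try the multi-language generation mentioned above.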


Updated 5/28/2024


galpaca-30b

GeorgiaTechResearchInstitute

Total Score

55

The galpaca-30b is a large language model developed by the Georgia Tech Research Institute. It is a fine-tuned version of the GALACTICA 30B model, which was trained on a large-scale scientific corpus to perform a variety of scientific tasks. The GALACTICA models range in size from 125M to 120B parameters, with galpaca-30b being the "large" 30B-parameter variant. The galpaca-30b model was further fine-tuned on the Alpaca dataset, a collection of 52K instruction-response pairs designed to enhance the instruction-following capabilities of pre-trained language models. This fine-tuning was done using a modified version of the Self-Instruct framework.

Model inputs and outputs

Inputs

- **Freeform text**: The galpaca-30b model can accept arbitrary freeform text as input, such as instructions, questions, or prompts.

Outputs

- **Generated text**: Based on the input text, the model will generate relevant output text. This can include answers to questions, responses to instructions, or continuations of the provided prompt.

Capabilities

The galpaca-30b model demonstrates strong performance on a range of scientific tasks, including citation prediction, scientific question answering, mathematical reasoning, and summarization. It outperforms several existing language models on knowledge-intensive tasks, thanks to its large-scale training on scientific data. However, the model is also prone to hallucination, meaning it can generate factually incorrect information, especially for less popular scientific concepts. Additionally, while the model exhibits lower toxicity levels than other large language models, it still shows some biases.

What can I use it for?

The primary intended users of the GALACTICA models, including galpaca-30b, are researchers studying the application of language models to scientific domains. The model could be used to build various kinds of scientific tooling, such as literature discovery, scientific question answering, and mathematical reasoning assistants. That said, the maintainers caution against using the model in production environments without proper safeguards, due to the risk of hallucination and bias.

Things to try

Given the model's strengths in scientific tasks, users may want to experiment with prompts related to various scientific fields, such as requesting explanations of scientific concepts, generating research paper abstracts, or solving mathematical problems. However, it's important to be aware of the model's limitations and not rely on its outputs as authoritative sources of information.
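For readers who want to try the kinds of scientific prompts described above, here is a minimal sketch using the Hugging Face transformers API. The Alpaca-style prompt wording and the example instruction are assumptions (galpaca-30b was fine-tuned on Alpaca-format instruction-response pairs, so this is a reasonable starting point, but the exact template used during fine-tuning may differ; consult the model card).

```python
# Minimal usage sketch (assumptions: standard transformers causal-LM API,
# Alpaca-style prompt). A 30B-parameter model needs substantial GPU memory;
# half precision and device_map="auto" (via accelerate) help spread the load.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GeorgiaTechResearchInstitute/galpaca-30b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Assumed Alpaca-style prompt; the instruction is an illustrative scientific query.
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n"
    "Explain the role of the citric acid cycle in cellular metabolism.\n\n"
    "### Response:\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Given the hallucination risk noted above, outputs from prompts like this should be treated as drafts to verify against primary sources rather than as authoritative answers.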


Updated 5/28/2024