weblab-10b-instruction-sft
==========================

Overview
========

This repository provides a Japanese-centric multilingual GPT-NeoX model with 10 billion parameters.

*   **Library**
    
    The model was trained using code based on [EleutherAI/gpt-neox](https://github.com/EleutherAI/gpt-neox).
    
*   **Model architecture**
    
    A 36-layer, 4864-hidden-size transformer-based language model.
    
*   **Pre-training**
    
    The model was trained on around **600B** tokens from a mixture of the following corpora.
    
    *   [Japanese C4](https://huggingface.co/datasets/mc4)
    *   [The Pile](https://huggingface.co/datasets/EleutherAI/pile)
*   **Instruction-supervised-finetuning**
    
    The model was fine-tuned for one epoch on a subset of records from a mixture of the following datasets.
    
    *   [Alpaca (English)](https://github.com/gururise/AlpacaDataCleaned/blob/main/alpaca_data_cleaned.json)
    *   [Alpaca (Japanese translation)](https://github.com/shi3z/alpaca_ja/blob/main/alpaca_cleaned_ja.json)
    *   [Flan 2021 (English)](https://huggingface.co/datasets/conceptofmind/flan2021_submix_original)
    *   [Flan CoT (English)](https://huggingface.co/datasets/conceptofmind/cot_submix_original)
    *   [Flan Dialog (English)](https://huggingface.co/datasets/conceptofmind/dialog_submix_original)
*   **Model Series**
    
    | Variant | Link |
    | :-- | :-- |
    | weblab-10b-instruction-sft | [https://huggingface.co/matsuo-lab/weblab-10b-instruction-sft](https://huggingface.co/matsuo-lab/weblab-10b-instruction-sft) |
    | weblab-10b | [https://huggingface.co/matsuo-lab/weblab-10b](https://huggingface.co/matsuo-lab/weblab-10b) |
    
*   **Authors**
    
    Takeshi Kojima
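
The layer count and hidden size listed under "Model architecture" can be sanity-checked against the "10 billion parameters" figure. The sketch below uses the common `12 * layers * hidden_size^2` approximation for a decoder-only transformer. It assumes a 4x MLP expansion and ignores embeddings, biases, and layer norms, none of which are specified in this card, so treat the result as a rough estimate rather than the exact parameter count. The function name is hypothetical.

```python
# Rough parameter estimate for a GPT-NeoX-style decoder-only transformer.
# Assumptions (not stated in the model card): 4x MLP expansion; embeddings,
# biases, and layer-norm parameters are ignored.
NUM_LAYERS = 36
HIDDEN_SIZE = 4864


def transformer_block_params(num_layers: int, hidden_size: int) -> int:
    """Approximate weights in the transformer blocks.

    Per layer: ~4*d^2 for attention (QKV and output projections)
    plus ~8*d^2 for the MLP (d -> 4d -> d), i.e. 12*d^2 in total.
    """
    return 12 * num_layers * hidden_size ** 2


block_params = transformer_block_params(NUM_LAYERS, HIDDEN_SIZE)

# float16 stores each weight in 2 bytes.
fp16_gib = block_params * 2 / 1024 ** 3

print(f"{block_params / 1e9:.2f}B parameters in transformer blocks")
print(f"about {fp16_gib:.1f} GiB of weights when loaded in float16")
```

The estimate comes out at roughly 10.2 billion weights, consistent with the model's name, and indicates that loading in float16 needs on the order of 19 GiB for the weights alone.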
    

* * *

Benchmarking
============

*   **Japanese benchmark : JGLUE 8-task (2023-08-27)**
    
    *   _We used the [Stability-AI/lm-evaluation-harness](https://github.com/Stability-AI/lm-evaluation-harness/tree/2f1583c0735eacdfdfa5b7d656074b69577b6774) library for evaluation._
    *   _The 8-task average accuracy is based on the results of JCommonsenseQA-1.1, JNLI-1.1, MARC-ja-1.1, JSQuAD-1.1, jaqket\_v2-0.2, xlsum\_ja-1.0, xwinograd\_ja, and mgsm-1.0._
    *   _Models are loaded in float16 and evaluated with template version 0.3 using few-shot in-context learning._
    *   _The number of few-shot examples per task is 3, 3, 3, 2, 1, 1, 0, 5, respectively._
    *   _special\_tokens\_map.json was modified to avoid errors when evaluating the second half of the benchmarks. As a result, the scores on the first half of the benchmarks changed slightly._
    
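    | model | average | jcommonsenseqa | jnli | marc\_ja | jsquad | jaqket\_v2 | xlsum\_ja | xwinograd\_ja | mgsm |
    | :-- | --: | --: | --: | --: | --: | --: | --: | --: | --: |
    | weblab-10b-instruction-sft | 59.11 | 74.62 | 66.56 | 95.49 | 78.34 | 63.32 | 20.57 | 71.95 | 2 |
    | weblab-10b | 50.74 | 66.58 | 53.74 | 82.07 | 62.94 | 56.19 | 10.03 | 71.95 | 2.4 |
    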
    
*   **Japanese benchmark : JGLUE 4-task (2023-08-18)**
    
    *   _We used the [Stability-AI/lm-evaluation-harness](https://github.com/Stability-AI/lm-evaluation-harness/tree/2f1583c0735eacdfdfa5b7d656074b69577b6774) library for evaluation._
    *   _The 4-task average accuracy is based on the results of JCommonsenseQA-1.1, JNLI-1.1, MARC-ja-1.1, and JSQuAD-1.1._
    *   _Models are loaded in float16 and evaluated with template version 0.3 using few-shot in-context learning._
    *   _The number of few-shot examples per task is 3, 3, 3, 2, respectively._
    
    | Model | Average | JCommonsenseQA | JNLI | MARC-ja | JSQuAD |
    | :-- | --: | --: | --: | --: | --: |
    | weblab-10b-instruction-sft | 78.78 | 74.35 | 65.65 | 96.06 | 79.04 |
    | weblab-10b | 66.38 | 65.86 | 54.19 | 84.49 | 60.98 |
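
The "average" columns above appear to be plain unweighted means of the per-task scores. The sketch below recomputes them from the numbers reported in this card and checks that they agree with the reported averages to within rounding. The equal-weight assumption and the helper name `recompute_averages` are illustrative and not taken from the evaluation harness.

```python
from statistics import mean

# (reported average, per-task scores) copied from the tables in this card.
EIGHT_TASK = {
    "weblab-10b-instruction-sft": (59.11, [74.62, 66.56, 95.49, 78.34, 63.32, 20.57, 71.95, 2.0]),
    "weblab-10b": (50.74, [66.58, 53.74, 82.07, 62.94, 56.19, 10.03, 71.95, 2.4]),
}
FOUR_TASK = {
    "weblab-10b-instruction-sft": (78.78, [74.35, 65.65, 96.06, 79.04]),
    "weblab-10b": (66.38, [65.86, 54.19, 84.49, 60.98]),
}


def recompute_averages(table, tolerance=0.01):
    """Return {model: unweighted mean}; raise if it disagrees with the reported value."""
    result = {}
    for model, (reported, scores) in table.items():
        recomputed = mean(scores)
        if abs(recomputed - reported) > tolerance:
            raise ValueError(f"{model}: reported {reported}, recomputed {recomputed:.4f}")
        result[model] = recomputed
    return result


for name, table in (("8-task", EIGHT_TASK), ("4-task", FOUR_TASK)):
    averages = recompute_averages(table)
    # Gain of the instruction-tuned model over the base model, in points.
    gain = averages["weblab-10b-instruction-sft"] - averages["weblab-10b"]
    print(f"{name}: instruction tuning changes the average by {gain:+.2f} points")
```

All four reported averages pass this check, which supports reading them as simple means over the listed tasks.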
    

* * *

How to use the model
====================

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM
    
    # Load the tokenizer and the model weights in float16.
    tokenizer = AutoTokenizer.from_pretrained("matsuo-lab/weblab-10b-instruction-sft")
    model = AutoModelForCausalLM.from_pretrained("matsuo-lab/weblab-10b-instruction-sft", torch_dtype=torch.float16)
    
    # Move the model to the GPU when one is available.
    if torch.cuda.is_available():
        model = model.to("cuda")
    
    # Instruction text, wrapped in the prompt template.
    text = ""
    text = f'\n\n### :\n{text}\n\n### :'
    token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")
    
    # Sample up to 100 new tokens without tracking gradients.
    with torch.no_grad():
        output_ids = model.generate(
            token_ids.to(model.device),
            max_new_tokens=100,
            do_sample=True,
            temperature=0.7,
            top_p=0.95
        )
    
    # Decode the full sequence (prompt plus generated tokens) and print it.
    output = tokenizer.decode(output_ids.tolist()[0])
    print(output)
    

* * *

License
=======

[cc-by-nc-4.0](https://creativecommons.org/licenses/by-nc/4.0/)

## Model overview

`weblab-10b-instruction-sft` is a Japanese-centric multilingual GPT-NeoX model with 10 billion parameters. It was trained using code based on [EleutherAI/gpt-neox](https://github.com/EleutherAI/gpt-neox) and has a 36-layer, 4864-hidden-size transformer architecture. The model was pre-trained on around 600B tokens from a mixture of [Japanese C4](https://huggingface.co/datasets/mc4) and [The Pile](https://huggingface.co/datasets/EleutherAI/pile). It was then fine-tuned for one epoch on a subset of records from [Alpaca (English)](https://github.com/gururise/AlpacaDataCleaned/blob/main/alpaca_data_cleaned.json), [Alpaca (Japanese translation)](https://github.com/shi3z/alpaca_ja/blob/main/alpaca_cleaned_ja.json), and the English Flan 2021, Flan CoT, and Flan Dialog submixes so that it follows instructions.

This model can be contrasted with [japanese-gpt-neox-3.6b-instruction-sft](https://aimodels.fyi/models/huggingFace/japanese-gpt-neox-36b-instruction-sft-rinna), a 3.6-billion-parameter Japanese GPT-NeoX model that has also been fine-tuned for instruction following. The clearest difference is scale: `weblab-10b-instruction-sft` has 10 billion parameters and was pre-trained on a mixture of Japanese C4 and The Pile.

## Model inputs and outputs

### Inputs
- **Text prompts**: The model takes in text prompts, which can include multi-turn conversations or instructions for the model to follow.

### Outputs 
- **Generated text**: The model outputs generated text that continues or responds to the provided prompt. This can include generating coherent, contextual responses to instructions or conversational prompts.
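
For decoder-only models in `transformers`, the sequence returned by `generate` starts with the prompt tokens and continues with the newly generated ones, so decoding it directly repeats the prompt. A minimal sketch of keeping only the response, assuming the prompt length in tokens is known, is shown below. The helper name `strip_prompt` is hypothetical, and the token IDs in the example are made up for illustration.

```python
def strip_prompt(output_ids, prompt_length):
    """Return only the token IDs that were generated after the prompt."""
    if prompt_length > len(output_ids):
        raise ValueError("prompt_length exceeds the length of the output sequence")
    return output_ids[prompt_length:]


# Toy example: a 3-token prompt followed by 2 generated tokens.
prompt_ids = [101, 102, 103]
full_output = prompt_ids + [7, 8]
response_ids = strip_prompt(full_output, len(prompt_ids))
print(response_ids)

# With the usage code in this card, the equivalent call would be roughly:
#   response_ids = strip_prompt(output_ids.tolist()[0], token_ids.shape[1])
#   print(tokenizer.decode(response_ids, skip_special_tokens=True))
```

Slicing by token count is more robust than removing the prompt string from the decoded text, since decoding does not always reproduce the prompt character for character.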

## Capabilities

The `weblab-10b-instruction-sft` model can be used for a variety of language generation and understanding tasks, particularly ones involving Japanese. On the JGLUE 8-task evaluation reported above it averages 59.11, against 50.74 for the base `weblab-10b` model, with scores of 74.62 on JCommonsenseQA, 66.56 on JNLI, and 95.49 on MARC-ja. Its scores are much lower on xlsum\_ja (20.57) and mgsm (2), so summarization and arithmetic reasoning are weaker areas. Within those limits, the model can generate fluent, contextual responses to open-ended prompts, which makes it a candidate for applications like chatbots and language assistants.

## What can I use it for?

The `weblab-10b-instruction-sft` model could be a good starting point for building Japanese-language chatbots, virtual assistants, or other applications that require fluent text generation and language understanding. Because it was trained on both Japanese and English data, it may also be usable for cross-lingual applications. As with any large language model, its outputs should be reviewed and filtered to ensure safety and to mitigate biases and inaccuracies.

## Things to try

`weblab-10b-instruction-sft` is tuned to follow instructions, so a natural first experiment is to give it prompts that state a specific task or objective, or that continue a multi-turn conversation, and compare its responses with those of the base `weblab-10b` model. Varying the prompt wording and the sampling settings (`temperature`, `top_p`), or fine-tuning further on task-specific data, may also help in downstream applications.