replit-code-v1-3b
=======================================

Developed by: Replit, Inc.

[**Test it on our Demo Space!**](https://huggingface.co/spaces/replit/replit-code-v1-3b-demo)

[**Fine-tuning and Instruct-tuning guides**](https://github.com/replit/replitLM)

Model Description
---------------------------------------

`replit-code-v1-3b` is a 2.7B Causal Language Model focused on **Code Completion**. The model has been trained on a subset of the [Stack Dedup v1.2 dataset](https://arxiv.org/abs/2211.15533).

The training mixture includes **20 different languages**, listed here in descending order of number of tokens:  
`Markdown`, `Java`, `JavaScript`, `Python`, `TypeScript`, `PHP`, `SQL`, `JSX`, `reStructuredText`, `Rust`, `C`, `CSS`, `Go`, `C++`, `HTML`, `Vue`, `Ruby`, `Jupyter Notebook`, `R`, `Shell`  
The training dataset contains 175B tokens, which were repeated over 3 epochs, so `replit-code-v1-3b` has been trained on **525B** tokens in total (~195 tokens per parameter).
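
The token budget follows directly from the figures quoted above. A short check, using only those numbers:

```python
# Token budget, using only the figures quoted in this card.
dataset_tokens = 175e9   # tokens in the training dataset
epochs = 3               # number of passes over the dataset
parameters = 2.7e9       # model size

total_tokens = dataset_tokens * epochs
tokens_per_parameter = total_tokens / parameters

print(f"total tokens seen: {total_tokens / 1e9:.0f}B")         # 525B
print(f"tokens per parameter: {tokens_per_parameter:.1f}")     # 194.4
```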

The model has been trained on the [MosaicML](https://www.mosaicml.com/) platform with 256 x A100-40GB GPUs, leveraging their latest [LLM examples repo](https://github.com/mosaicml/examples/tree/release/v0.0.4/examples/llm).  
`replit-code-v1-3b` uses recent LLM techniques, including [Flash Attention](https://arxiv.org/abs/2205.14135) for fast training and inference, [ALiBi positional embeddings](https://arxiv.org/abs/2108.12409) to support variable context length at inference time, and the [LionW optimizer](https://arxiv.org/abs/2302.06675).
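
To illustrate why ALiBi supports variable context length, here is a minimal sketch of the bias described in the ALiBi paper: each attention head adds a penalty that grows linearly with the distance between query and key, so no learned position table limits the sequence length. The helper names and the head count of 8 are illustrative assumptions, not values taken from this model's code, and the model's own implementation may differ in detail.

```python
def alibi_slopes(num_heads):
    """Per-head slopes as in the ALiBi paper (power-of-two head counts only)."""
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    ratio = 2 ** (-8 / num_heads)
    # Geometric sequence: ratio, ratio^2, ..., ratio^num_heads
    return [ratio ** (head + 1) for head in range(num_heads)]


def alibi_bias_row(slope, query_pos):
    """Bias added to the attention scores of one query over keys 0..query_pos."""
    return [-slope * (query_pos - key_pos) for key_pos in range(query_pos + 1)]


slopes = alibi_slopes(8)  # 8 heads is an illustrative choice
print(slopes[:3])
print(alibi_bias_row(slopes[0], 3))
```

Because the bias depends only on the distance `query_pos - key_pos`, the same rule applies to any sequence length.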

Intended Use
-----------------------------

Replit intends this model to be used by anyone as a foundational model for application-specific fine-tuning without strict limitations on commercial use.

Limitations
---------------------------

The pre-training dataset may have contained offensive or inappropriate content even after applying data cleansing filters, and such content may be reflected in model-generated text. We recommend that users exercise reasonable caution when using the model in production systems. Do not use it for any applications that may cause harm or distress to individuals or groups.

License
-------------------

The model checkpoint and vocabulary file are licensed under the Creative Commons license (CC BY-SA 4.0). Under the license, you must give credit to Replit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests that Replit endorses you or your use.

The source code files (`*.py`) are licensed under the Apache 2.0 license.

Contact
-------------------

For questions and comments about the model, please post in the community section.

How to Use
-------------------------

First of all, you need to install the latest versions of the following dependencies:

    einops
    sentencepiece
    torch
    transformers
    

You can then load the model as follows:

    from transformers import AutoModelForCausalLM
    
    # load model
    model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
    

To use the optimized Triton implementation of FlashAttention on GPUs with BF16 precision, first install the following dependencies:

    flash-attn==0.2.8
    triton==2.0.0.dev20221202
    

Then, move the model to `bfloat16` and use it as follows:

    import torch
    from transformers import AutoModelForCausalLM, AutoConfig
    
    config = AutoConfig.from_pretrained(
        "replit/replit-code-v1-3b",
        trust_remote_code=True
    )
    config.attn_config['attn_impl'] = 'triton'
    
    # load model
    model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', config=config, trust_remote_code=True)
    model.to(device='cuda:0', dtype=torch.bfloat16)
    
    # forward pass
    x = torch.tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
    x = x.to(device='cuda:0')
    y = model(x)
    

Note that `trust_remote_code=True` is passed to the `from_pretrained` method because ReplitLM is not a class in the [Transformers](https://huggingface.co/docs/transformers/index) library.

### Tokenizer

We have trained a custom SentencePiece Unigram tokenizer with a code-optimized vocabulary of 32768 tokens.

Note that using this requires the `sentencepiece` library to be installed.

The tokenizer can be used as follows:

    from transformers import AutoTokenizer
    
    # load tokenizer
    tokenizer = AutoTokenizer.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
    
    # single input encoding + generation (uses the `model` loaded earlier)
    x = tokenizer.encode('def hello():\n  print("hello world")\n', return_tensors='pt')
    y = model.generate(x)
    
    # decoding, clean_up_tokenization_spaces=False to ensure syntactical correctness
    generated_code = tokenizer.decode(y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
    print(generated_code)
    

Note that:

*   `trust_remote_code=True` is passed to the `from_pretrained` method because ReplitLM is not a class in the [Transformers](https://huggingface.co/docs/transformers/index) library.
*   `clean_up_tokenization_spaces=False` is meant to avoid removing spaces in the output, because that would affect the syntactical correctness of the generated code.
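
To see why the second note matters, the sketch below mimics the kind of punctuation clean-up that decoding can apply to natural-language text. `naive_cleanup` is a hypothetical, simplified stand-in written for this example, not the `transformers` implementation; it only shows how removing spaces before punctuation can silently change a string literal in generated code.

```python
def naive_cleanup(text):
    """Hypothetical stand-in for prose-style clean-up: drop spaces before punctuation."""
    for spaced, tight in ((" .", "."), (" ?", "?"), (" !", "!"), (" ,", ",")):
        text = text.replace(spaced, tight)
    return text


generated = "parts = ' , '.join(names)"
cleaned = naive_cleanup(generated)

print(generated)  # separator is ' , '
print(cleaned)    # separator became ', ', so the program's output changes
```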

### Generation

You can generate code using the `transformers` library as follows:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
    
    x = tokenizer.encode('def fibonacci(n): ', return_tensors='pt')
    y = model.generate(x, max_length=100, do_sample=True, top_p=0.95, top_k=4, temperature=0.2, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id)
    
    # decoding, clean_up_tokenization_spaces=False to ensure syntactical correctness
    generated_code = tokenizer.decode(y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
    print(generated_code)
    

Experiment with different decoding methods and parameters to get the best results for your use case.
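
As a rough guide to how `temperature`, `top_k`, and `top_p` interact, the following pure-Python sketch applies them in sequence to a toy list of logits. It is a simplified illustration under stated assumptions (temperature first, then top-k, then top-p on the renormalized probabilities), not the `transformers` implementation; `filtered_distribution` and the example logits are hypothetical.

```python
import math


def filtered_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    # Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [logit / temperature for logit in logits]
    # Rank token ids from most to least likely.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    # Top-k: keep only the k highest-scoring tokens (0 disables the filter).
    if top_k > 0:
        order = order[:top_k]
    # Softmax over the surviving tokens (shifted by the max for stability).
    peak = scaled[order[0]]
    weights = {i: math.exp(scaled[i] - peak) for i in order}
    total = sum(weights.values())
    probs = {i: w / total for i, w in weights.items()}
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical scores for four tokens
print(filtered_distribution(toy_logits, temperature=0.2, top_k=4, top_p=0.95))
print(filtered_distribution(toy_logits, temperature=1.0, top_k=2, top_p=1.0))
```

With the low temperature used in the example above, almost all probability mass lands on the top token, so sampling behaves close to greedy decoding; raising the temperature or `top_p` lets more candidates through.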

### Loading with 8-bit and 4-bit quantization

#### Loading in 8-bit

You can also load the model in 8-bit with the `load_in_8bit=True` kwarg that uses `bitsandbytes` under the hood.

First you need to install the following additional dependencies:

    accelerate
    bitsandbytes
    

Then you can load the model in 8-bit as follows:

    model = AutoModelForCausalLM.from_pretrained("replit/replit-code-v1-3b", 
                                                 trust_remote_code=True, 
                                                 device_map="auto",
                                                 load_in_8bit=True)
    

The additional kwargs that make this possible are `device_map='auto'` and `load_in_8bit=True`.

#### Loading in 4-bit

At the time of writing, support for `load_in_4bit` has not been merged into the latest releases of `transformers` and `accelerate`. However, you can use it if you install the dependencies from the `main` branches of the published repos:

    pip install git+https://github.com/huggingface/accelerate.git
    pip install git+https://github.com/huggingface/transformers.git
    

Then load in 4-bit with:

    model = AutoModelForCausalLM.from_pretrained("replit/replit-code-v1-3b", 
                                                 trust_remote_code=True, 
                                                 device_map="auto",
                                                 load_in_4bit=True)
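
As a back-of-the-envelope guide to what quantization saves, the sketch below estimates weight storage for the 2.7B parameters at different precisions. It counts weights only and is an approximation: in practice, quantized loading keeps some tensors in higher precision and stores extra scaling constants, and activations need memory too, so real usage will be higher. The helper name is hypothetical.

```python
PARAMETERS = 2.7e9  # model size quoted in this card

BITS_PER_WEIGHT = {"float32": 32, "bfloat16": 16, "8-bit": 8, "4-bit": 4}


def weight_memory_gb(parameters, bits):
    """Approximate weight storage in GB (10^9 bytes), ignoring all overhead."""
    return parameters * bits / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:>8}: ~{weight_memory_gb(PARAMETERS, bits):.2f} GB")
```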
    

#### References

*   [Hugging Face's Quantization Doc](https://huggingface.co/docs/transformers/main/main_classes/quantization)
*   [Original Blogpost introducing 8-bit](https://huggingface.co/blog/hf-bitsandbytes-integration)
*   [New Blogpost introducing 4-bit](https://huggingface.co/blog/4bit-transformers-bitsandbytes)

### Post Processing

Note that as with all code generation models, post-processing of the generated code is important. In particular, the following post-processing steps are recommended:

*   stop generation when the EOS token is encountered
*   remove trailing whitespace
*   set `max_tokens` (`max_length` or `max_new_tokens` in `transformers`) to a reasonable value based on your completion use case
*   truncate generation to stop words such as `return`, `def`, "\`\`\`", "`\n\n\n`" to avoid generating incomplete code when `max_tokens` is larger than the length of the expected generated code.
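
The truncation and whitespace steps above can be sketched as a small helper. This is a minimal illustration, assuming it is applied only to the newly generated text (with the prompt already removed); the function name and the particular stop sequences are hypothetical choices for completing a single function body and should be adapted to your use case.

```python
def postprocess_completion(completion, stop_sequences):
    """Cut a completion at the earliest stop sequence and strip trailing whitespace."""
    cut = len(completion)
    for stop in stop_sequences:
        index = completion.find(stop)
        if index != -1:
            cut = min(cut, index)
    return completion[:cut].rstrip()


# Hypothetical stop sequences: a run of blank lines, a code fence,
# or the start of another top-level function.
STOP_SEQUENCES = ["\n\n\n", "```", "\ndef "]

raw = "\n    return n * 2\n\n\ndef unrelated():\n    pass   "
print(repr(postprocess_completion(raw, STOP_SEQUENCES)))
```

Matching `"\ndef "` rather than a bare `def` avoids cutting inside identifiers that merely contain those letters.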

## Model overview

`replit-code-v1-3b` is a 2.7B Causal Language Model developed by [Replit](https://aimodels.fyi/creators/huggingFace/replit) that is focused on code completion. It has been trained on 20 languages, including Markdown, Java, JavaScript, and Python, for a total of 525B tokens (a 175B-token dataset repeated over 3 epochs). Like [StarCoder](https://aimodels.fyi/models/huggingFace/starcoder-bigcode), and unlike models built for other tasks such as the relation-extraction model [rebel-large](https://aimodels.fyi/models/huggingFace/rebel-large-babelscape), `replit-code-v1-3b` is tailored specifically for code generation tasks.

## Model inputs and outputs

`replit-code-v1-3b` takes text input and generates text output, with a focus on producing code snippets. It uses Flash Attention for fast training and inference, and ALiBi positional embeddings to support variable context length at inference time.

### Inputs
- Text prompts, which can include a mix of natural language and code

### Outputs
- Autoregressive text generation, with a focus on producing valid and relevant code snippets
- The model can generate multi-line code outputs

## Capabilities

`replit-code-v1-3b` excels at code completion tasks, where it can generate relevant and functional code to extend or complete a given programming snippet. It has been trained on a diverse set of languages, allowing it to handle a wide range of coding tasks.

## What can I use it for?

The `replit-code-v1-3b` model is well-suited for applications that involve code generation or assistance, such as:

- Integrated development environment (IDE) plugins that provide intelligent code completion
- Automated code generation tools for rapid prototyping or boilerplate creation
- Educational or learning platforms that help users learn to code by providing helpful suggestions

## Things to try

One interesting thing to try with `replit-code-v1-3b` is to provide it with a partial code snippet and see how it can complete or extend the code. You could also experiment with providing the model with a natural language description of a programming task and see if it can generate the corresponding code.
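
One possible way to phrase a natural-language task for a completion model is to put the description in a comment and follow it with a function signature for the model to continue. The sketch below only builds such a prompt string; `build_prompt` is a hypothetical helper, and the resulting text would then be passed to the tokenizer as in the Generation example.

```python
def build_prompt(description, signature):
    """Turn a task description and a function signature into a completion prompt."""
    comment_lines = [f"# {line}" for line in description.strip().splitlines()]
    # End with a newline so the model continues on the first line of the body.
    return "\n".join(comment_lines + [signature.rstrip(), ""])


prompt = build_prompt(
    "Return the n-th Fibonacci number.",
    "def fibonacci(n):",
)
print(prompt)
```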