Phind-CodeLlama-34B-Python-v1

Maintainer: Phind

Total Score

249

Last updated 5/28/2024

🧠

  • Model Link: View on HuggingFace
  • API Spec: View on HuggingFace
  • Github Link: No Github link provided
  • Paper Link: No paper link provided

Model overview

The Phind-CodeLlama-34B-Python-v1 model is Phind's fine-tuned version of CodeLlama-34B-Python; together with its sibling Phind-CodeLlama-34B-v1 (fine-tuned from CodeLlama-34B), the pair achieve 69.5% and 67.6% pass@1 on the HumanEval benchmark respectively. Both results exceed the 67% reported for GPT-4 on the same benchmark. The models were fine-tuned on a proprietary dataset of 80k high-quality programming problems and solutions, with decontamination applied to ensure the validity of the results.

Model inputs and outputs

This model is a text-to-text AI assistant: it takes in user prompts and generates relevant text responses. It is somewhat instruction-tuned, but not fully chat-tuned, so avoid the Llama chat markup and instead simply state your task or request followed by `\n: `.
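As a concrete sketch of this prompt convention, a small helper can append the suffix described above (the helper name is made up for illustration; only the `\n: ` suffix comes from the model's documentation):

```python
def build_prompt(task: str) -> str:
    """Format a task for Phind-CodeLlama-34B-Python-v1.

    The model is not chat-tuned, so no Llama chat markup is used;
    the task is simply followed by "\n: " as described above.
    """
    return task.rstrip() + "\n: "

prompt = build_prompt("Write me a linked list implementation")
print(prompt)
```

The same helper can be reused for any task-oriented request you send to the model.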

Inputs

  • User prompts or instructions for the model, such as `"Write me a linked list implementation:\n"`

Outputs

  • Textual responses from the model, such as a linked list implementation in code form.
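For reference, the output for the linked-list prompt above might resemble something like the following. This is a hand-written illustration of the kind of code the model produces, not actual model output:

```python
class Node:
    """A single element of a singly linked list."""
    def __init__(self, value, next=None):
        self.value = value
        self.next = next


class LinkedList:
    """Minimal singly linked list supporting append and search."""
    def __init__(self):
        self.head = None

    def append(self, value):
        """Add a value to the end of the list."""
        node = Node(value)
        if self.head is None:
            self.head = node
            return
        current = self.head
        while current.next is not None:
            current = current.next
        current.next = node

    def find(self, value):
        """Return the first node holding `value`, or None."""
        current = self.head
        while current is not None:
            if current.value == value:
                return current
            current = current.next
        return None

    def to_list(self):
        """Return the list contents as a Python list."""
        out, current = [], self.head
        while current is not None:
            out.append(current.value)
            current = current.next
        return out
```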

Capabilities

The Phind-CodeLlama-34B-v1 model is a capable code generation and understanding model, excelling at tasks like code completion, infilling, and following programming instructions. It has been trained to be proficient in Python, as well as other programming languages like C/C++, TypeScript, and Java.

What can I use it for?

This model could be useful for a variety of software development and programming tasks, such as:

  • Generating boilerplate code or code snippets
  • Assisting with programming problem-solving and debugging
  • Translating between different programming languages
  • Automating repetitive coding tasks

However, as the Phind team notes, the model has undergone limited testing and additional safety measures should be taken before deploying it in real-world applications.

Things to try

One interesting aspect of this model is its use of instruction-tuning rather than traditional chat-based prompting. This makes it better suited for task-oriented interactions, where the user provides a clear request or instruction, rather than open-ended conversations. Experiment with providing the model with concise, well-defined programming tasks and see how it responds.
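To try this end to end, the sketch below wires a task-oriented prompt into a standard Hugging Face transformers generation call. The repository name `Phind/Phind-CodeLlama-34B-Python-v1` and the generation parameters are assumptions to adjust for your setup, and the model load is kept inside the function because the fp16 weights are tens of gigabytes:

```python
def generate(task: str,
             model_name: str = "Phind/Phind-CodeLlama-34B-Python-v1") -> str:
    """Run one task-oriented completion (requires a large GPU)."""
    # Imports are local so this sketch can be inspected and imported
    # without pulling in torch/transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

    # No chat markup: the plain task followed by "\n: ", per the notes above.
    prompt = task.rstrip() + "\n: "
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)

    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# Usage (requires roughly 70 GB of GPU memory for fp16 weights):
#   print(generate("Write a function that reverses a string"))
```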



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Models

⚙️

Phind-CodeLlama-34B-v1

Phind

Total Score

322

Phind-CodeLlama-34B-v1 is a fine-tuned version of the CodeLlama-34B and CodeLlama-34B-Python models. It achieves 67.6% pass@1 on the HumanEval benchmark, outperforming GPT-4's reported 67%. The model was fine-tuned on a proprietary dataset of 80k high-quality programming problems and solutions. This dataset consists of instruction-answer pairs, which makes it structurally different from HumanEval. Phind-CodeLlama-34B-v2 is a newer version of this model that achieves 73.8% pass@1 on HumanEval; it has been further fine-tuned on an additional 1.5B tokens of high-quality programming data and is instruction-tuned to be more steerable and easy to use.

Model inputs and outputs

Inputs

  • Prompts for code generation or instruction following. The model is somewhat instruction-tuned, but not fully chat-tuned like Phind-CodeLlama-34B-v2.

Outputs

  • Generated code or text responses to prompts. The model can produce a wide variety of code in multiple programming languages, including Python, C/C++, TypeScript, and Java.

Capabilities

The Phind-CodeLlama-34B-v1 model excels at tasks like code completion, code infilling, and following code-related instructions. It is a powerful tool for automating programming tasks and assisting developers.

What can I use it for?

This model could be used to build intelligent code assistants, code generation tools, or other applications that require high-quality code synthesis. It may also be useful for research into large language models for programming.

Things to try

For best results, avoid the Llama chat markup and instead simply tell the model what you want it to do, adding `\n: ` at the end of your prompt. This allows the model to better understand and respond to your request.


🛸

Phind-CodeLlama-34B-v2

Phind

Total Score

788

Phind-CodeLlama-34B-v2 is a 34-billion-parameter language model fine-tuned by Phind on 1.5B tokens of high-quality programming data. It achieves 73.8% pass@1 on the HumanEval benchmark, making it one of the strongest open-source models for code generation at the time of its release. This model has been further instruction-tuned on the Alpaca/Vicuna format to be steerable and easy to use. It is comparable to other large language models like CodeLlama-13b-Instruct-hf and CodeLlama-7b-Instruct-hf from Meta, but with improved performance on programming tasks.

Model inputs and outputs

Inputs

  • Text prompts: The model accepts text prompts in the Alpaca/Vicuna instruction format, where the user provides a task description or query for the model to respond to.

Outputs

  • Generated text: The model generates fluent text completions in response to the input prompts. It can produce code snippets, explanations, and solutions to programming problems.

Capabilities

Phind-CodeLlama-34B-v2 is a powerful code generation model that can handle a variety of programming tasks, from implementing data structures in C++ to solving algorithmic problems in Python. It demonstrates strong capabilities in areas like code completion, infilling, and following natural language instructions. The model is also multilingual, with proficiency in languages like Python, C/C++, TypeScript, and Java.

What can I use it for?

This model can be used for a wide range of programming-related applications, such as building intelligent code assistants, automating code generation, and enhancing developer productivity. Potential use cases include:

  • Code completion: Suggesting relevant code snippets or completions as a developer is writing code.
  • Code generation: Generating full program solutions from high-level descriptions or requirements.
  • Prototyping and ideation: Quickly exploring different coding approaches or solutions to problems.
  • Educational tools: Assisting students in learning to code or understanding programming concepts.
  • Technical content generation: Automatically producing technical documentation, tutorials, or educational materials.

Things to try

One interesting aspect of Phind-CodeLlama-34B-v2 is its ability to follow natural language instructions and generate code that meets specific requirements. For example, you could prompt the model to "Implement a linked list in C++ that supports insertion, deletion, and search operations" and it would generate a working code solution. This makes the model well suited for building AI-powered programming assistants that can understand and execute coding tasks.

Another intriguing capability is the model's multilingual proficiency. You could try prompting it with programming problems in different languages and observe how it handles the task. This could be useful for building applications that need to work across a variety of programming languages.
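The Alpaca/Vicuna-style prompt can be sketched as a small helper. The section headers below follow the template published on the Phind-CodeLlama-34B-v2 model card; treat the exact wording as an assumption if you are targeting a different checkpoint:

```python
def alpaca_prompt(user_message: str,
                  system_prompt: str = "You are an intelligent programming assistant.") -> str:
    """Build an Alpaca/Vicuna-style prompt for Phind-CodeLlama-34B-v2.

    Section headers follow the format shown on the model card
    ("### System Prompt" / "### User Message" / "### Assistant").
    """
    return (f"### System Prompt\n{system_prompt}\n\n"
            f"### User Message\n{user_message}\n\n"
            f"### Assistant\n")

print(alpaca_prompt("Implement a linked list in C++ that supports "
                    "insertion, deletion, and search operations"))
```

The trailing `### Assistant\n` header leaves the model positioned to start its reply, which is what makes this format steerable.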


↗️

Phind-CodeLlama-34B-v2-GPTQ

TheBloke

Total Score

86

The Phind-CodeLlama-34B-v2-GPTQ is a quantized version of Phind's large language model, CodeLlama 34B v2. This model was created by the maintainer TheBloke and is available in various quantization formats, including GPTQ, AWQ, and GGUF. The GPTQ models offer multiple quantization parameter options to suit different hardware requirements and performance needs, allowing users to choose the best trade-off between model size, inference speed, and quality for their specific use case. Similar models include the Phind-CodeLlama-34B-v2-GGUF, which provides 2-8 bit GGUF formats for CPU and GPU inference, and the Llama-2-13B-GPTQ, a quantized version of Meta's Llama 2 13B model.

Model inputs and outputs

Inputs

  • Text prompts: The model accepts text prompts as input, which can be used to generate continuations, complete tasks, or engage in conversations.

Outputs

  • Generated text: The model outputs generated text, which can range from short completions to long-form responses depending on the prompt and use case.

Capabilities

The Phind-CodeLlama-34B-v2-GPTQ model is capable of a wide range of natural language processing tasks, including code generation, question answering, summarization, and open-ended conversation. The underlying model achieves a 73.8% pass@1 score on the HumanEval benchmark, making it one of the most capable open-source language models for programming-related tasks.

What can I use it for?

The Phind-CodeLlama-34B-v2-GPTQ model can be used for a variety of applications, such as:

  • Code generation and assistance: Generating, explaining, and debugging code snippets, and providing intelligent assistance for software developers.
  • Language modeling and generation: General-purpose language modeling, text generation, and conversational applications.
  • Transfer learning and fine-tuning: The pre-trained model can be further fine-tuned on domain-specific datasets to create specialized models for various NLP tasks.

Things to try

One interesting aspect of the Phind-CodeLlama-34B-v2-GPTQ model is its ability to generate high-quality code across multiple programming languages, including Python, C/C++, TypeScript, and Java. Developers can experiment with providing the model with programming prompts and observing the generated code, then use it to assist with tasks like prototyping, refactoring, or implementing new features. The model's strong performance on the HumanEval benchmark suggests it could be a valuable tool for automating certain programming workflows.
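Loading one of these quantized checkpoints can be sketched with the standard transformers API, which handles GPTQ weights through its optimum/auto-gptq integration. The repository id `TheBloke/Phind-CodeLlama-34B-v2-GPTQ` matches the card above, but the branch names and hardware notes are assumptions to check against the repository's README:

```python
def load_gptq_model(model_id: str = "TheBloke/Phind-CodeLlama-34B-v2-GPTQ",
                    revision: str = "main"):
    """Load a GPTQ-quantized checkpoint with transformers.

    Requires a CUDA GPU plus the optimum/auto-gptq integration; the
    `revision` argument selects one of the maintainer's quantization
    branches (e.g. different bit width or group size).
    """
    # Local imports keep this sketch importable without torch/transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        revision=revision,
        device_map="auto",  # place layers across available GPUs
    )
    return tokenizer, model

# Usage (requires a CUDA GPU and the auto-gptq / optimum packages):
#   tokenizer, model = load_gptq_model()
```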


🔄

CodeLlama-34b-Python-hf

codellama

Total Score

93

CodeLlama-34b-Python-hf is a 34-billion-parameter language model from the Code Llama family, fine-tuned specifically for Python code synthesis and understanding. It is part of a larger collection of Code Llama models ranging from 7 billion to 70 billion parameters, with variants designed for general code tasks, Python, and instruction following. Similar models include smaller 7B and 13B Python versions, as well as larger 70B models.

Model inputs and outputs

CodeLlama-34b-Python-hf is an autoregressive language model that takes in text input and generates text output. It is designed to excel at code-related tasks such as code completion and instruction following.

Inputs

  • Text prompts

Outputs

  • Generated text, including code snippets and responses to instructions

Capabilities

The CodeLlama-34b-Python-hf model is highly capable at a variety of code-related tasks. It can generate original code, complete partially written code, and follow natural language instructions to write code. The model's Python specialization allows it to handle the Python programming language particularly well.

What can I use it for?

CodeLlama-34b-Python-hf and the broader Code Llama family of models are intended for commercial and research use in programming-related applications. The Python-specialized variant could be used to build interactive code assistants, augment developer productivity through code completion, or generate synthetic training data for machine learning. The Responsible Use Guide provides important guidance on the safe and ethical deployment of these models.

Things to try

One interesting aspect of CodeLlama-34b-Python-hf is its ability to seamlessly mix natural language and code. You could prompt the model with a partially written Python function and ask it to continue the implementation, or provide a high-level description of a task and have the model generate the corresponding code. The model's strong performance on Python-specific constructs makes it a powerful tool for automating code-related workflows.
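For instance, given the opening lines of a function, a base completion model like this one simply continues the source text. The continuation below is hand-written to illustrate the pattern; it is not actual model output:

```python
# Prompt: the partially written function handed to the model.
PARTIAL = '''\
def fibonacci(n: int) -> int:
    """Return the n-th Fibonacci number (0-indexed)."""
'''

# Illustrative continuation of the kind the model might generate:
def fibonacci(n: int) -> int:
    """Return the n-th Fibonacci number (0-indexed)."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
```

No chat template or special tokens are needed for this style of use; the raw partial source is the entire prompt.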
