## Model Overview

`Qwen-7B-Chat` is a large language model developed by Qwen, a team from Alibaba Cloud. It is a transformer-based model pretrained on a large volume of data, including web texts, books, and code. `Qwen-7B-Chat` is an aligned version of the `Qwen-7B` base model, further trained with alignment techniques to improve its conversational abilities.

Compared to similar models like [Baichuan-7B](https://aimodels.fyi/models/huggingFace/baichuan-7b-baichuan-inc), `Qwen-7B-Chat` benefits from the Qwen model series' optimization for both Chinese and English. The model achieves strong performance on standard benchmarks such as C-Eval and MMLU. Unlike the original LLaMA, whose license prohibits commercial use, `Qwen-7B-Chat` is released under a license that also permits commercial applications, subject to its terms.

## Model Inputs and Outputs

### Inputs
- **Text prompts**: `Qwen-7B-Chat` accepts text prompts as input, which can be used to initiate conversations or provide instructions for the model.

### Outputs
- **Text responses**: The model generates coherent and contextually relevant text responses based on the input prompts. The responses aim to be informative, engaging, and helpful for the user.

## Capabilities

`Qwen-7B-Chat` demonstrates strong performance across a variety of natural language tasks, including open-ended conversations, question answering, summarization, and even code generation. The model can engage in multi-turn dialogues, maintain context, and provide detailed and thoughtful responses.

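Multi-turn context works because the whole conversation is sent back to the model on every turn. The sketch below assumes the ChatML layout (`<|im_start|>` / `<|im_end|>` markers) used by Qwen chat models and a history kept as a list of `(query, response)` pairs, the form returned by `model.chat` in the Qwen-14B-Chat example. `build_chatml_prompt` is a hypothetical helper; Qwen's own code may additionally truncate old turns to fit the context window.

```python
from typing import List, Tuple

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def _turn(role: str, content: str) -> str:
    # One ChatML block: role on the first line, then the content, then the end marker.
    return f"{IM_START}{role}\n{content}{IM_END}"


def build_chatml_prompt(
    query: str,
    history: List[Tuple[str, str]],
    system: str = "You are a helpful assistant.",
) -> str:
    """Render the system prompt, past (query, response) pairs and the new query.

    The trailing open assistant block asks the model to write the next reply.
    """
    blocks = [_turn("system", system)]
    for past_query, past_response in history:
        blocks.append(_turn("user", past_query))
        blocks.append(_turn("assistant", past_response))
    blocks.append(_turn("user", query))
    blocks.append(f"{IM_START}assistant\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    history = [("Hi, who are you?", "I am a helpful assistant.")]
    print(build_chatml_prompt("What did I just ask you?", history))
```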
For example, when prompted with "Tell me about the history of the internet", `Qwen-7B-Chat` is able to provide a comprehensive overview covering the key developments and milestones in the history of the internet, drawing upon its broad knowledge base.

## What Can I Use It For?

`Qwen-7B-Chat` can be a valuable tool for a wide range of applications, including:

- **Conversational AI assistants**: The model's strong conversational abilities make it well-suited for building engaging and intelligent virtual assistants that can help with a variety of tasks.
- **Content generation**: `Qwen-7B-Chat` can be used to generate high-quality text content, such as articles, stories, or even marketing copy, by providing relevant prompts.
- **Chatbots and customer service**: The model's ability to understand and respond to natural language queries makes it a good fit for building chatbots and virtual customer service agents.
- **Educational applications**: `Qwen-7B-Chat` can be used to create interactive learning experiences, answer questions, and provide explanations on a variety of topics.

## Things to Try

One interesting aspect of `Qwen-7B-Chat` is its ability to engage in open-ended conversations and provide detailed, contextually relevant responses. For example, try prompting the model with a more abstract or philosophical question, such as "What is the meaning of life?" or "How can we achieve true happiness?" The responses can surface interesting perspectives and give you a sense of how the model reasons about open-ended questions.

Another area to explore is the model's ability to handle complex tasks, such as providing step-by-step instructions for a multi-part process or generating coherent and logical code snippets. By testing the model's capabilities in these more challenging areas, you can gain a better understanding of its strengths and limitations.

[](#qwen2-72b-instruct)Qwen2-72B-Instruct
=========================================

[](#introduction)Introduction
-----------------------------

Qwen2 is the new series of Qwen large language models. For Qwen2, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters, including a Mixture-of-Experts model. This repo contains the instruction-tuned 72B Qwen2 model.

Compared with state-of-the-art open-source language models, including the previously released Qwen1.5, Qwen2 has generally surpassed most open-source models and demonstrated competitiveness against proprietary models across a series of benchmarks targeting language understanding, language generation, multilingual capability, coding, mathematics, reasoning, etc.

Qwen2-72B-Instruct supports a context length of up to 131,072 tokens, enabling the processing of extensive inputs. Please refer to [this section](#processing-long-texts) for detailed instructions on how to deploy Qwen2 for handling long texts.

For more details, please refer to our [blog](https://qwenlm.github.io/blog/qwen2/), [GitHub](https://github.com/QwenLM/Qwen2), and [Documentation](https://qwen.readthedocs.io/en/latest/).  

[](#model-details)Model Details
-------------------------------

Qwen2 is a language model series including decoder language models of different sizes. For each size, we release the base language model and the aligned chat model. The series is based on the Transformer architecture with SwiGLU activation, attention QKV bias, group query attention, etc. Additionally, we have an improved tokenizer adaptive to multiple natural languages and code.

[](#training-details)Training details
-------------------------------------

We pretrained the models with a large amount of data, and we post-trained the models with both supervised finetuning and direct preference optimization.

[](#requirements)Requirements
-----------------------------

Support for Qwen2 is included in recent Hugging Face transformers releases. We advise you to install `transformers>=4.37.0`; otherwise you might encounter the following error:

    KeyError: 'qwen2'
    
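To fail fast instead of hitting `KeyError: 'qwen2'` at load time, you can check the installed version first. This is a minimal sketch: it reads the version with the standard-library `importlib.metadata` and compares only the numeric `major.minor.patch` prefix, so pre-release suffixes such as `.dev0` are ignored (an assumption; the `packaging` library handles such cases properly). Comparing tuples rather than strings matters because `"4.9.0" > "4.37.0"` as strings.

```python
import re
from importlib.metadata import PackageNotFoundError, version

MIN_TRANSFORMERS = "4.37.0"  # minimum version recommended above


def numeric_version(text: str) -> tuple:
    """Turn '4.37.2' or '4.38.0.dev0' into (4, 37, 2) / (4, 38, 0)."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    return numeric_version(installed) >= numeric_version(minimum)


def check_transformers() -> None:
    """Raise early if transformers is missing or too old for Qwen2."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        raise RuntimeError("transformers is not installed") from None
    if not meets_minimum(installed):
        raise RuntimeError(
            f"transformers {installed} found; Qwen2 needs >= {MIN_TRANSFORMERS}"
        )
```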

[](#quickstart)Quickstart
-------------------------

Here is a code snippet using `apply_chat_template` that shows how to load the tokenizer and model and how to generate content.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    device = "cuda" # the device to load the model onto
    
    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen2-72B-Instruct",
        torch_dtype="auto",
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-72B-Instruct")
    
    prompt = "Give me a short introduction to large language model."
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)
    
    generated_ids = model.generate(
        **model_inputs,  # passes input_ids together with the attention_mask
        max_new_tokens=512
    )
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    
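`generate` returns the prompt tokens followed by the newly generated tokens, which is why the snippet slices each output row at `len(input_ids)` before decoding. The same step on plain Python lists, with made-up token IDs and no model needed:

```python
from typing import List, Sequence


def strip_prompt(
    batch_input_ids: Sequence[Sequence[int]],
    batch_output_ids: Sequence[Sequence[int]],
) -> List[List[int]]:
    """Keep only the newly generated tokens of each sequence in the batch."""
    return [
        list(output_ids[len(input_ids):])
        for input_ids, output_ids in zip(batch_input_ids, batch_output_ids)
    ]


prompt_ids = [[101, 102, 103]]
# generate() echoes the prompt, then appends the new tokens.
output_ids = [[101, 102, 103, 7, 8, 9]]
print(strip_prompt(prompt_ids, output_ids))  # [[7, 8, 9]]
```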

### [](#processing-long-texts)Processing Long Texts

To handle extensive inputs exceeding 32,768 tokens, we utilize [YaRN](https://arxiv.org/abs/2309.00071), a technique for enhancing model length extrapolation, to maintain performance on lengthy texts.

For deployment, we recommend using vLLM. You can enable the long-context capabilities by following these steps:

1.  **Install vLLM**: You can install vLLM by running the following command.
    
        pip install "vllm>=0.4.3"
    
    Or you can install vLLM from [source](https://github.com/vllm-project/vllm/).

2.  **Configure Model Settings**: After downloading the model weights, modify the `config.json` file by including the below snippet:
    
            {
                "architectures": [
                    "Qwen2ForCausalLM"
                ],
                // ...
                "vocab_size": 152064,
        
                // adding the following snippets
                "rope_scaling": {
                    "factor": 4.0,
                    "original_max_position_embeddings": 32768,
                    "type": "yarn"
                }
            }
        
    
    This snippet enables YaRN to support longer contexts.
    
3.  **Model Deployment**: Utilize vLLM to deploy your model. For instance, you can set up an OpenAI-compatible server using the command:
    
        python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-72B-Instruct --model path/to/weights
        
    
    Then you can access the Chat API by:
    
        curl http://localhost:8000/v1/chat/completions \
            -H "Content-Type: application/json" \
            -d '{
            "model": "Qwen2-72B-Instruct",
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "Your Long Input Here."}
            ]
            }'
        
    
    For further usage instructions of vLLM, please refer to our [Github](https://github.com/QwenLM/Qwen2).
    
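The same request can also be sent from Python using only the standard library. This is a sketch assuming the server started above listens on `http://localhost:8000` with no API key configured; `build_chat_request` and `chat` are hypothetical helper names, and the reply is read from `choices[0].message.content` as in the OpenAI chat-completions response format.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1/chat/completions"  # assumed vLLM address


def build_chat_request(
    user_text: str, model: str = "Qwen2-72B-Instruct", url: str = BASE_URL
) -> urllib.request.Request:
    payload = {
        "model": model,  # must match --served-model-name
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_text},
        ],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_text: str, timeout: float = 600.0) -> str:
    """Send the request to the running server and return the reply text."""
    request = build_chat_request(user_text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```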

**Note**: Presently, vLLM only supports static YaRN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts**. We advise adding the `rope_scaling` configuration only when processing long contexts is required.

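Because `rope_scaling` should only be present when long inputs are needed, a small script can toggle it instead of hand-editing the file. A minimal sketch, assuming `config.json` is plain JSON (the `// ...` lines in step 2 are only placeholders) and using hypothetical helper names. With `factor` 4.0 and `original_max_position_embeddings` 32768, the scaled window is 4.0 × 32,768 = 131,072 tokens, the context length quoted at the top of this card.

```python
import json
from pathlib import Path

YARN_SETTINGS = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}


def _load(path: Path) -> dict:
    return json.loads(path.read_text(encoding="utf-8"))


def _save(path: Path, config: dict) -> None:
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")


def enable_yarn(config_path: str) -> int:
    """Add the rope_scaling block and return the resulting context length."""
    path = Path(config_path)
    config = _load(path)
    config["rope_scaling"] = dict(YARN_SETTINGS)
    _save(path, config)
    scaling = config["rope_scaling"]
    return int(scaling["factor"] * scaling["original_max_position_embeddings"])


def disable_yarn(config_path: str) -> None:
    """Remove rope_scaling again, e.g. when serving mostly short prompts."""
    path = Path(config_path)
    config = _load(path)
    config.pop("rope_scaling", None)
    _save(path, config)


# Usage (the path is an assumption):
# print(enable_yarn("path/to/weights/config.json"))
```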
[](#evaluation)Evaluation
-------------------------

We briefly compare Qwen2-72B-Instruct with similar-sized instruction-tuned LLMs, including our previous Qwen1.5-72B-Chat. The results are shown as follows:

| Datasets | Llama-3-70B-Instruct | Qwen1.5-72B-Chat | **Qwen2-72B-Instruct** |
| :--- | :---: | :---: | :---: |
| _**English**_ | | | |
| MMLU | 82.0 | 75.6 | **82.3** |
| MMLU-Pro | 56.2 | 51.7 | **64.4** |
| GPQA | 41.9 | 39.4 | **42.4** |
| TheoremQA | 42.5 | 28.8 | **44.4** |
| MT-Bench | 8.95 | 8.61 | **9.12** |
| Arena-Hard | 41.1 | 36.1 | **48.1** |
| IFEval (Prompt Strict-Acc.) | 77.3 | 55.8 | **77.6** |
| _**Coding**_ | | | |
| HumanEval | 81.7 | 71.3 | **86.0** |
| MBPP | **82.3** | 71.9 | 80.2 |
| MultiPL-E | 63.4 | 48.1 | **69.2** |
| EvalPlus | 75.2 | 66.9 | **79.0** |
| LiveCodeBench | 29.3 | 17.9 | **35.7** |
| _**Mathematics**_ | | | |
| GSM8K | **93.0** | 82.7 | 91.1 |
| MATH | 50.4 | 42.5 | **59.7** |
| _**Chinese**_ | | | |
| C-Eval | 61.6 | 76.1 | **83.8** |
| AlignBench | 7.42 | 7.28 | **8.27** |

[](#citation)Citation
---------------------

If you find our work helpful, feel free to cite us.

    @article{qwen2,
      title={Qwen2 Technical Report},
      year={2024}
    }

## Model overview

`Qwen2-72B-Instruct` is the 72 billion parameter version of the Qwen2 series of large language models developed by Qwen. Compared to the state-of-the-art open-source language models, including the previous Qwen1.5 release, Qwen2 has generally surpassed most open-source models and demonstrated competitiveness against proprietary models across a range of benchmarks targeting language understanding, generation, multilingual capability, coding, mathematics, and reasoning. The Qwen2-72B-Instruct model specifically has been instruction-tuned, enabling it to excel at a variety of tasks. 

The Qwen2 series, including the [Qwen2-7B-Instruct](https://aimodels.fyi/models/huggingFace/qwen2-7b-instruct-qwen) and [Qwen2-72B](https://aimodels.fyi/models/huggingFace/qwen2-72b-qwen) models, is based on the Transformer architecture with improvements like SwiGLU activation, attention QKV bias, and group query attention. Qwen has also developed an improved tokenizer that is adaptive to multiple natural languages and code.

## Model inputs and outputs

### Inputs
- Text prompts for language generation, translation, summarization, and other language tasks

### Outputs
- Texts generated in response to the input prompts, with the model demonstrating strong performance on a variety of natural language processing tasks.

## Capabilities

The Qwen2-72B-Instruct model has shown strong performance on benchmarks covering language understanding, generation, multilingual capability, coding, mathematics, and reasoning. In the comparison above, it scores slightly higher than Llama-3-70B-Instruct on MMLU (Massive Multitask Language Understanding; 82.3 vs. 82.0) and clearly higher on coding benchmarks such as HumanEval and MultiPL-E. On the Chinese benchmark C-Eval it reaches 83.8, well ahead of both Llama-3-70B-Instruct (61.6) and Qwen1.5-72B-Chat (76.1).

## What can I use it for?

The Qwen2-72B-Instruct model can be used for a variety of natural language processing tasks, including text generation, language translation, summarization, and question answering. Its strong performance on coding and mathematical reasoning benchmarks also makes it suitable for applications like code generation and problem-solving. Given its multilingual capabilities, the model can be leveraged for international and cross-cultural projects.

## Things to try

One interesting aspect of the Qwen2-72B-Instruct model is its ability to handle long input texts. By using the YaRN technique for length extrapolation (enabled through the `rope_scaling` setting in `config.json`), the model can process inputs of up to 131,072 tokens. This could be useful for applications that work with large amounts of text, such as document summarization or question answering over lengthy passages.

[](#qwen-14b-chat)Qwen-14B-Chat
===============================

![](https://qianwen-res.oss-cn-beijing.aliyuncs.com/logo_qwen.jpg)

  

[Hugging Face](https://huggingface.co/Qwen) | [ModelScope](https://modelscope.cn/organization/qwen) | [Paper](https://arxiv.org/abs/2309.16609) | [Demo](https://modelscope.cn/studios/qwen/Qwen-14B-Chat-Demo/summary)  
[WeChat](https://github.com/QwenLM/Qwen/blob/main/assets/wechat.png) | [Discord](https://discord.gg/z3GAxXZ9Ce) | [API](https://dashscope.aliyun.com)

  

[](#introduction)Introduction
-------------------------------------


**Qwen-14B** is the 14B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-14B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, codes, etc. Additionally, based on the pretrained Qwen-14B, we release Qwen-14B-Chat, a large-model-based AI assistant, which is trained with alignment techniques. This repository is the one for Qwen-14B-Chat.

For more details about the open-source model of Qwen-14B, please refer to the [GitHub](https://github.com/QwenLM/Qwen) code repository.  

[](#requirements)Requirements
-------------------------------------

*   python 3.8 and above
*   pytorch 1.12 and above; 2.0 and above is recommended
*   CUDA 11.4 and above is recommended (for GPU users, flash-attention users, etc.)
    

[](#dependency)Dependency
-----------------------------------


To run Qwen-14B-Chat, please make sure you meet the above requirements, and then execute the following pip commands to install the dependent libraries.

    pip install transformers==4.32.0 accelerate tiktoken einops scipy transformers_stream_generator==0.0.4 peft deepspeed
    


In addition, it is recommended to install the `flash-attention` library (**we support flash attention 2 now.**) for higher efficiency and lower memory usage.

    git clone https://github.com/Dao-AILab/flash-attention
    cd flash-attention && pip install .
    # Below are optional. Installing them might be slow.
    # pip install csrc/layer_norm
    # pip install csrc/rotary
    

  

[](#quickstart)Quickstart
-------------------------------------


We show an example of multi-turn interaction with Qwen-14B-Chat in the following code:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from transformers.generation import GenerationConfig
    
    # Note: The default behavior now has injection attack prevention off.
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-14B-Chat", trust_remote_code=True)
    
    # use bf16
    # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-14B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
    # use fp16
    # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-14B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
    # use cpu only
    # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-14B-Chat", device_map="cpu", trust_remote_code=True).eval()
    # use auto mode, automatically select precision based on the device.
    model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-14B-Chat", device_map="auto", trust_remote_code=True).eval()
    
    # Specify hyperparameters for generation. But if you use transformers>=4.32.0, there is no need to do this.
    # model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-14B-Chat", trust_remote_code=True) # generation length, top_p, etc. can be set here
    
    # 1st dialogue turn
    response, history = model.chat(tokenizer, "Hello", history=None)
    print(response)
    
    # 2nd dialogue turn: pass the returned history to keep the context
    response, history = model.chat(tokenizer, "Tell me a story about a young person who works hard to start a business and eventually succeeds.", history=history)
    print(response)
    
    # 3rd dialogue turn
    response, history = model.chat(tokenizer, "Give this story a title.", history=history)
    print(response)
    


For more information, please refer to our [GitHub repo](https://github.com/QwenLM/Qwen).  

[](#-quantization)Quantization
-------------------------------------

### [](#-usage)Usage


**Note: we provide a new solution based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) and release an Int4 quantized model for Qwen-14B-Chat ([click here](https://huggingface.co/Qwen/Qwen-14B-Chat-Int4)). Compared with the previous solution, it achieves nearly lossless model quality while reducing memory costs and improving inference speed.**


Here we demonstrate how to use our provided quantized models for inference. Before you start, make sure you meet the requirements of auto-gptq (e.g., torch 2.0 and above, transformers 4.32.0 and above, etc.) and install the required packages:

    pip install auto-gptq optimum
    

If you run into problems installing `auto-gptq`, we advise you to check out the official [repo](https://github.com/PanQiWei/AutoGPTQ) for a pre-built wheel.

Then you can load the quantized model and run inference as usual:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-14B-Chat-Int4", trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen-14B-Chat-Int4",
        device_map="auto",
        trust_remote_code=True
    ).eval()
    response, history = model.chat(tokenizer, "Hello", history=None)
    print(response)
    
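To make the Int4 idea concrete: weights are stored as 4-bit integers plus a floating-point scale per group and are dequantized on the fly. The sketch below uses plain round-to-nearest symmetric quantization on a list of floats, which is a simpler technique than GPTQ (GPTQ additionally adjusts the remaining weights to compensate for each rounding error). The group size of 4 and the helper names are illustrative assumptions.

```python
from typing import List, Tuple

QMIN, QMAX = -8, 7  # signed 4-bit integer range


def quantize_group(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest 4-bit quantization of one weight group."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    q = [min(QMAX, max(QMIN, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q: List[int], scale: float) -> List[float]:
    return [v * scale for v in q]


def quantize(weights: List[float], group_size: int = 4):
    """Split a flat weight list into groups, each with its own scale."""
    return [
        quantize_group(weights[start:start + group_size])
        for start in range(0, len(weights), group_size)
    ]


weights = [0.70, -0.35, 0.10, 0.02, 1.40, -1.40, 0.00, 0.20]
groups = quantize(weights)
restored = [x for q, s in groups for x in dequantize_group(q, s)]
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print("int4 codes:", [q for q, _ in groups], "max abs error:", round(max_err, 4))
```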

### Performance


We illustrate the zero-shot performance of the BF16, Int8, and Int4 models on the benchmarks and find that the quantized models do not suffer from significant performance degradation. Results are shown below:

| Quantization | MMLU | CEval (val) | GSM8K | Humaneval |
| :--- | :---: | :---: | :---: | :---: |
| BF16 | 64.6 | 69.8 | 60.1 | 43.9 |
| Int8 | 63.6 | 68.6 | 60.0 | 48.2 |
| Int4 | 63.3 | 69.0 | 59.8 | 45.7 |

### [](#-inference-speed)Inference Speed


We measured the average inference speed of generating 2048 and 8192 tokens with different quantization levels and versions of flash-attention, respectively.

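| Quantization | FlashAttn | Speed, 2048 tokens (tokens/s) | Speed, 8192 tokens (tokens/s) |
| :--- | :---: | :---: | :---: |
| BF16 | v2 | 32.88 | 24.87 |
| Int8 | v2 | 29.28 | 24.22 |
| Int4 | v2 | 38.72 | 27.33 |
| BF16 | v1 | 32.76 | 28.89 |
| Int8 | v1 | 28.31 | 23.87 |
| Int4 | v1 | 37.81 | 26.46 |
| BF16 | Disabled | 29.32 | 22.91 |
| Int8 | Disabled | 31.12 | 24.60 |
| Int4 | Disabled | 37.65 | 26.00 |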


In detail, the setting of profiling is generating 8192 new tokens with 1 context token. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.8. The inference speed is averaged over the generated 8192 tokens.


Note: The generation speed of the Int4/Int8 models above was measured with the autogptq library. Models loaded with `AutoModelForCausalLM.from_pretrained` are currently about 20% slower. We have reported this issue to the HuggingFace team and will update this note promptly if a solution becomes available.

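The profiling setup described above (one context token, a fixed number of new tokens, speed averaged over all generated tokens) amounts to dividing the token count by the wall-clock time. A minimal sketch, where `step` is a hypothetical stand-in for one decoding step of a real model; the reported numbers come from the Qwen team's profiling script, not from this code.

```python
import time
from typing import Callable


def tokens_per_second(step: Callable[[], None], n_tokens: int = 8192) -> float:
    """Call `step` once per generated token and return the average speed."""
    start = time.perf_counter()
    for _ in range(n_tokens):
        step()
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero timer delta
    return n_tokens / elapsed


def fake_step() -> None:
    # Hypothetical stand-in for one decoding step (about 1 ms each).
    time.sleep(0.001)


print(f"{tokens_per_second(fake_step, n_tokens=100):.1f} tokens/s")
```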
### [](#-gpu-memory-usage)GPU Memory Usage


We also profile the peak GPU memory usage for encoding 2048 tokens as context (and generating a single token) and for generating 8192 tokens (with a single token as context) under different quantization levels. The GPU memory usage is similar with or without flash-attention. The results are shown below:

| Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
| :--- | :---: | :---: |
| BF16 | 30.15GB | 38.94GB |
| Int8 | 18.81GB | 27.54GB |
| Int4 | 13.01GB | 21.79GB |

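As a back-of-the-envelope check on the memory table, the weights alone take roughly parameters × bits / 8 bytes. The sketch assumes about 14 billion parameters and treats every weight as quantized, reporting decimal gigabytes. In practice some layers (for example embeddings) typically stay in higher precision, and activations, the KV cache, and framework overhead add to the peak, which is presumably why the measured figures are higher.

```python
BYTES_PER_GB = 1e9  # decimal gigabytes


def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory for the weights only, ignoring activations and the KV cache."""
    return n_params * bits_per_weight / 8 / BYTES_PER_GB


N_PARAMS = 14e9  # assumption: roughly 14B parameters for Qwen-14B-Chat

for name, bits in [("BF16", 16), ("Int8", 8), ("Int4", 4)]:
    print(f"{name}: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB of weights")
```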

The above speed and memory profiling are conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).  

[](#model)Model
---------------------------


The details of the model architecture of Qwen-14B-Chat are listed as follows:

| Hyperparameter | Value |
| :--- | :---: |
| `n_layers` | 40 |
| `n_heads` | 40 |
| `d_model` | 5120 |
| vocab size | 151851 |
| sequence length | 2048 |


For position encoding, FFN activation function, and normalization calculation methods, we adopt the prevalent practices, i.e., RoPE relative position encoding, SwiGLU for activation function, and RMSNorm for normalization (optional installation of flash-attention for acceleration).

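For readers unfamiliar with these components, the following pure-Python sketch shows their textbook definitions on small vectors: RMSNorm rescales a vector by its root mean square, SwiGLU gates one linear projection with the SiLU of another, and RoPE rotates pairs of dimensions by a position-dependent angle, so the dot product of two rotated vectors depends only on their relative position. It is illustrative only (a single gain vector, base 10000 assumed for RoPE, interleaved pairs while some implementations pair the two halves of the vector instead), not Qwen's implementation.

```python
import math
from typing import List

Vector = List[float]


def rms_norm(x: Vector, gain: Vector, eps: float = 1e-6) -> Vector:
    """RMSNorm: divide by the root mean square (no mean subtraction), then scale."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]


def silu(v: float) -> float:
    return v / (1.0 + math.exp(-v))


def swiglu(gate: Vector, up: Vector) -> Vector:
    """SwiGLU: SiLU(x W_gate) * (x W_up), with both projections precomputed."""
    return [silu(g) * u for g, u in zip(gate, up)]


def rope(x: Vector, position: int, base: float = 10000.0) -> Vector:
    """Rotate each (even, odd) pair of dims by position * base^(-i/d), i even."""
    d = len(x)
    out = list(x)
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out
```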
For tokenization, compared with current mainstream open-source models based on Chinese and English vocabularies, Qwen-14B-Chat uses a vocabulary of over 150K tokens. The vocabulary prioritizes efficient encoding of Chinese, English, and code data, and is also friendlier to other languages, enabling users to directly enhance the model's capability in some languages without expanding the vocabulary. It splits numbers into single digits and uses the [tiktoken](https://github.com/openai/tiktoken) tokenizer library for efficient tokenization.  

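Splitting numbers into single digits means a number such as 2048 always becomes the same four digit tokens, instead of depending on which multi-digit merges happen to exist in the vocabulary. A minimal pre-tokenization sketch using a regular expression; it only illustrates the digit rule and is not the actual tiktoken pattern Qwen uses.

```python
import re

# Each digit becomes its own piece; any run of non-digits stays together.
PIECE = re.compile(r"\d|\D+")


def split_digits(text: str) -> list:
    return PIECE.findall(text)


print(split_digits("Qwen-14B has 40 layers"))
# ['Qwen-', '1', '4', 'B has ', '4', '0', ' layers']
```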
[](#evaluation)Evaluation
-------------------------------------


For Qwen-14B-Chat, we also evaluate the model on C-Eval, MMLU, HumanEval, GSM8K, etc., as well as on benchmarks for long-context understanding and tool usage.

Note: Due to rounding errors caused by hardware and framework, differences in reproduced results are possible.

### [](#chinese-evaluation)Chinese Evaluation

#### [](#c-eval)C-Eval


We demonstrate the 0-shot & 5-shot accuracy of Qwen-14B-Chat on the C-Eval validation set:

| Model | Avg. Acc. |
| :--- | :---: |
| LLaMA2-7B-Chat | 31.9 |
| LLaMA2-13B-Chat | 36.2 |
| LLaMA2-70B-Chat | 44.3 |
| ChatGLM2-6B-Chat | 52.6 |
| InternLM-7B-Chat | 53.6 |
| Baichuan2-7B-Chat | 55.6 |
| Baichuan2-13B-Chat | 56.7 |
| Qwen-7B-Chat (original) (0-shot) | 54.2 |
| **Qwen-7B-Chat (0-shot)** | 59.7 |
| **Qwen-7B-Chat (5-shot)** | 59.3 |
| **Qwen-14B-Chat (0-shot)** | 69.8 |
| **Qwen-14B-Chat (5-shot)** | **71.7** |


The zero-shot accuracy of Qwen-14B-Chat on the C-Eval test set is provided below:

| Model | Avg. | STEM | Social Sciences | Humanities | Others |
| :--- | :---: | :---: | :---: | :---: | :---: |
| Chinese-Alpaca-Plus-13B | 41.5 | 36.6 | 49.7 | 43.1 | 41.2 |
| Chinese-Alpaca-2-7B | 40.3 | - | - | - | - |
| ChatGLM2-6B-Chat | 50.1 | 46.4 | 60.4 | 50.6 | 46.9 |
| Baichuan-13B-Chat | 51.5 | 43.7 | 64.6 | 56.2 | 49.2 |
| Qwen-7B-Chat (original) | 54.6 | 47.8 | 67.6 | 59.3 | 50.6 |
| **Qwen-7B-Chat** | 58.6 | 53.3 | 72.1 | 62.8 | 52.0 |
| **Qwen-14B-Chat** | **69.1** | 65.1 | 80.9 | 71.2 | 63.4 |


Compared with other chat models of comparable size, the human-aligned Qwen-14B-Chat achieves the highest C-Eval accuracy in this table.

### [](#english-evaluation)English Evaluation

#### [](#mmlu)MMLU


The 0-shot & 5-shot accuracy of Qwen-14B-Chat on MMLU is provided below. Qwen-14B-Chat remains at the top among human-aligned models of comparable size.

| Model | Avg. Acc. |
| :--- | :---: |
| ChatGLM2-6B-Chat | 46.0 |
| LLaMA2-7B-Chat | 46.2 |
| InternLM-7B-Chat | 51.1 |
| Baichuan2-7B-Chat | 52.9 |
| LLaMA2-13B-Chat | 54.6 |
| Baichuan2-13B-Chat | 57.3 |
| LLaMA2-70B-Chat | 63.8 |
| Qwen-7B-Chat (original) (0-shot) | 53.9 |
| **Qwen-7B-Chat (0-shot)** | 55.8 |
| **Qwen-7B-Chat (5-shot)** | 57.0 |
| **Qwen-14B-Chat (0-shot)** | 64.6 |
| **Qwen-14B-Chat (5-shot)** | **66.5** |

### [](#coding-evaluation)Coding Evaluation


The zero-shot Pass@1 of Qwen-14B-Chat on [HumanEval](https://github.com/openai/human-eval) is shown below:

| Model | Pass@1 |
| :--- | :---: |
| ChatGLM2-6B-Chat | 11.0 |
| LLaMA2-7B-Chat | 12.2 |
| InternLM-7B-Chat | 14.6 |
| Baichuan2-7B-Chat | 13.4 |
| LLaMA2-13B-Chat | 18.9 |
| Baichuan2-13B-Chat | 17.7 |
| LLaMA2-70B-Chat | 32.3 |
| Qwen-7B-Chat (original) | 24.4 |
| **Qwen-7B-Chat** | 37.2 |
| **Qwen-14B-Chat** | **43.9** |

### [](#mathematics-evaluation)Mathematics Evaluation


The accuracy of Qwen-14B-Chat on GSM8K is shown below:

| Model | Acc. |
| :--- | :---: |
| LLaMA2-7B-Chat | 26.3 |
| ChatGLM2-6B-Chat | 28.8 |
| Baichuan2-7B-Chat | 32.8 |
| InternLM-7B-Chat | 33.0 |
| LLaMA2-13B-Chat | 37.1 |
| Baichuan2-13B-Chat | 55.3 |
| LLaMA2-70B-Chat | 59.3 |
| Qwen-7B-Chat (original) (0-shot) | 41.1 |
| **Qwen-7B-Chat (0-shot)** | 50.3 |
| **Qwen-7B-Chat (8-shot)** | 54.1 |
| **Qwen-14B-Chat (0-shot)** | **60.1** |
| **Qwen-14B-Chat (8-shot)** | 59.3 |

### [](#long-context-understanding)Long-Context Understanding


We use NTK-aware interpolation and LogN attention scaling to extend the context length of Qwen-14B-Chat. The Rouge-L results of Qwen-14B-Chat on the long-text summarization dataset [VCSUM](https://arxiv.org/abs/2305.05280) (average length around 15K) are shown below:

**(To use these tricks, please set `use_dynamic_ntk` and `use_logn_attn` to true in config.json.)**

| Model | VCSUM (zh) |
| :--- | :---: |
| GPT-3.5-Turbo-16k | 16.0 |
| LLaMA2-7B-Chat | 0.2 |
| InternLM-7B-Chat | 13.0 |
| ChatGLM2-6B-Chat | 16.3 |
| **Qwen-14B-Chat** | **17.3** |

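The two tricks can be sketched numerically. The formulas below are assumptions modeled on common formulations: LogN scaling multiplies attention queries by log(n) / log(L_train) once the sequence length n exceeds the training length, and dynamic NTK-aware interpolation enlarges the RoPE base by alpha^(d/(d-2)), where alpha grows stepwise as the input gets longer. The exact constants in Qwen's modeling code may differ. The training length 2048 comes from the architecture table, the head dimension 128 = 5120 / 40 follows from it, and the RoPE base of 10000 is assumed.

```python
import math

TRAIN_LEN = 2048      # "sequence length" from the architecture table
HEAD_DIM = 128        # d_model / n_heads = 5120 / 40
ROPE_BASE = 10000.0   # assumed default RoPE base


def logn_scale(seq_len: int, train_len: int = TRAIN_LEN) -> float:
    """Factor applied to queries; exactly 1.0 within the training length."""
    if seq_len <= train_len:
        return 1.0
    return math.log(seq_len) / math.log(train_len)


def ntk_alpha(seq_len: int, train_len: int = TRAIN_LEN) -> float:
    """Dynamic NTK factor: 1 up to train_len, then 3, 7, 15, ... per doubling."""
    exponent = math.ceil(math.log2(seq_len / train_len) + 1)
    return float(max(2 ** exponent - 1, 1))


def ntk_rope_base(seq_len: int, dim: int = HEAD_DIM) -> float:
    """Enlarged RoPE base, which stretches the rotation wavelengths."""
    return ROPE_BASE * ntk_alpha(seq_len) ** (dim / (dim - 2))


for n in (2048, 4096, 16384):
    print(n, round(logn_scale(n), 3), ntk_alpha(n), round(ntk_rope_base(n)))
```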
### [](#tool-usage)Tool Usage

#### [](#react-prompting)ReAct Prompting


Qwen-Chat supports calling plugins/tools/APIs through [ReAct Prompting](https://arxiv.org/abs/2210.03629). ReAct is also one of the main approaches used by the [LangChain](https://python.langchain.com/) framework. In our evaluation benchmark for assessing tool usage capabilities, Qwen-Chat's performance is as follows:

Chinese Tool-Use Benchmark

| Model | Tool Selection (Acc.) | Tool Input (Rouge-L) | False Positive Error |
| :--- | :---: | :---: | :---: |
| GPT-4 | 95% | 0.90 | 15.0% |
| GPT-3.5 | 85% | 0.88 | 75.0% |
| Qwen-7B-Chat | 98% | 0.91 | 7.3% |
| Qwen-14B-Chat | 98% | 0.93 | 2.4% |


> The plugins in the evaluation set do not appear in Qwen's training set. This benchmark evaluates the model's accuracy in selecting the correct plugin from multiple candidates, the soundness of the parameters passed to the plugin, and the false positive rate. False positive: invoking a plugin when responding to a query that should not have triggered a plugin call.

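A ReAct exchange alternates between the model writing `Thought`, `Action`, and `Action Input` lines and the caller running the tool and appending an `Observation`. The sketch below builds a prompt in the widely used LangChain-style ReAct format and parses the model's tool call. The template wording, the example `search` tool, and the helper names are assumptions for illustration, not Qwen's exact prompt.

```python
import re
from typing import Dict, Optional, Tuple

REACT_TEMPLATE = """Answer the following questions as best you can. You have access to the following tools:

{tool_descs}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can be repeated zero or more times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: {query}"""

ACTION_RE = re.compile(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)", re.DOTALL)


def build_react_prompt(query: str, tools: Dict[str, str]) -> str:
    tool_descs = "\n".join(f"{name}: {desc}" for name, desc in tools.items())
    return REACT_TEMPLATE.format(
        tool_descs=tool_descs, tool_names=", ".join(tools), query=query
    )


def parse_action(model_output: str) -> Optional[Tuple[str, str]]:
    """Return (tool, input) if the model asked for a tool, else None."""
    if "Final Answer:" in model_output:
        return None
    match = ACTION_RE.search(model_output)
    if not match:
        return None
    # Drop anything the model may have hallucinated after the tool input.
    tool_input = match.group(2).split("\nObservation:")[0].strip()
    return match.group(1), tool_input
```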
[![](/Qwen/Qwen-14B-Chat/resolve/main/assets/react_showcase_001.png)](/Qwen/Qwen-14B-Chat/blob/main/assets/react_showcase_001.png) [![](/Qwen/Qwen-14B-Chat/resolve/main/assets/react_showcase_002.png)](/Qwen/Qwen-14B-Chat/blob/main/assets/react_showcase_002.png)

#### [](#code-interpreter)Code Interpreter


To assess Qwen's ability to use the Python Code Interpreter for tasks such as mathematical problem solving, data visualization, and other general-purpose tasks such as file handling and web scraping, we have created and open-sourced a benchmark specifically designed for evaluating these capabilities. You can find the benchmark at this [link](https://github.com/QwenLM/Qwen-Agent/tree/main/benchmark).

We have observed that Qwen performs well in terms of code executability and result accuracy when generating code:

Executable Rate of Generated Code (%)

| Model | Math | Visualization | General |
| :--- | :---: | :---: | :---: |
| GPT-4 | 91.9 | 85.9 | 82.8 |
| GPT-3.5 | 89.2 | 65.0 | 74.1 |
| LLaMA2-7B-Chat | 41.9 | 33.1 | 24.1 |
| LLaMA2-13B-Chat | 50.0 | 40.5 | 48.3 |
| CodeLLaMA-7B-Instruct | 85.1 | 54.0 | 70.7 |
| CodeLLaMA-13B-Instruct | 93.2 | 55.8 | 74.1 |
| InternLM-7B-Chat-v1.1 | 78.4 | 44.2 | 62.1 |
| InternLM-20B-Chat | 70.3 | 44.2 | 65.5 |
| Qwen-7B-Chat | 82.4 | 64.4 | 67.2 |
| Qwen-14B-Chat | 89.2 | 84.1 | 65.5 |

Accuracy of Code Execution Results (%)

| Model | Math | Visualization-Hard | Visualization-Easy |
| :--- | :---: | :---: | :---: |
| GPT-4 | 82.8 | 66.7 | 60.8 |
| GPT-3.5 | 47.3 | 33.3 | 55.7 |
| LLaMA2-7B-Chat | 3.9 | 14.3 | 39.2 |
| LLaMA2-13B-Chat | 8.3 | 8.3 | 40.5 |
| CodeLLaMA-7B-Instruct | 14.3 | 26.2 | 60.8 |
| CodeLLaMA-13B-Instruct | 28.2 | 27.4 | 62.0 |
| InternLM-7B-Chat-v1.1 | 28.5 | 4.8 | 40.5 |
| InternLM-20B-Chat | 34.6 | 21.4 | 45.6 |
| Qwen-7B-Chat | 41.9 | 40.5 | 54.4 |
| Qwen-14B-Chat | 58.4 | 53.6 | 59.5 |

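The executable-rate metric in the first table can be illustrated with a toy harness: run each generated snippet and count the share that finishes without raising an exception. This is a hypothetical sketch, not the benchmark's code; the real benchmark also checks result accuracy, and untrusted model output should be run in a sandbox rather than with `exec` in your own process.

```python
from typing import List


def runs_cleanly(code: str) -> bool:
    """True if the snippet compiles and executes without raising (no sandboxing!)."""
    try:
        exec(compile(code, "<generated>", "exec"), {"__name__": "__generated__"})
    except Exception:  # SyntaxError, NameError, ZeroDivisionError, ...
        return False
    return True


def executable_rate(snippets: List[str]) -> float:
    """Percentage of snippets that run without error."""
    if not snippets:
        return 0.0
    ok = sum(runs_cleanly(s) for s in snippets)
    return 100.0 * ok / len(snippets)


samples = [
    "total = sum(range(10))",           # runs
    "import math\nx = math.sqrt(16)",   # runs
    "print(undefined_name)",            # NameError
    "def f(:\n    pass",                # SyntaxError
]
print(f"{executable_rate(samples):.1f}%")  # 50.0%
```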
  
![](/Qwen/Qwen-14B-Chat/resolve/main/assets/code_interpreter_showcase_001.jpg)  

#### [](#huggingface-agent)Huggingface Agent


Qwen-Chat also has the capability to be used as a [HuggingFace Agent](https://huggingface.co/docs/transformers/transformers_agents). Its performance on the run-mode benchmark provided by HuggingFace is as follows:

HuggingFace Agent Benchmark - Run Mode

| Model | Tool Selection | Tool Used | Code |
| :--- | :---: | :---: | :---: |
| GPT-4 | 100 | 100 | 97.4 |
| GPT-3.5 | 95.4 | 96.3 | 87.0 |
| StarCoder-Base-15B | 86.1 | 87.0 | 68.9 |
| StarCoder-15B | 87.0 | 88.0 | 68.9 |
| Qwen-7B-Chat | 87.0 | 87.0 | 71.5 |
| Qwen-14B-Chat | 93.5 | 94.4 | 87.0 |

HuggingFace Agent Benchmark - Chat Mode

| Model | Tool Selection | Tool Used | Code |
| :--- | :---: | :---: | :---: |
| GPT-4 | 97.9 | 97.9 | 98.5 |
| GPT-3.5 | 97.3 | 96.8 | 89.6 |
| StarCoder-Base-15B | 97.9 | 97.9 | 91.1 |
| StarCoder-15B | 97.9 | 97.9 | 89.6 |
| Qwen-7B-Chat | 94.7 | 94.7 | 85.1 |
| Qwen-14B-Chat | 97.9 | 97.9 | 95.5 |

  

[](#faq)FAQ
-----------


If you run into problems, please check the [FAQ](https://github.com/QwenLM/Qwen/blob/main/FAQ.md) and existing issues for a solution before opening a new issue.  

[](#-citation)Citation
-----------------------------



If you find our work helpful, feel free to cite us.

    @article{qwen,
      title={Qwen Technical Report},
      author={Jinze Bai and Shuai Bai and Yunfei Chu and Zeyu Cui and Kai Dang and Xiaodong Deng and Yang Fan and Wenbin Ge and Yu Han and Fei Huang and Binyuan Hui and Luo Ji and Mei Li and Junyang Lin and Runji Lin and Dayiheng Liu and Gao Liu and Chengqiang Lu and Keming Lu and Jianxin Ma and Rui Men and Xingzhang Ren and Xuancheng Ren and Chuanqi Tan and Sinan Tan and Jianhong Tu and Peng Wang and Shijie Wang and Wei Wang and Shengguang Wu and Benfeng Xu and Jin Xu and An Yang and Hao Yang and Jian Yang and Shusheng Yang and Yang Yao and Bowen Yu and Hongyi Yuan and Zheng Yuan and Jianwei Zhang and Xingxuan Zhang and Yichang Zhang and Zhenru Zhang and Chang Zhou and Jingren Zhou and Xiaohuan Zhou and Tianhang Zhu},
      journal={arXiv preprint arXiv:2309.16609},
      year={2023}
    }
    

  

[](#license-agreement)License Agreement
---------------------------------------------------


Our code and checkpoints are open for research use and are also allowed for commercial use. Check [LICENSE](https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT) for more details about the license. If you have requirements for commercial use, please fill out the [form](https://dashscope.console.aliyun.com/openModelApply/Qwen-14B-Chat) to apply.  

Contact Us
-------------------------------------


If you would like to leave a message for our research or product team, join our Discord or WeChat groups! You are also welcome to send an email to [qianwen\_opensource@alibabacloud.com](mailto:qianwen_opensource@alibabacloud.com).

## Model overview

`Qwen-14B-Chat` is the 14B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-14B-Chat is a Transformer-based large language model that has been pretrained on a large volume of data, including web texts, books, and code. It has been further trained using alignment techniques to create an AI assistant with strong language understanding and generation capabilities.

Compared to the `Qwen-7B-Chat` model, Qwen-14B-Chat has roughly twice as many parameters, which generally lets it handle more complex tasks and generate more coherent and relevant responses. It outperforms other similarly sized models on a variety of benchmarks such as C-Eval, MMLU, and GSM8K.

## Model inputs and outputs

### Inputs
- Free-form text prompts, which can include instructions, questions, or open-ended statements.
- The model supports multi-turn dialogues, where the input can include the conversation history.
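
A multi-turn conversation has to be serialized into one prompt string before it reaches the model. The sketch below assumes the ChatML-style markers (`<|im_start|>`, `<|im_end|>`) that Qwen chat models use. In practice Qwen's own chat utilities build this string for you, and you should check the tokenizer documentation for the exact template. The helper `to_chatml` is hypothetical.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Serialize [{'role': ..., 'content': ...}, ...] into a ChatML-style prompt (assumed format)."""
    turns = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    prompt = "\n".join(turns)
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        prompt += "\n<|im_start|>assistant\n"
    return prompt

# Usage: the full history is resent on every turn so the model keeps context.
history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": "Hello! How can I help?"},
    {"role": "user", "content": "Summarize our chat so far."},
]
print(to_chatml(history))
```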

### Outputs
- Coherent, contextually relevant text responses generated by the model.
- The model can generate responses of varying length, from short single-sentence replies to longer multi-paragraph outputs.

## Capabilities

Qwen-14B-Chat has demonstrated strong performance on a wide range of tasks, including language understanding, reasoning, code generation, and tool usage. It achieves leading results among open models of similar size on benchmarks like C-Eval and MMLU.

The model also supports ReAct prompting, which lets it call external APIs and plugins. This allows it to handle complex, open-ended prompts that require external information, tools, or data.
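
ReAct prompting interleaves reasoning ("Thought"), tool calls ("Action" / "Action Input"), and tool results ("Observation"). The sketch below builds a generic ReAct-style prompt and parses a tool call out of a model reply. The template wording, the `search` tool, and all helper names are hypothetical, and Qwen's own ReAct examples on GitHub may use a different template.

```python
import json
import re

# Hypothetical tool registry; names and descriptions are illustrative only.
TOOLS = [
    {"name": "search", "description": "Look up a query on the web.", "parameters": {"query": "string"}},
]

REACT_TEMPLATE = """Answer the following question as best you can. You have access to these tools:

{tool_descs}

Use this format:

Question: the input question
Thought: think about what to do
Action: the tool to use, one of [{tool_names}]
Action Input: the tool input as JSON
Observation: the tool result
... (Thought/Action/Action Input/Observation can repeat)
Thought: I now know the final answer
Final Answer: the answer to the question

Question: {question}"""

# Captures the tool name and the JSON argument line that follows it.
ACTION_RE = re.compile(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)")


def build_react_prompt(question, tools=TOOLS):
    """Fill the template with tool descriptions and the user's question."""
    descs = "\n".join(
        f"{t['name']}: {t['description']} Parameters: {json.dumps(t['parameters'])}" for t in tools
    )
    names = ", ".join(t["name"] for t in tools)
    return REACT_TEMPLATE.format(tool_descs=descs, tool_names=names, question=question)


def parse_action(model_output):
    """Return (tool_name, args) if the reply requests a tool, else None."""
    match = ACTION_RE.search(model_output)
    if match is None:
        return None
    return match.group(1), json.loads(match.group(2))
```

In a full loop you would stop generation at `Observation:`, run the requested tool yourself, append its result as an `Observation:` line, and call the model again until it emits `Final Answer:`.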

## What can I use it for?

Given these capabilities, Qwen-14B-Chat can be a valuable tool for a variety of applications. Some potential use cases include:

- **Content generation**: The model can be used to generate high-quality text content such as articles, stories, or creative writing. Its strong language understanding and generation abilities make it well-suited for tasks like writing assistance, ideation, and summarization.

- **Conversational AI**: Qwen-14B-Chat's ability to engage in coherent, multi-turn dialogues makes it a promising candidate for building advanced chatbots and virtual assistants. Its ReAct prompting support also allows it to be integrated with other tools and services.

- **Task automation**: By leveraging the model's capabilities in areas like code generation, mathematical reasoning, and tool usage, it can be used to automate a variety of tasks that require language-based intelligence.

- **Research and experimentation**: As an open-source model, Qwen-14B-Chat provides a powerful platform for researchers and developers to explore the capabilities of large language models and experiment with new techniques and applications.

## Things to try

One interesting aspect of Qwen-14B-Chat is its strong performance on long-context tasks, thanks to the inclusion of techniques like NTK-aware interpolation and LogN attention scaling. Researchers and developers can experiment with using the model for tasks that require understanding and generating text with extended context, such as document summarization, long-form question answering, or multi-turn task-oriented dialogues.
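
The two long-context techniques named above can be sketched in a few lines. NTK-aware interpolation enlarges the RoPE base so that positions beyond the training window map onto familiar rotation frequencies. LogN attention scaling multiplies the query (equivalently, the attention logits) at position *n* by log base `train_len` of *n* once *n* exceeds the training length. The power-of-two schedule for `alpha` and the `train_len` values are assumptions that loosely follow Qwen's open-source modeling code; this is not a copy of it.

```python
import math


def dynamic_ntk_alpha(seq_len: int, train_len: int) -> int:
    """NTK scaling factor that steps up as the input outgrows the training length."""
    # Rounded-up number of doublings past train_len; alpha stays 1 inside the window.
    context_value = math.log2(seq_len / train_len) + 1
    return max(2 ** math.ceil(context_value) - 1, 1)


def rope_inv_freq(head_dim: int, base: float = 10000.0, alpha: float = 1.0) -> list:
    """RoPE inverse frequencies computed with an NTK-aware enlarged base."""
    scaled_base = base * alpha ** (head_dim / (head_dim - 2))
    return [1.0 / scaled_base ** (i / head_dim) for i in range(0, head_dim, 2)]


def logn_scale(position: int, train_len: int) -> float:
    """LogN factor for a 1-based position: log_{train_len}(position) beyond the window, else 1."""
    return math.log(position, train_len) if position > train_len else 1.0


# Example with a hypothetical 2K training window.
for n in (1024, 2048, 4096, 8192):
    print(n, dynamic_ntk_alpha(n, 2048), round(logn_scale(n, 2048), 3))
```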

Another intriguing area to explore is the model's ReAct prompting capabilities, which allow it to interact with external APIs and plugins. Users can try integrating Qwen-14B-Chat with a variety of tools and services to see how it can be leveraged for more complex, real-world applications that go beyond simple language generation.

Qwen-7B
===================

![](https://qianwen-res.oss-cn-beijing.aliyuncs.com/logo_qwen.jpg)

  

[Hugging Face](https://huggingface.co/Qwen) | [ModelScope](https://modelscope.cn/organization/qwen) | [Paper](https://arxiv.org/abs/2309.16609) | [Demo](https://modelscope.cn/studios/qwen/Qwen-7B-Chat-Demo/summary)  
[WeChat](https://github.com/QwenLM/Qwen/blob/main/assets/wechat.png) | [Discord](https://discord.gg/z3GAxXZ9Ce) | [API](https://dashscope.aliyun.com)

  

Introduction
-------------------------------------


**Qwen-7B** is the 7B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model pretrained on a large volume of data, including web texts, books, and code. Additionally, based on the pretrained Qwen-7B, we release Qwen-7B-Chat, a large-model-based AI assistant trained with alignment techniques. We have since updated both the pretrained and chat models for better performance. This repository hosts the Qwen-7B base language model.

The features of Qwen-7B include:

1.  **Large-scale high-quality training corpora**: It is pretrained on over 2.4 trillion tokens, including Chinese, English, multilingual texts, code, and mathematics, covering general and professional fields. The distribution of the pre-training corpus has been optimized through a large number of ablation experiments.
2.  **Competitive performance**: It significantly surpasses existing open-source models of similar scale on multiple Chinese and English downstream evaluation tasks (including commonsense, reasoning, code, mathematics, etc.), and even surpasses some larger-scale models in several benchmarks. See below for specific evaluation results.
3.  **More comprehensive vocabulary coverage**: Compared with other open-source models based on Chinese and English vocabularies, Qwen-7B uses a vocabulary of over 150K tokens. This vocabulary is friendlier to multiple languages, so users can further enhance the model's capability in specific languages without expanding the vocabulary.

For more details about Qwen, please refer to the [GitHub](https://github.com/QwenLM/Qwen) code repository.  

Requirements
-------------------------------------

*   python 3.8 and above
*   pytorch 1.12 and above, 2.0 and above are recommended
*   CUDA 11.4 and above are recommended (this is for GPU users, flash-attention users, etc.)  
    

Dependency
-----------------------------------


To run Qwen-7B, please make sure you meet the above requirements, and then execute the following pip commands to install the dependent libraries.

    pip install transformers==4.32.0 accelerate tiktoken einops scipy transformers_stream_generator==0.0.4 peft deepspeed
    


In addition, it is recommended to install the `flash-attention` library (**we support flash attention 2 now.**) for higher efficiency and lower memory usage.

    git clone https://github.com/Dao-AILab/flash-attention
    cd flash-attention && pip install .
    # Below are optional. Installing them might be slow.
    # pip install csrc/layer_norm
    # pip install csrc/rotary
    

  

Quickstart
-------------------------------------



You can easily call the model with the following code:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from transformers.generation import GenerationConfig
    
    # Note: The default behavior now has injection attack prevention off.
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
    
    # use bf16
    # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, bf16=True).eval()
    # use fp16
    # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, fp16=True).eval()
    # use cpu only
    # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="cpu", trust_remote_code=True).eval()
    # use auto mode, automatically select precision based on the device.
    model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True).eval()
    
    # Specify hyperparameters for generation. But if you use transformers>=4.32.0, there is no need to do this.
    # model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
    
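    # Few-shot completion prompt: the base model continues the pattern
    inputs = tokenizer('The capital of Mongolia is Ulaanbaatar\nThe capital of Iceland is Reykjavik\nThe capital of Ethiopia is', return_tensors='pt')
    inputs = inputs.to(model.device)
    pred = model.generate(**inputs)
    print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
    # The capital of Mongolia is Ulaanbaatar\nThe capital of Iceland is Reykjavik\nThe capital of Ethiopia is Addis Ababa...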
    


For more information, please refer to our [GitHub repo](https://github.com/QwenLM/Qwen).

Tokenizer
-----------------------


Our tiktoken-based tokenizer differs from other tokenizers, such as SentencePiece tokenizers. Pay attention to special tokens, especially during fine-tuning. For more detailed information on the tokenizer and its use in fine-tuning, please refer to the [documentation](https://github.com/QwenLM/Qwen/blob/main/tokenization_note.md).

Model
---------------------------


The details of the model architecture of Qwen-7B are listed as follows.

| Hyperparameter | Value |
| :--- | ---: |
| n\_layers | 32 |
| n\_heads | 32 |
| d\_model | 4096 |
| vocab size | 151851 |
| sequence length | 8192 |
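
Two quantities follow directly from the table above: the per-head dimension and the size of the token-embedding matrix. This is a back-of-the-envelope sketch. It assumes that `d_model` is split evenly across heads and counts only the input embedding matrix, not the full model.

```python
# Hyperparameters copied from the Qwen-7B table above.
N_HEADS = 32
D_MODEL = 4096
VOCAB_SIZE = 151851

# Each attention head works on an equal slice of the hidden state.
head_dim = D_MODEL // N_HEADS

# The token-embedding matrix has one d_model-wide row per vocabulary entry.
embedding_params = VOCAB_SIZE * D_MODEL

print(head_dim)                 # 128
print(f"{embedding_params:,}")  # 621,981,696
```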


![](/Qwen/Qwen-7B/resolve/main/assets/tokenizer.png)

For position encoding, FFN activation function, and normalization methods, we adopt the prevalent practices, i.e., RoPE relative position encoding, SwiGLU for activation function, and RMSNorm for normalization (optional installation of flash-attention for acceleration).
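
Two of the components named above, RMSNorm and the SwiGLU feed-forward block, are sketched below in pure Python using their standard formulations. The weight shapes and the toy identity matrices are illustrative assumptions. A real model runs these operations as batched tensor ops, and this is not Qwen's implementation.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide by the root-mean-square of x (no mean subtraction), then apply a learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def silu(v):
    """SiLU (Swish) activation, v * sigmoid(v), written to avoid overflow in exp."""
    if v >= 0:
        return v / (1.0 + math.exp(-v))
    e = math.exp(v)
    return v * e / (1.0 + e)


def matvec(matrix, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: W_down(SiLU(W_gate x) * (W_up x)), with an elementwise product."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])


# Toy 2-dimensional example: normalize, then apply the FFN with identity weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
h = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
print(swiglu_ffn(h, identity, identity, identity))
```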

For tokenization, compared with current mainstream open-source models based on Chinese and English vocabularies, Qwen-7B uses a vocabulary of over 150K tokens. The vocabulary builds on `cl100k_base`, the BPE vocabulary used by GPT-4, and is optimized for Chinese and other languages. It encodes Chinese, English, and code efficiently and is friendlier to many other languages, so users can enhance the model's capability in specific languages without expanding the vocabulary. It splits numbers into single digits and uses the [tiktoken](https://github.com/openai/tiktoken) library for efficient tokenization.
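
Splitting numbers into single digits happens at the pre-tokenization stage, before any BPE merges are applied. The sketch below uses a simplified, hypothetical regex built with Python's standard `re` module to show the effect. It is not Qwen's actual pattern, and the real pipeline then applies BPE merges through tiktoken.

```python
import re

# Simplified, hypothetical pre-tokenization pattern (not Qwen's real regex):
# letter runs, single digits, punctuation runs (optionally led by a space), whitespace.
PRETOKENIZE = re.compile(r"[^\W\d_]+|\d| ?[^\w\s]+|\s+|_")


def pretokenize(text: str) -> list:
    """Split text into pre-tokens, emitting every digit as its own piece."""
    return PRETOKENIZE.findall(text)


print(pretokenize("Qwen-7B has 32 layers"))
# ['Qwen', '-', '7', 'B', ' ', 'has', ' ', '3', '2', ' ', 'layers']
```

Because every digit becomes its own piece, "32" and "2024" never merge into multi-digit tokens. This keeps the representation of numbers uniform.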

We randomly sampled 1 million documents in each language to test and compare the encoding compression rates of different models, using XLM-R (which supports 100 languages) as the baseline value of 1. The results are shown in the figure above.

As the figure shows, while keeping Chinese, English, and code efficient, Qwen-7B also achieves high compression rates for many other languages (such as th, he, ar, ko, vi, ja, tr, id, pl, ru, nl, pt, it, de, es, and fr). This gives the model strong scalability as well as high training and inference efficiency in these languages.

After deduplication and filtering, the pretraining corpus exceeds 2.4T tokens, covering web text, encyclopedias, books, code, mathematics, and various other domains.

Evaluation
-------------------------------------


We selected MMLU, C-Eval, GSM8K, MATH, HumanEval, MBPP, BBH, and CMMLU, which are currently popular benchmarks, to test the model's Chinese and English knowledge, translation, mathematical reasoning, coding, and other capabilities. The comprehensive evaluation results below show that the Qwen models outperform similarly sized open-source models on all tasks.

| Model | MMLU | C-Eval | GSM8K | MATH | HumanEval | MBPP | BBH | CMMLU |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  | 5-shot | 5-shot | 8-shot | 4-shot | 0-shot | 3-shot | 3-shot | 5-shot |
| LLaMA2-7B | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 31.8 |
| LLaMA2-13B | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 38.4 |
| LLaMA2-34B | 62.6 | - | 42.2 | 6.2 | 22.6 | 33.0 | 44.1 | - |
| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 6.5 | - | - | 33.7 | - |
| InternLM-7B | 51.0 | 53.4 | 31.2 | 6.3 | 10.4 | 14.0 | 37.0 | 51.8 |
| InternLM-20B | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 |
| Baichuan2-7B | 54.7 | 56.3 | 24.6 | 5.6 | 18.3 | 24.2 | 41.6 | 57.1 |
| Baichuan2-13B | 59.5 | 59.0 | 52.8 | 10.1 | 17.1 | 30.2 | 49.0 | 62.0 |
| Qwen-7B (original) | 56.7 | 59.6 | 51.6 | - | 24.4 | 31.2 | 40.6 | 58.8 |
| **Qwen-7B** | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 |
| **Qwen-14B** | **66.3** | **72.1** | **61.3** | **24.8** | **32.3** | **40.8** | **53.4** | **71.0** |

### Long-Context Evaluation

We introduce NTK-aware interpolation, LogN attention scaling, window attention, and related techniques to extend the context length of Qwen-7B (original) and Qwen-14B from 2K to over 8K tokens, and that of Qwen-7B from 8K to 32K tokens. We evaluate language-modeling perplexity (PPL) on the arXiv dataset at different sequence lengths. The results are shown below:

**(To use NTK interpolation and LogN scaling, please set `use_dynamic_ntk` and `use_logn_attn` to true in config.json.)**

| Model / Sequence Length | 1024 | 2048 | 4096 | 8192 | 16384 | 32768 |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| Qwen-7B (original) | 4.23 | 3.78 | 39.35 | 469.81 | 2645.09 | - |
| \+ dynamic\_ntk | 4.23 | 3.78 | 3.59 | 3.66 | 5.71 | - |
| \+ dynamic\_ntk + logn | 4.23 | 3.78 | 3.58 | 3.56 | 4.62 | - |
| \+ dynamic\_ntk + logn + window\_attn | 4.23 | 3.78 | 3.58 | 3.49 | 4.32 | - |
| Qwen-7B | **4.23** | **3.81** | **3.52** | **3.31** | 7.27 | 181.49 |
| \+ dynamic\_ntk + logn + window\_attn | **4.23** | **3.81** | **3.52** | **3.33** | **3.22** | **3.17** |
| Qwen-14B | **-** | **3.46** | 22.79 | 334.65 | 3168.35 | - |
| \+ dynamic\_ntk + logn + window\_attn | **-** | **3.46** | **3.29** | **3.18** | 3.42 | - |
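
The note before the table asks you to set `use_dynamic_ntk` and `use_logn_attn` in config.json. A minimal sketch of doing that for a locally downloaded checkpoint, using only the standard library, is shown below. The helper names are hypothetical, and the path in the usage comment is a placeholder.

```python
import json
from pathlib import Path


def set_long_context_flags(config: dict, enabled: bool = True) -> dict:
    """Return a copy of a Qwen config dict with the NTK and LogN switches set."""
    updated = dict(config)
    updated["use_dynamic_ntk"] = enabled
    updated["use_logn_attn"] = enabled
    return updated


def patch_config_file(path: str, enabled: bool = True) -> dict:
    """Rewrite a local config.json in place and return its new contents."""
    config_path = Path(path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    updated = set_long_context_flags(config, enabled)
    config_path.write_text(json.dumps(updated, indent=2, ensure_ascii=False) + "\n", encoding="utf-8")
    return updated


# Usage (placeholder path to your local checkpoint):
# patch_config_file("path/to/Qwen-7B/config.json")
```

All other keys in the file are preserved, so the same helper can switch the features off again with `enabled=False`.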

Reproduction
-----------------------------------------


We provide evaluation scripts to reproduce our model's performance; see [this link](https://github.com/QwenLM/Qwen/tree/main/eval) for details.

FAQ
-----------


If you run into problems, please check the [FAQ](https://github.com/QwenLM/Qwen/blob/main/FAQ.md) and the existing issues for a solution before opening a new issue.

Citation
-----------------------------



If you find our work helpful, feel free to cite us.

    @article{qwen,
      title={Qwen Technical Report},
      author={Jinze Bai and Shuai Bai and Yunfei Chu and Zeyu Cui and Kai Dang and Xiaodong Deng and Yang Fan and Wenbin Ge and Yu Han and Fei Huang and Binyuan Hui and Luo Ji and Mei Li and Junyang Lin and Runji Lin and Dayiheng Liu and Gao Liu and Chengqiang Lu and Keming Lu and Jianxin Ma and Rui Men and Xingzhang Ren and Xuancheng Ren and Chuanqi Tan and Sinan Tan and Jianhong Tu and Peng Wang and Shijie Wang and Wei Wang and Shengguang Wu and Benfeng Xu and Jin Xu and An Yang and Hao Yang and Jian Yang and Shusheng Yang and Yang Yao and Bowen Yu and Hongyi Yuan and Zheng Yuan and Jianwei Zhang and Xingxuan Zhang and Yichang Zhang and Zhenru Zhang and Chang Zhou and Jingren Zhou and Xiaohuan Zhou and Tianhang Zhu},
      journal={arXiv preprint arXiv:2309.16609},
      year={2023}
    }
    

  

License Agreement
---------------------------------------------------


Our code and checkpoints are open for research purposes and are also allowed for commercial use. Check [LICENSE](https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT) for more details about the license. If you have requirements for commercial use, please fill out the [form](https://dashscope.console.aliyun.com/openModelApply/qianwen) to apply.

Contact Us
-------------------------------------


If you would like to leave a message for our research or product team, join our Discord or WeChat groups! You are also welcome to send an email to [qianwen\_opensource@alibabacloud.com](mailto:qianwen_opensource@alibabacloud.com).

## Model overview

`Qwen-7B` is the 7B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model pretrained on a large volume of data, including web texts, books, and code. Additionally, based on the pretrained Qwen-7B, the maintainers release [Qwen-7B-Chat](https://aimodels.fyi/models/huggingFace/qwen-7b-chat-qwen), a large-model-based AI assistant trained with alignment techniques.

Qwen-7B significantly surpasses existing open-source models of similar scale on multiple Chinese and English downstream evaluation tasks, and even outperforms some larger-scale models in several benchmarks. Compared to other open-source models, Qwen-7B uses a more comprehensive vocabulary of over 150K tokens, which is more friendly to multiple languages.

## Model inputs and outputs

### Inputs
- **Text prompt**: Qwen-7B accepts text prompts as input to generate output text.

### Outputs
- **Generated text**: Qwen-7B generates relevant text based on the input prompt.

## Capabilities

Qwen-7B demonstrates strong performance on a variety of benchmarks, including commonsense reasoning, coding, mathematics, and more. The model is also capable of engaging in open-ended conversation through the [Qwen-7B-Chat](https://aimodels.fyi/models/huggingFace/qwen-7b-chat-qwen) version.

## What can I use it for?

Qwen-7B and Qwen-7B-Chat can be used for a wide range of natural language processing tasks, such as text generation, question answering, and language understanding. The large-scale pretraining and strong performance make these models suitable for tasks like content creation, customer service chatbots, and even code generation. The maintainers also provide an [API](https://dashscope.aliyun.com) for users to integrate the models into their applications.

## Things to try

Given Qwen-7B's strong performance on benchmarks, users can experiment with fine-tuning the model on specialized datasets to further enhance its capabilities for specific domains or tasks. The maintainers also provide [intermediate checkpoints](https://github.com/QwenLM/Qwen/blob/main/tokenization_note.md) during the pretraining process, which can be used to study the model's learning dynamics. Additionally, the quantized versions of Qwen-7B-Chat offer faster inference and lower memory usage, making them suitable for deployment in resource-constrained environments.

Qwen2-7B-Instruct
=======================================

Introduction
-----------------------------

Qwen2 is the new series of Qwen large language models. For Qwen2, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters, including a Mixture-of-Experts model. This repo contains the instruction-tuned 7B Qwen2 model.

Compared with state-of-the-art open-source language models, including the previously released Qwen1.5, Qwen2 has generally surpassed most open-source models and demonstrated competitiveness against proprietary models across a series of benchmarks targeting language understanding, language generation, multilingual capability, coding, mathematics, reasoning, etc.

Qwen2-7B-Instruct supports a context length of up to 131,072 tokens, enabling the processing of extensive inputs. Please refer to [this section](#processing-long-texts) for detailed instructions on how to deploy Qwen2 for handling long texts.

For more details, please refer to our [blog](https://qwenlm.github.io/blog/qwen2/), [GitHub](https://github.com/QwenLM/Qwen2), and [Documentation](https://qwen.readthedocs.io/en/latest/).  

Model Details
-------------------------------

Qwen2 is a language model series including decoder language models of different sizes. For each size, we release the base language model and the aligned chat model. It is based on the Transformer architecture with SwiGLU activation, attention QKV bias, grouped-query attention, etc. Additionally, we have an improved tokenizer that adapts to multiple natural languages and code.
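
Grouped-query attention (GQA), mentioned above, lets several query heads share one key/value head, which shrinks the KV cache during generation. The sketch below shows the head mapping and the cache-size arithmetic. The layer and head counts are hypothetical round numbers, not Qwen2-7B's published configuration.

```python
def kv_head_for_query_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Map a query head to the key/value head it shares under grouped-query attention."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be divisible by n_kv_heads")
    group_size = n_q_heads // n_kv_heads  # query heads per shared K/V head
    return q_head // group_size


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Bytes to cache keys and values (hence the factor 2) for one sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical sizes for illustration only.
N_LAYERS, N_Q_HEADS, N_KV_HEADS, HEAD_DIM = 32, 32, 8, 128

print([kv_head_for_query_head(h, N_Q_HEADS, N_KV_HEADS) for h in range(8)])
# [0, 0, 0, 0, 1, 1, 1, 1]
mha = kv_cache_bytes(N_LAYERS, N_Q_HEADS, HEAD_DIM, seq_len=32768)
gqa = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, seq_len=32768)
print(gqa / mha)  # 0.25
```

The cache shrinks by the ratio of KV heads to query heads. This matters most for long inputs, where the KV cache rather than the weights can dominate memory.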

Training details
-------------------------------------

We pretrained the models with a large amount of data, and we post-trained the models with both supervised finetuning and direct preference optimization.
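
For reference, the sketch below computes the standard per-pair direct preference optimization (DPO) loss from the DPO paper. It takes sequence-level log-probabilities as plain floats and returns -log sigmoid(beta * [(log pi(y_w|x) - log pi_ref(y_w|x)) - (log pi(y_l|x) - log pi_ref(y_l|x))]). This is the textbook objective, not Qwen2's training code, and `beta=0.1` is an illustrative default.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of responses.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)) == softplus(-margin)
    return softplus(-margin)


# When the policy equals the reference, the margin is 0 and the loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
```

The loss falls below log(2) as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference model.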

Requirements
-----------------------------

The code for Qwen2 is included in the latest Hugging Face transformers. We advise you to install `transformers>=4.37.0`; otherwise you might encounter the following error:

    KeyError: 'qwen2'
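
To catch this before loading the model, you can check the installed version programmatically. The sketch below uses only the standard library (`importlib.metadata`). The 4.37.0 threshold comes from the text above, and pre-release suffixes such as `dev0` are ignored for brevity. If `packaging` is available, `packaging.version.Version` is the more robust choice. The helper names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_TRANSFORMERS = (4, 37, 0)


def release_tuple(v: str) -> tuple:
    """Leading numeric release of a version string, padded to 3 parts: '4.41.2' -> (4, 41, 2)."""
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if not "0" <= ch <= "9":
                break
            digits += ch
        if not digits:
            break  # stop at suffixes such as 'dev0'
        parts.append(int(digits))
        if len(digits) < len(piece):
            break  # e.g. '0rc1': keep the numeric prefix, ignore the rest
    return tuple((parts + [0, 0, 0])[:3])


def supports_qwen2(v: str) -> bool:
    """True if this transformers version meets the >=4.37.0 requirement."""
    return release_tuple(v) >= MIN_TRANSFORMERS


try:
    installed = version("transformers")
    status = "OK" if supports_qwen2(installed) else "too old, upgrade to >=4.37.0"
    print(f"transformers {installed}: {status}")
except PackageNotFoundError:
    print("transformers is not installed")
```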
    

Quickstart
-------------------------

The following code snippet shows how to load the tokenizer and model and how to generate content, using `apply_chat_template` to format the conversation.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    device = "cuda" # the device to load the model onto
    
    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen2-7B-Instruct",
        torch_dtype="auto",
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")
    
    prompt = "Give me a short introduction to large language model."
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)
    
    # Pass input_ids together with the attention_mask
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=512
    )
    # Drop the prompt tokens so that only the newly generated text is decoded
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(response)
    

### Processing Long Texts

To handle extensive inputs exceeding 32,768 tokens, we utilize [YARN](https://arxiv.org/abs/2309.00071), a technique for enhancing model length extrapolation, to maintain performance on lengthy texts.

For deployment, we recommend using vLLM. You can enable the long-context capabilities by following these steps:

1.  **Install vLLM**: You can install vLLM by running the following command.

    pip install "vllm>=0.4.3"
    

Or you can install vLLM from [source](https://github.com/vllm-project/vllm/).

2.  **Configure Model Settings**: After downloading the model weights, modify the `config.json` file by adding the snippet below:
    
            {
                "architectures": [
                    "Qwen2ForCausalLM"
                ],
                // ...
                "vocab_size": 152064,
        
                // adding the following snippets
                "rope_scaling": {
                    "factor": 4.0,
                    "original_max_position_embeddings": 32768,
                    "type": "yarn"
                }
            }
        
    
    This snippet enables YARN to support longer contexts.
    
3.  **Model Deployment**: Utilize vLLM to deploy your model. For instance, you can set up an OpenAI-compatible server using the command:
    
        python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-7B-Instruct --model path/to/weights
        
    
    Then you can access the Chat API by:
    
        curl http://localhost:8000/v1/chat/completions \
            -H "Content-Type: application/json" \
            -d '{
            "model": "Qwen2-7B-Instruct",
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "Your Long Input Here."}
            ]
            }'
        
    
    For further usage instructions for vLLM, please refer to our [GitHub](https://github.com/QwenLM/Qwen2).
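
The same Chat API call can be made from Python with only the standard library. This sketch assumes the server from step 3 is running at `localhost:8000` with the served model name `Qwen2-7B-Instruct`. It also assumes the usual OpenAI-style response shape (`choices[0].message.content`). The helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(messages, model="Qwen2-7B-Instruct",
                       url="http://localhost:8000/v1/chat/completions"):
    """Build (but do not send) a POST request for an OpenAI-compatible chat endpoint."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(messages, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(messages, **kwargs)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # OpenAI-style responses put the reply under choices[0].message.content
    return payload["choices"][0]["message"]["content"]


# Usage, assuming the server is running locally:
# print(chat([
#     {"role": "system", "content": "You are a helpful assistant."},
#     {"role": "user", "content": "Your Long Input Here."},
# ]))
```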
    

**Note**: Presently, vLLM only supports static YARN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts**. We advise adding the `rope_scaling` configuration only when processing long contexts is required.
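
Because `rope_scaling` should only be present when long contexts are needed, it can help to toggle it programmatically. The minimal sketch below works on a config dict, which you would read from and write back to `config.json` with the `json` module. It also shows the arithmetic behind the context limit: a factor of 4.0 times the original 32,768 positions gives 131,072 tokens. The helper names are hypothetical.

```python
# Values copied from the rope_scaling snippet in step 2 above.
YARN_ROPE_SCALING = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}


def set_yarn(config: dict, enabled: bool) -> dict:
    """Return a copy of a Qwen2 config dict with YARN rope_scaling added or removed."""
    updated = dict(config)
    if enabled:
        updated["rope_scaling"] = dict(YARN_ROPE_SCALING)
    else:
        updated.pop("rope_scaling", None)
    return updated


def scaled_context_length(rope_scaling: dict) -> int:
    """Context window implied by the scaling config: factor x original window."""
    return int(rope_scaling["factor"] * rope_scaling["original_max_position_embeddings"])


print(scaled_context_length(YARN_ROPE_SCALING))  # 131072
```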

Evaluation
-------------------------

We briefly compare Qwen2-7B-Instruct with similar-sized instruction-tuned LLMs, including Qwen1.5-7B-Chat. The results are shown below:

| Datasets | Llama-3-8B-Instruct | Yi-1.5-9B-Chat | GLM-4-9B-Chat | Qwen1.5-7B-Chat | Qwen2-7B-Instruct |
| :--- | :---: | :---: | :---: | :---: | :---: |
| _**English**_ |  |  |  |  |  |
| MMLU | 68.4 | 69.5 | **72.4** | 59.5 | 70.5 |
| MMLU-Pro | 41.0 | - | - | 29.1 | **44.1** |
| GPQA | **34.2** | - | - | 27.8 | 25.3 |
| TheoremQA | 23.0 | - | - | 14.1 | **25.3** |
| MT-Bench | 8.05 | 8.20 | 8.35 | 7.60 | **8.41** |
| _**Coding**_ |  |  |  |  |  |
| HumanEval | 62.2 | 66.5 | 71.8 | 46.3 | **79.9** |
| MBPP | **67.9** | - | - | 48.9 | 67.2 |
| MultiPL-E | 48.5 | - | - | 27.2 | **59.1** |
| Evalplus | 60.9 | - | - | 44.8 | **70.3** |
| LiveCodeBench | 17.3 | - | - | 6.0 | **26.6** |
| _**Mathematics**_ |  |  |  |  |  |
| GSM8K | 79.6 | **84.8** | 79.6 | 60.3 | 82.3 |
| MATH | 30.0 | 47.7 | **50.6** | 23.2 | 49.6 |
| _**Chinese**_ |  |  |  |  |  |
| C-Eval | 45.9 | - | 75.6 | 67.3 | **77.2** |
| AlignBench | 6.20 | 6.90 | 7.01 | 6.20 | **7.21** |

Citation
---------------------

If you find our work helpful, feel free to cite us.

    @article{qwen2,
      title={Qwen2 Technical Report},
      year={2024}
    }

## Model overview

`Qwen2-7B-Instruct` is the 7-billion-parameter instruction-tuned language model from the Qwen2 series of large language models developed by Qwen. Compared with state-of-the-art open-source language models, including the previously released [Qwen1.5](https://aimodels.fyi/models/huggingFace/qwen15-7b-chat-qwen), the Qwen2 series has generally surpassed most of them across a range of benchmarks targeting language understanding, generation, multilingual capabilities, coding, mathematics, and reasoning.

The Qwen2 series includes models ranging from 0.5 to 72 billion parameters (up to [Qwen2-72B](https://aimodels.fyi/models/huggingFace/qwen2-72b-qwen)), with `Qwen2-7B-Instruct` being one of the smaller yet capable instruction-tuned variants. It is based on the Transformer architecture with enhancements like SwiGLU activation, attention QKV bias, and grouped-query attention. The model also uses an improved tokenizer that adapts to multiple natural languages and code.

## Model inputs and outputs

### Inputs
- **Text**: The model can take text inputs of up to 131,072 tokens, enabling processing of extensive inputs. Inputs beyond 32,768 tokens require enabling YARN rope scaling.

### Outputs
- **Text**: The model generates text outputs, which can be used for a variety of natural language tasks such as question answering, summarization, and creative writing.

## Capabilities

The `Qwen2-7B-Instruct` model has shown strong performance across a range of benchmarks, including language understanding (MMLU, MMLU-Pro, C-Eval), mathematics (GSM8K, MATH), coding (HumanEval, MBPP, LiveCodeBench), and chat quality (MT-Bench, AlignBench). The Qwen2 series as a whole has also demonstrated competitiveness against proprietary models.

## What can I use it for?

The `Qwen2-7B-Instruct` model can be used for a variety of natural language processing tasks, such as:

- **Question answering**: The model can be used to answer questions on a wide range of topics, drawing upon its broad knowledge base.
- **Summarization**: The model can be used to generate concise summaries of long-form text, such as articles or reports.
- **Creative writing**: The model can be used to generate original text, such as stories, poems, or scripts, with its strong language generation capabilities.
- **Coding assistance**: The model's coding knowledge can be leveraged to help with tasks like code generation, explanation, and debugging.

## Things to try

One interesting aspect of the `Qwen2-7B-Instruct` model is its ability to process long-form text inputs, thanks to its large context length of up to 131,072 tokens. This can be particularly useful for tasks that require understanding and reasoning over extensive information, such as academic papers, legal documents, or historical archives.

Another area to explore is the model's multilingual capabilities. As mentioned, the Qwen2 series, including the `Qwen2-7B-Instruct`, has been designed to be adaptive to multiple languages, which could make it a valuable tool for cross-lingual applications.

Qwen-72B
=====================

![](https://qianwen-res.oss-cn-beijing.aliyuncs.com/logo_qwen.jpg)

  

[Hugging Face](https://huggingface.co/Qwen) | [ModelScope](https://modelscope.cn/organization/qwen) | [Paper](https://arxiv.org/abs/2309.16609) | [Demo](https://modelscope.cn/studios/qwen/Qwen-72B-Chat-Demo/summary)  
[WeChat](https://github.com/QwenLM/Qwen/blob/main/assets/wechat.png) | [Discord](https://discord.gg/z3GAxXZ9Ce) | [API](https://dashscope.aliyun.com)

  

Introduction
-------------------------------------


**Qwen-72B** is the 72B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-72B is a Transformer-based large language model pretrained on a large volume of data, including web texts, books, and code. Additionally, based on the pretrained Qwen-72B, we release Qwen-72B-Chat, a large-model-based AI assistant trained with alignment techniques. This repository hosts Qwen-72B.

The features of Qwen-72B include:

1.  **Large-scale high-quality training corpora**: It is pretrained on over 3 trillion tokens, including Chinese, English, multilingual texts, code, and mathematics, covering general and professional fields. The distribution of the pre-training corpus has been optimized through a large number of ablation experiments.
2.  **Competitive performance**: It significantly surpasses existing open-source models on multiple Chinese and English downstream evaluation tasks (including commonsense, reasoning, code, mathematics, etc.). See below for specific evaluation results.
3.  **More comprehensive vocabulary coverage**: Compared with other open-source models based on Chinese and English vocabularies, Qwen-72B uses a vocabulary of over 150K tokens. This vocabulary is friendlier to multiple languages, so users can further enhance the model's capability in specific languages without expanding the vocabulary.
4.  **Longer context support**: Qwen-72B supports a context length of 32K tokens.

For more details about the open-source model of Qwen-72B, please refer to the [GitHub](https://github.com/QwenLM/Qwen) code repository.  

Requirements
-------------------------------------

*   python 3.8 and above
*   pytorch 1.12 and above, 2.0 and above are recommended
*   CUDA 11.4 and above are recommended (this is for GPU users, flash-attention users, etc.)
*   **To run Qwen-72B-Chat in bf16/fp16, at least 144GB of GPU memory is required (e.g., 2xA100-80G or 5xV100-32G). To run it in int4, at least 48GB of GPU memory is required (e.g., 1xA100-80G or 2xV100-32G).**
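
The memory figures above line up with simple arithmetic: 72 billion parameters at 2 bytes each (bf16/fp16) is 144 GB, and the GPU counts follow from dividing by per-card memory and rounding up. The sketch below assumes 1 GB = 10^9 bytes and ignores per-GPU overhead and uneven sharding. The int4 requirement (48 GB) exceeds the raw 36 GB of 4-bit weights; the gap presumably covers activations, the KV cache, and quantization metadata, though this is our assumption. The helper names are hypothetical.

```python
import math


def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def min_gpus(required_gb: float, gb_per_gpu: float) -> int:
    """Smallest number of identical GPUs whose combined memory covers required_gb."""
    return math.ceil(required_gb / gb_per_gpu)


print(weight_memory_gb(72e9, 16))            # 144.0 (bf16/fp16 weights)
print(weight_memory_gb(72e9, 4))             # 36.0 (int4 weights, before overhead)
print(min_gpus(144, 80), min_gpus(144, 32))  # 2 5
print(min_gpus(48, 80), min_gpus(48, 32))    # 1 2
```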
    
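As a rough sanity check on the memory figures above, the space needed just to hold the weights is the parameter count times the bytes per parameter. The sketch below assumes exactly 72 × 10^9 parameters and ignores activations, the KV cache, framework overhead, and any layers kept in higher precision, which is why the documented int4 requirement (48GB) sits above the raw weight estimate.

```python
# Back-of-envelope GPU memory needed just to hold the model weights.
# Assumption: exactly 72e9 parameters; activations, KV cache and framework
# overhead are ignored, so real requirements are higher.
NUM_PARAMS = 72e9

BYTES_PER_PARAM = {
    "bf16": 2.0,  # 16-bit floating point
    "fp16": 2.0,
    "int4": 0.5,  # 4 bits = half a byte
}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the weight footprint in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("bf16", "int4"):
    print(f"{dtype}: ~{weight_memory_gb(NUM_PARAMS, dtype):.0f} GB of weights")
# bf16: ~144 GB of weights
# int4: ~36 GB of weights
```

The bf16 estimate matches the 144GB requirement almost exactly, while the gap between 36GB and 48GB for int4 is headroom for everything the sketch leaves out.
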

Dependency
----------


To run Qwen-72B, please make sure you meet the above requirements, and then execute the following pip commands to install the dependent libraries.

    pip install transformers==4.32.0 accelerate tiktoken einops scipy transformers_stream_generator==0.0.4 peft deepspeed
    


In addition, it is recommended to install the `flash-attention` library (**we support flash attention 2 now.**) for higher efficiency and lower memory usage.

    git clone https://github.com/Dao-AILab/flash-attention
    cd flash-attention && pip install .
    # Below are optional. Installing them might be slow.
    # pip install csrc/layer_norm
    # If the version of flash-attn is higher than 2.1.1, the following is not needed.
    # pip install csrc/rotary
    

  

Quickstart
----------



You can easily call the model with the following code:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from transformers.generation import GenerationConfig
    
    # Note: The default behavior now has injection attack prevention off.
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-72B", trust_remote_code=True)
    
    # use bf16
    # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-72B", device_map="auto", trust_remote_code=True, bf16=True).eval()
    # use fp16
    # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-72B", device_map="auto", trust_remote_code=True, fp16=True).eval()
    # use cpu only
    # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-72B", device_map="cpu", trust_remote_code=True).eval()
    # use auto mode, automatically select precision based on the device.
    model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-72B", device_map="auto", trust_remote_code=True).eval()
    
    # Specify hyperparameters for generation. But if you use transformers>=4.32.0, there is no need to do this.
    # model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-72B", trust_remote_code=True)
    
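    inputs = tokenizer('The capital of Mongolia is Ulaanbaatar\nThe capital of Iceland is Reykjavik\nThe capital of Ethiopia is', return_tensors='pt')
    inputs = inputs.to(model.device)
    pred = model.generate(**inputs)
    print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
    # The capital of Mongolia is Ulaanbaatar\nThe capital of Iceland is Reykjavik\nThe capital of Ethiopia is Addis Ababa...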
    


For more information, please refer to our [GitHub repo](https://github.com/QwenLM/Qwen).  

Tokenizer
---------


Our tokenizer based on tiktoken is different from other tokenizers, e.g., sentencepiece tokenizer. You need to pay attention to special tokens, especially in finetuning. For more detailed information on the tokenizer and related use in fine-tuning, please refer to the [documentation](https://github.com/QwenLM/Qwen/blob/main/tokenization_note.md).  

Model
-----


The details of the model architecture of Qwen-72B are listed as follows:

| Hyperparameter | Value |
|:---|---:|
| n\_layers | 80 |
| n\_heads | 64 |
| d\_model | 8192 |
| vocab size | 151851 |
| sequence length | 32768 |

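The architecture table also lets you estimate how much memory the attention key/value cache needs at long context lengths. The sketch below assumes standard multi-head attention, i.e. every layer caches one key vector and one value vector of width d\_model per token (the table lists no separate key/value head count), stored in 16-bit precision. Treat the result as an estimate, not a measured figure.

```python
# KV-cache size implied by the architecture table above.
# Assumptions: standard multi-head attention (keys and values each span the
# full d_model in every layer) and 16-bit (2-byte) cache entries.
N_LAYERS = 80
D_MODEL = 8192
N_HEADS = 64
BYTES_PER_VALUE = 2  # bf16 / fp16

head_dim = D_MODEL // N_HEADS  # width of each attention head


def kv_cache_bytes(num_tokens: int) -> int:
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    per_token = 2 * N_LAYERS * D_MODEL * BYTES_PER_VALUE  # 2 = one K + one V
    return per_token * num_tokens


print(head_dim)                        # 128
print(kv_cache_bytes(1) / 2**20)       # 2.5   (MiB per token)
print(kv_cache_bytes(32768) / 2**30)   # 80.0  (GiB for a full 32k context)
```

Under these assumptions a single full-length 32k sequence needs on the order of 80 GiB for its cache alone, on top of the weights.
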

![](/Qwen/Qwen-72B/resolve/main/assets/tokenizer.png)

For position encoding, FFN activation function, and normalization methods, we adopt the prevalent practices, i.e., RoPE relative position encoding, SwiGLU for activation function, and RMSNorm for normalization (optional installation of flash-attention for acceleration).

For tokenization, compared to the current mainstream open-source models based on Chinese and English vocabularies, Qwen-72B uses a vocabulary of over 150K tokens. It is designed first for efficient encoding of Chinese, English, and code data, and is also friendlier to other languages, enabling users to directly enhance the capability for some languages without expanding the vocabulary. It splits numbers into single digits and calls the [tiktoken](https://github.com/openai/tiktoken) tokenizer library for efficient tokenization.
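
To make the single-digit rule concrete, the sketch below contrasts it with the number handling of the `cl100k_base` pre-tokenizer, which (to our understanding) groups runs of up to three digits. It uses Python's `re` module to isolate digit handling only; it is not Qwen's actual tokenizer, and `digit_pieces` is a hypothetical helper.

```python
import re
from typing import List

# cl100k_base-style rule (assumed): a run of digits is cut into pieces of up to three.
UP_TO_THREE_DIGITS = re.compile(r"\d{1,3}")
# Single-digit rule, as described above for Qwen-72B: every digit stands alone.
SINGLE_DIGIT = re.compile(r"\d")


def digit_pieces(text: str, pattern: re.Pattern) -> List[str]:
    """Return the digit pre-tokens that `pattern` extracts from `text`."""
    return pattern.findall(text)


print(digit_pieces("pi is 3.14159", UP_TO_THREE_DIGITS))  # ['3', '141', '59']
print(digit_pieces("pi is 3.14159", SINGLE_DIGIT))        # ['3', '1', '4', '1', '5', '9']
```

Splitting every digit means the same number always maps to a predictable token sequence, at the cost of more tokens per number.
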

We randomly selected 1 million documents in each language to test and compare the encoding compression rates of different models (with XLM-R, which supports 100 languages, as the base value 1). The results are shown in the figure above.

As can be seen, while ensuring efficient encoding of Chinese, English, and code, Qwen-72B also achieves a high compression rate for many other languages (such as th, he, ar, ko, vi, ja, tr, id, pl, ru, nl, pt, it, de, es, fr, etc.), giving the model strong scalability as well as high training and inference efficiency in these languages.

For pre-training data, Qwen-72B uses part of the open-source generic corpus on the one hand, and a massive amount of accumulated web corpus and high-quality text content on the other. After deduplication and filtering, the corpus exceeds 3T tokens, encompassing web text, encyclopedias, books, code, mathematics, and various domains.  

Evaluation
----------


We selected MMLU, C-Eval, GSM8K, MATH, HumanEval, MBPP, BBH, and CMMLU, which are currently popular benchmarks, to test the model's Chinese and English knowledge, translation, mathematical reasoning, coding, and other capabilities. The comprehensive evaluation results below show that the Qwen models outperform similarly sized open-source models on all tasks.

| Model | Avg | MMLU (5-shot) | C-Eval (5-shot) | GSM8K (8-shot) | MATH (4-shot) | HumanEval (0-shot) | MBPP (3-shot) | BBH (3-shot) | AGIEval (0-shot) | GaokaoBench (0-shot) | CMMLU (5-shot) |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| LLaMA2-7B | 24.4 | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 21.8 | 18.9 | 31.8 |
| LLaMA2-13B | 31.3 | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 30.9 | 18.2 | 38.4 |
| LLaMA2-70B | 45.7 | 69.7 | 50.1 | 63.5 | 12.0 | 26.2 | 39.6 | 64.9 | 54.2 | 23.3 | 53.6 |
| InternLM-20B | 47.2 | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 | 59.0 | 59.0 |
| Yi-34B | 58.0 | 76.3 | 81.8 | 67.9 | 15.9 | 26.2 | 38.2 | 66.4 | 56.5 | 68.3 | 82.6 |
| XVERSE-65B | - | 70.8 | 68.6 | 60.3 | - | 26.3 | - | - | - | - | - |
| **Qwen-7B** | 46.2 | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 45.3 | 62.5 | 62.2 |
| **Qwen-14B** | 52.7 | 66.3 | 72.1 | 61.3 | 24.8 | 32.3 | 40.8 | 53.4 | 51.9 | 52.7 | 71.0 |
| **Qwen-72B** | **66.4** | **77.4** | **83.3** | **78.9** | **35.2** | **35.4** | **52.2** | **67.7** | **62.5** | **87.6** | **83.6** |

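The Avg column is consistent with an unweighted mean over the ten benchmark columns. The sketch below recomputes it for two rows, with scores copied from the table; `average` is a hypothetical helper, and the equal weighting is our inference from the numbers rather than a documented rule.

```python
from math import fsum
from typing import List

# Per-benchmark scores copied from the table above, in column order:
# MMLU, C-Eval, GSM8K, MATH, HumanEval, MBPP, BBH, AGIEval, GaokaoBench, CMMLU
SCORES = {
    "LLaMA2-7B": [46.8, 32.5, 16.7, 3.3, 12.8, 20.8, 38.2, 21.8, 18.9, 31.8],
    "Qwen-72B": [77.4, 83.3, 78.9, 35.2, 35.4, 52.2, 67.7, 62.5, 87.6, 83.6],
}


def average(scores: List[float]) -> float:
    """Unweighted mean across benchmarks (assumed to be how Avg is computed)."""
    return fsum(scores) / len(scores)


for model, scores in SCORES.items():
    print(model, round(average(scores), 1))
# LLaMA2-7B 24.4
# Qwen-72B 66.4
```

Both recomputed values match the published Avg column after rounding to one decimal.
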
### Long-Context Evaluation


Qwen-72B extends the RoPE base and supports extrapolation to a 32k-token context. We use arXiv data for language-modeling evaluation. The perplexity (PPL; lower is better) results are as follows:

| Model | Sequence length 8192 | Sequence length 16384 | Sequence length 32768 |
|:---|---:|---:|---:|
| Qwen-72B | 2.828 | 2.734 | 2.717 |

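The intuition behind raising the RoPE base is that it lowers the rotation frequencies, so even the slowest-rotating dimension pairs do not wrap around within a long sequence. The sketch below computes the standard RoPE frequencies theta\_i = base^(-2i/d) with d = 128 (d\_model / n\_heads from the architecture table). The two base values are illustrative assumptions: 10,000 is the default from the original RoPE formulation, and 1,000,000 merely stands in for an extended base; neither is claimed to be Qwen-72B's configured value.

```python
import math
from typing import List

HEAD_DIM = 8192 // 64  # d_model / n_heads = 128


def rope_inv_freq(base: float, head_dim: int = HEAD_DIM) -> List[float]:
    """Standard RoPE angular frequencies theta_i = base ** (-2i / head_dim)."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(base: float) -> float:
    """Tokens needed for the slowest-rotating pair to complete one full turn."""
    return 2 * math.pi / min(rope_inv_freq(base))


# Illustrative bases only (see the lead-in): the original default and a larger one.
for base in (10_000, 1_000_000):
    print(f"base={base:>9,}: longest wavelength ~ {longest_wavelength(base):,.0f} tokens")
```

Running it shows the longest wavelength growing by roughly two orders of magnitude, from tens of thousands of tokens to millions.
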
Reproduction
------------


We provide evaluation scripts to reproduce the performance of our model; see [this link](https://github.com/QwenLM/Qwen/tree/main/eval) for details.  

FAQ
---


If you run into problems, please consult the [FAQ](https://github.com/QwenLM/Qwen/blob/main/FAQ.md) and the existing issues for a solution before opening a new issue.  

Citation
--------



If you find our work helpful, feel free to cite us.

    @article{qwen,
      title={Qwen Technical Report},
      author={Jinze Bai and Shuai Bai and Yunfei Chu and Zeyu Cui and Kai Dang and Xiaodong Deng and Yang Fan and Wenbin Ge and Yu Han and Fei Huang and Binyuan Hui and Luo Ji and Mei Li and Junyang Lin and Runji Lin and Dayiheng Liu and Gao Liu and Chengqiang Lu and Keming Lu and Jianxin Ma and Rui Men and Xingzhang Ren and Xuancheng Ren and Chuanqi Tan and Sinan Tan and Jianhong Tu and Peng Wang and Shijie Wang and Wei Wang and Shengguang Wu and Benfeng Xu and Jin Xu and An Yang and Hao Yang and Jian Yang and Shusheng Yang and Yang Yao and Bowen Yu and Hongyi Yuan and Zheng Yuan and Jianwei Zhang and Xingxuan Zhang and Yichang Zhang and Zhenru Zhang and Chang Zhou and Jingren Zhou and Xiaohuan Zhou and Tianhang Zhu},
      journal={arXiv preprint arXiv:2309.16609},
      year={2023}
    }
    

  

License Agreement
-----------------


Our code and checkpoints are open for research purposes, and commercial use is also allowed. Check [LICENSE](https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT) for more details about the license. If you intend to use them commercially, please fill out the [form](https://dashscope.console.aliyun.com/openModelApply/Qwen-72B-Chat) to apply.  

Contact Us
----------


If you would like to leave a message for our research team or product team, join our Discord or WeChat groups! Also, feel free to send an email to [qianwen\_opensource@alibabacloud.com](mailto:qianwen_opensource@alibabacloud.com).

## Model overview

`Qwen-72B` is the 72B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by [Alibaba Cloud](https://aimodels.fyi/creators/huggingFace/Qwen). Qwen-72B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, codes, etc. Additionally, based on the pretrained Qwen-72B, Qwen releases Qwen-72B-Chat, a large-model-based AI assistant, which is trained with alignment techniques.

Key features of Qwen-72B include:

1. **Large-scale high-quality training corpora**: It is pretrained on over 3 trillion tokens, including Chinese, English, multilingual texts, code, and mathematics, covering general and professional fields.
2. **Competitive performance**: It significantly surpasses existing open-source models on multiple Chinese and English downstream evaluation tasks.
3. **More comprehensive vocabulary coverage**: Compared to other models, Qwen-72B uses a vocabulary of over 150K tokens, allowing for more efficient encoding of multiple languages.
4. **Longer context support**: Qwen-72B supports a context length of up to 32k tokens.

## Model inputs and outputs

### Inputs
- **Text**: Qwen-72B accepts text inputs in a variety of languages, including Chinese, English, and others.

### Outputs
- **Text**: Qwen-72B generates fluent and coherent text outputs in response to the input, drawing upon its broad knowledge base.
- **Code**: In addition to natural language, Qwen-72B can also generate code in various programming languages.

## Capabilities

Qwen-72B demonstrates strong performance on a wide range of tasks, including commonsense reasoning, language understanding, mathematical problem-solving, and code generation. For example, on benchmarks like MMLU, C-Eval, and HumanEval it scores higher than every other open-source model in the evaluation table, including similarly sized models such as LLaMA2-70B and XVERSE-65B.

## What can I use it for?

With its broad capabilities, Qwen-72B can be leveraged for a variety of applications, such as:

- **Content creation**: Generating high-quality text, articles, stories, and dialogues in multiple languages.
- **Conversational AI**: Powering intelligent chatbots and virtual assistants with advanced language understanding and generation abilities.
- **Code generation and programming**: Assisting developers with tasks like code completion, refactoring, and even full-fledged program generation.
- **Multilingual applications**: Developing multilingual applications that can seamlessly handle and translate between various languages.

## Things to try

One interesting aspect of Qwen-72B is its ability to handle long-form text and extended context. You could try generating coherent and relevant output based on lengthy prompts or multi-turn dialogues, exploring how the model maintains context and produces consistent responses over time.

Another area to experiment with is the model's code generation capabilities. You could provide Qwen-72B with programming prompts or partially completed code snippets and observe how it can extend and refine the code to solve specific tasks or implement desired functionalities.

Qwen-VL-Chat
============

  

![](https://qianwen-res.oss-cn-beijing.aliyuncs.com/logo_vl.jpg)

  

Qwen-VL: [Hugging Face](https://huggingface.co/Qwen/Qwen-VL) | [ModelScope](https://modelscope.cn/models/qwen/Qwen-VL/summary)  
Qwen-VL-Chat: [Hugging Face](https://huggingface.co/Qwen/Qwen-VL-Chat) | [ModelScope](https://modelscope.cn/models/qwen/Qwen-VL-Chat/summary) (Int4: [Hugging Face](https://huggingface.co/Qwen/Qwen-VL-Chat-Int4) | [ModelScope](https://modelscope.cn/models/qwen/Qwen-VL-Chat-Int4/summary))  
Qwen-VL-Plus: [Hugging Face demo](https://huggingface.co/spaces/Qwen/Qwen-VL-Plus) | [ModelScope demo](https://modelscope.cn/studios/qwen/Qwen-VL-Chat-Demo/summary)  
Qwen-VL-Max: [Hugging Face demo](https://huggingface.co/spaces/Qwen/Qwen-VL-Max) | [ModelScope demo](https://modelscope.cn/studios/qwen/Qwen-VL-Max/summary)  
[Web](https://tongyi.aliyun.com/qianwen) | [API](https://help.aliyun.com/zh/dashscope/developer-reference/vl-plus-quick-start) | [WeChat](assets/wechat.png) | [Discord](https://discord.gg/z3GAxXZ9Ce) | [Paper](https://arxiv.org/abs/2308.12966) | [Tutorial](TUTORIAL.md)

  


**Qwen-VL** (Qwen Large Vision Language Model) is the visual multimodal version of the large model series Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-VL accepts images, text, and bounding boxes as inputs, and outputs text and bounding boxes.


We release Qwen-VL and Qwen-VL-Chat, the pretrained model and the chat model, respectively. For more details about Qwen-VL, please refer to our [technical memo](https://github.com/QwenLM/Qwen-VL/blob/master/visual_memo.md). This repository is the one for Qwen-VL-Chat.  

Requirements
------------

*   python 3.8 and above
*   pytorch 1.12 and above, 2.0 and above are recommended
*   CUDA 11.4 and above are recommended (this is for GPU users)  
    

Quickstart
----------




Below, we provide simple examples showing how to use Qwen-VL-Chat with Hugging Face Transformers.

Before running the code, make sure you have set up the environment and meet the above requirements, then install the dependent libraries:

    pip install -r requirements.txt
    


Now you can start with Transformers. For more on using the vision encoder, please refer to the [tutorial](/Qwen/Qwen-VL-Chat/blob/main/TUTORIAL.md).

#### Transformers

To use Qwen-VL-Chat for inference, all you need is a few lines of code, as demonstrated below. However, **please make sure that you are using the latest code.**

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from transformers.generation import GenerationConfig
    import torch
    torch.manual_seed(1234)
    
    # Note: The default behavior now has injection attack prevention off.
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
    
    # use bf16
    # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
    # use fp16
    # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
    # use cpu only
    # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cpu", trust_remote_code=True).eval()
    # use cuda device
    model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval()
    
    # Specify hyperparameters for generation (No need to do this if you are using transformers>=4.32.0)
    # model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
    
    # 1st dialogue turn
    query = tokenizer.from_list_format([
        {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'},  # a local path or a URL
        {'text': 'What is this?'},
    ])
    response, history = model.chat(tokenizer, query=query, history=None)
    print(response)
    # prints the model's description of the image
    
    # 2nd dialogue turn
    response, history = model.chat(tokenizer, 'Output the detection box for "high five"', history=history)
    print(response)
    # e.g. <ref>high five</ref><box>(517,508),(589,611)</box>
    image = tokenizer.draw_bbox_on_latest_picture(response, history)
    if image:
      image.save('1.jpg')
    else:
      print("no box")
    

![](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo_highfive.jpg)

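The first dialogue turn above builds its query with `tokenizer.from_list_format`. The sketch below is a hypothetical stand-alone re-implementation (`build_query` is our name, not part of the library) that assumes each image is rendered as `Picture N: <img>...</img>` followed by a newline and each text element is appended verbatim; check the repository's tokenizer code for the authoritative format.

```python
from typing import Dict, List


def build_query(elements: List[Dict[str, str]]) -> str:
    """Hypothetical stand-in for tokenizer.from_list_format (assumed format)."""
    query = ""
    num_images = 0
    for element in elements:
        if "image" in element:
            num_images += 1
            # Images are referenced by local path or URL wrapped in <img></img> tags.
            query += f"Picture {num_images}: <img>{element['image']}</img>\n"
        elif "text" in element:
            query += element["text"]
        else:
            raise ValueError(f"Unsupported element: {element}")
    return query


print(build_query([
    {"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
    {"text": "What is this?"},
]))
```

Numbering the images lets later turns refer to a specific picture when several are in the conversation.
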
  

Quantization
------------

### Usage


We provide a new solution based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) and release an Int4 quantized model for Qwen-VL-Chat, Qwen-VL-Chat-Int4 ([click here](https://huggingface.co/Qwen/Qwen-VL-Chat-Int4)), which achieves nearly lossless model quality while reducing memory cost and improving inference speed.

Here we demonstrate how to use our provided quantized models for inference. Before you start, make sure you meet the requirements (e.g., torch 2.0 and above, transformers 4.32.0 and above, etc.) and install the required packages:

    pip install optimum
    git clone https://github.com/JustinLin610/AutoGPTQ.git && cd AutoGPTQ
    pip install -v .
    




If you meet problems installing `auto-gptq`, we advise you to check out the official [repo](https://github.com/PanQiWei/AutoGPTQ) to find a wheel.

Then you can load the quantized model and run inference as usual:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat-Int4", trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen-VL-Chat-Int4",
        device_map="auto",
        trust_remote_code=True
    ).eval()
    # Either a local path or a URL between <img></img> tags.
    image_path = 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'
    response, history = model.chat(tokenizer, query=f'<img>{image_path}</img>', history=None)
    print(response)
    

### Performance


We illustrate the model performance of both BF16 and Int4 models on the benchmark **[TouchStone](https://github.com/OFA-Sys/TouchStone)**, and we find that the quantized model does not suffer from significant performance degradation. Results are shown below:

| Quantization | ZH | EN |
|:---|---:|---:|
| BF16 | 401.2 | 645.2 |
| Int4 | 386.6 | 651.4 |

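To put a number on "no significant degradation", the sketch below computes the relative change from BF16 to Int4 for each TouchStone score in the table above (values copied verbatim; `relative_change` is a hypothetical helper).

```python
# TouchStone scores copied from the table above.
BF16 = {"ZH": 401.2, "EN": 645.2}
INT4 = {"ZH": 386.6, "EN": 651.4}


def relative_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100


for lang in ("ZH", "EN"):
    print(f"{lang}: {relative_change(BF16[lang], INT4[lang]):+.1f}%")
# ZH: -3.6%
# EN: +1.0%
```

The Chinese score drops by a few percent while the English score slightly improves, so the net effect of Int4 quantization on this benchmark is small.
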
### Inference Speed


We measured the average inference speed (tokens/s) of generating 1790 (2048-258) and 7934 (8192-258) tokens with an image as context (the image takes 258 tokens), under BF16 precision and Int4 quantization, respectively.

| Quantization | Speed at 2048 tokens (tokens/s) | Speed at 8192 tokens (tokens/s) |
|:---|---:|---:|
| BF16 | 28.87 | 24.32 |
| Int4 | 37.79 | 34.34 |


The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4.

### GPU Memory Usage


We also profiled the peak GPU memory usage for encoding 1790 (2048-258) tokens (including an image) as context and generating a single token, and for generating 7934 (8192-258) tokens with an image as context, under BF16 and Int4 quantization, respectively. The results are shown below.

| Quantization | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
|:---|---:|---:|
| BF16 | 22.60GB | 28.01GB |
| Int4 | 11.82GB | 17.23GB |

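The speed and memory tables can be summarized as Int4-versus-BF16 ratios. The sketch below pairs the columns by sequence length; note that the 2048-token memory figure measures encoding while the speed figure measures generation, so the pairing is approximate. `int4_vs_bf16` is a hypothetical helper.

```python
from typing import Tuple

# Measurements copied from the speed and memory tables above.
SPEED_TOKENS_PER_S = {
    "2048": {"BF16": 28.87, "Int4": 37.79},
    "8192": {"BF16": 24.32, "Int4": 34.34},
}
PEAK_MEMORY_GB = {
    "2048": {"BF16": 22.60, "Int4": 11.82},  # encoding 2048 tokens
    "8192": {"BF16": 28.01, "Int4": 17.23},  # generating 8192 tokens
}


def int4_vs_bf16(length: str) -> Tuple[float, float]:
    """Return (speedup, memory ratio) of Int4 relative to BF16."""
    speed = SPEED_TOKENS_PER_S[length]
    memory = PEAK_MEMORY_GB[length]
    return speed["Int4"] / speed["BF16"], memory["Int4"] / memory["BF16"]


for length in ("2048", "8192"):
    speedup, memory_ratio = int4_vs_bf16(length)
    print(f"{length} tokens: Int4 is {speedup:.2f}x as fast, "
          f"using {memory_ratio:.0%} of BF16 peak memory")
# 2048 tokens: Int4 is 1.31x as fast, using 52% of BF16 peak memory
# 8192 tokens: Int4 is 1.41x as fast, using 62% of BF16 peak memory
```
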

The above speed and memory profiling are conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile_mm.py).  

Evaluation
----------






We evaluated the model's ability from two perspectives:

1.  **Standard Benchmarks**: We evaluate the model's basic task capabilities on four major categories of multimodal tasks:
    
    *   Zero-shot Caption: Evaluate the model's zero-shot image captioning ability on unseen datasets;
    *   General VQA: Evaluate the general question-answering ability on images, covering yes/no judgment, color, counting, category, etc.;
    *   Text-based VQA: Evaluate the model's ability to recognize text in pictures, such as document QA, chart QA, etc;
    *   Referring Expression Comprehension: Evaluate the ability to localize a target object in an image described by a referring expression.
2.  **TouchStone**: To evaluate the overall text-image dialogue capability and alignment level with humans, we have constructed a benchmark called TouchStone, which uses GPT4 scoring to evaluate LVLMs.
    
    *   The TouchStone benchmark covers a total of 300+ images, 800+ questions, and 27 categories, such as attribute-based Q&A, celebrity recognition, writing poetry, summarizing multiple images, product comparison, and math problem solving;
    *   Because GPT4 cannot take images as direct input, TouchStone provides fine-grained, human-labeled image annotations. These annotations, along with the questions and the model's output, are then presented to GPT4 for scoring.
    *   The benchmark includes both English and Chinese versions.

The results of the evaluation are as follows:

Qwen-VL outperforms current SOTA generalist models on multiple VL tasks and has a more comprehensive coverage in terms of capability range.

![](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/radar.png)

### Zero-shot Captioning & General VQA

| Model type | Model | NoCaps (0-shot caption) | Flickr30K (0-shot caption) | VQAv2 (dev) | OK-VQA | GQA | SciQA-Img (0-shot) | VizWiz (0-shot) |
|:---|:---|---:|---:|---:|---:|---:|---:|---:|
| Generalist Models | Flamingo-9B | - | 61.5 | 51.8 | 44.7 | - | - | 28.8 |
| | Flamingo-80B | - | 67.2 | 56.3 | 50.6 | - | - | 31.6 |
| | Unified-IO-XL | 100.0 | - | 77.9 | 54.0 | - | - | - |
| | Kosmos-1 | - | 67.1 | 51.0 | - | - | - | 29.2 |
| | Kosmos-2 | - | 66.7 | 45.6 | - | - | - | - |
| | BLIP-2 (Vicuna-13B) | 103.9 | 71.6 | 65.0 | 45.9 | 32.3 | 61.0 | 19.6 |
| | InstructBLIP (Vicuna-13B) | **121.9** | 82.8 | - | - | 49.5 | 63.1 | 33.4 |
| | Shikra (Vicuna-13B) | - | 73.9 | 77.36 | 47.16 | - | - | - |
| | **Qwen-VL (Qwen-7B)** | 121.4 | **85.8** | **78.8** | **58.6** | **59.3** | 67.1 | 35.2 |
| | Qwen-VL-Chat | 120.2 | 81.0 | 78.2 | 56.6 | 57.5 | **68.2** | **38.9** |
| Previous SOTA (Per Task Fine-tuning) | - | 127.0 (PALI-17B) | 84.5 (InstructBLIP-FlanT5-XL) | 86.1 (PALI-X-55B) | 66.1 (PALI-X-55B) | 72.1 (CFR) | 92.53 (LLaVa+GPT-4) | 70.9 (PALI-X-55B) |

*   For zero-shot image captioning, Qwen-VL achieves the **SOTA** on Flickr30K and results competitive with InstructBLIP on NoCaps.
*   For general VQA, Qwen-VL achieves the **SOTA** under the same generalist LVLM scale settings.

### Text-oriented VQA

| Model type | Model | TextVQA | DocVQA | ChartQA | AI2D | OCR-VQA |
|:---|:---|---:|---:|---:|---:|---:|
| Generalist Models | BLIP-2 (Vicuna-13B) | 42.4 | - | - | - | - |
| | InstructBLIP (Vicuna-13B) | 50.7 | - | - | - | - |
| | mPLUG-DocOwl (LLaMA-7B) | 52.6 | 62.2 | 57.4 | - | - |
| | Pix2Struct-Large (1.3B) | - | **76.6** | 58.6 | 42.1 | 71.3 |
| | Qwen-VL (Qwen-7B) | **63.8** | 65.1 | **65.7** | **62.3** | **75.7** |
| Specialist SOTAs (Specialist/Finetuned) | PALI-X-55B (Single-task FT, without OCR pipeline) | 71.44 | 80.0 | 70.0 | 81.2 | 75.0 |

*   In text-related recognition/QA evaluation, Qwen-VL achieves the SOTA under the generalist LVLM scale settings.
*   Resolution matters for several of the evaluations above. While most open-source LVLMs with 224 resolution cannot handle these evaluations or can only solve them by cropping images, Qwen-VL scales the resolution to 448 so that it can be evaluated end-to-end. Qwen-VL even outperforms Pix2Struct-Large models with 1024 resolution on some tasks.

### Referring Expression Comprehension

| Model type | Model | RefCOCO val | RefCOCO test-A | RefCOCO test-B | RefCOCO+ val | RefCOCO+ test-A | RefCOCO+ test-B | RefCOCOg val-u | RefCOCOg test-u | GRIT refexp |
|:---|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Generalist Models | GPV-2 | - | - | - | - | - | - | - | - | 51.50 |
| | OFA-L\* | 79.96 | 83.67 | 76.39 | 68.29 | 76.00 | 61.75 | 67.57 | 67.58 | 61.70 |
| | Unified-IO | - | - | - | - | - | - | - | - | **78.61** |
| | VisionLLM-H | 86.70 | - | - | - | - | - | - | - | |
| | Shikra-7B | 87.01 | 90.61 | 80.24 | 81.60 | 87.36 | 72.12 | 82.27 | 82.19 | 69.34 |
| | Shikra-13B | 87.83 | 91.11 | 81.81 | 82.89 | 87.79 | 74.41 | 82.64 | 83.16 | 69.03 |
| | Qwen-VL-7B | **89.36** | 92.26 | **85.34** | **83.12** | 88.25 | **77.21** | 85.58 | 85.48 | 78.22 |
| | Qwen-VL-7B-Chat | 88.55 | **92.27** | 84.51 | 82.82 | **88.59** | 76.79 | **85.96** | **86.32** | - |
| Specialist SOTAs (Specialist/Finetuned) | G-DINO-L | 90.56 | 93.19 | 88.24 | 82.75 | 88.95 | 75.92 | 86.13 | 87.02 | - |
| | UNINEXT-H | 92.64 | 94.33 | 91.46 | 85.24 | 89.63 | 79.79 | 88.73 | 89.37 | - |
| | ONE-PEACE | 92.58 | 94.18 | 89.26 | 88.77 | 92.21 | 83.23 | 89.22 | 89.27 | - |


*   Among generalist models, Qwen-VL achieves the **SOTA** on all of the RefCOCO, RefCOCO+, and RefCOCOg benchmarks above.
*   Qwen-VL has not been trained on any Chinese grounding data, but it can still generalize to Chinese grounding tasks in a zero-shot way, having been trained on Chinese caption data and English grounding data.

We provide all of the above evaluation scripts for reproducing our experimental results. Please read [eval/EVALUATION.md](/Qwen/Qwen-VL-Chat/blob/main/eval/EVALUATION.md) for more information.

### Chat Evaluation


TouchStone is a benchmark that uses GPT4 scoring to evaluate LVLMs on text-image dialogue and alignment with humans. It covers a total of 300+ images, 800+ questions, and 27 categories, such as attribute-based Q&A, celebrity recognition, writing poetry, summarizing multiple images, product comparison, and math problem solving. Please read [touchstone/README.md](/Qwen/Qwen-VL-Chat/blob/main/touchstone/README.md) for more information.

#### English

| Model | Score |
|:---|---:|
| PandaGPT | 488.5 |
| MiniGPT4 | 531.7 |
| InstructBLIP | 552.4 |
| LLaMA-AdapterV2 | 590.1 |
| mPLUG-Owl | 605.4 |
| LLaVA | 602.7 |
| Qwen-VL-Chat | 645.2 |

#### Chinese

| Model | Score |
|:---|---:|
| VisualGLM | 247.1 |
| Qwen-VL-Chat | 401.2 |


Qwen-VL-Chat has achieved the best results in both Chinese and English alignment evaluation.  

FAQ
---


If you run into problems, please consult the [FAQ](https://github.com/QwenLM/Qwen-VL/blob/master/FAQ.md) and the existing issues for a solution before opening a new issue.  

License Agreement
-----------------


Researchers and developers are free to use the codes and model weights of both Qwen-VL and Qwen-VL-Chat. We also allow their commercial use. Check our license at [LICENSE](/Qwen/Qwen-VL-Chat/blob/main/LICENSE) for more details.  

Citation
--------


If you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil: :)

    @article{Qwen-VL,
      title={Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities},
      author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
      journal={arXiv preprint arXiv:2308.12966},
      year={2023}
    }
    

  

Contact Us
----------


If you would like to leave a message for our research team or product team, feel free to send an email to [qianwen\_opensource@alibabacloud.com](mailto:qianwen_opensource@alibabacloud.com).

## Model overview

`Qwen-VL-Chat` is a large vision language model proposed by Alibaba Cloud. It is the visual multimodal version of the Qwen (Tongyi Qianwen) large model series. `Qwen-VL-Chat` accepts images, text, and bounding boxes as inputs, and outputs text and bounding boxes. It is the chat model built on the pretrained base `Qwen-VL` model.

`Qwen-VL-Chat` is pretrained on large-scale data and can be used for a variety of vision-language tasks such as image captioning, visual question answering, and referring expression comprehension. Compared to the base `Qwen-VL` model, `Qwen-VL-Chat` has enhanced capabilities for interactive visual dialogue.

## Model inputs and outputs

### Inputs
- **Image**: An image, referenced in the prompt by a local path or URL
- **Text**: A textual prompt or dialogue history
- **Bounding box**: Locations of objects or regions of interest in the image

### Outputs
- **Text**: The model's generated response text
- **Bounding box**: Locations of objects or regions referred to in the output text

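A grounded answer pairs an optional `<ref>` phrase with a `<box>(x1,y1),(x2,y2)</box>` span, as in the quickstart output. The sketch below parses that string and maps the corners to pixel coordinates, assuming (from our reading of the Qwen-VL paper, not this card) that box coordinates are normalized to a 0 to 1000 range. `parse_boxes` and `to_pixels` are hypothetical helpers, not library functions.

```python
import re
from typing import List, Tuple

# Matches an optional <ref>phrase</ref> followed by <box>(x1,y1),(x2,y2)</box>.
BOX_PATTERN = re.compile(
    r"(?:<ref>(?P<ref>.*?)</ref>)?"
    r"<box>\((?P<x1>\d+),(?P<y1>\d+)\),\((?P<x2>\d+),(?P<y2>\d+)\)</box>"
)

Box = Tuple[str, int, int, int, int]


def parse_boxes(response: str) -> List[Box]:
    """Extract (ref, x1, y1, x2, y2) tuples from a model response."""
    return [
        (m.group("ref") or "", int(m.group("x1")), int(m.group("y1")),
         int(m.group("x2")), int(m.group("y2")))
        for m in BOX_PATTERN.finditer(response)
    ]


def to_pixels(box: Box, width: int, height: int,
              scale: int = 1000) -> Tuple[int, int, int, int]:
    """Map normalized corners (assumed range 0..scale) to pixel coordinates."""
    _, x1, y1, x2, y2 = box
    return (round(x1 * width / scale), round(y1 * height / scale),
            round(x2 * width / scale), round(y2 * height / scale))


boxes = parse_boxes("<ref>high five</ref><box>(517,508),(589,611)</box>")
print(boxes)                                      # [('high five', 517, 508, 589, 611)]
print(to_pixels(boxes[0], width=1024, height=768))  # (529, 390, 603, 469)
```

Keeping the coordinates normalized in the model output makes them independent of the input image size; the conversion to pixels happens only when drawing.
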
## Capabilities

`Qwen-VL-Chat` can perform a wide range of vision-language tasks, including:
- Image captioning: Generating descriptions for images
- Visual question answering: Answering questions about the content of images
- Referring expression comprehension: Localizing objects or regions in images based on textual referring expressions
- Visual dialogue: Engaging in back-and-forth conversations about images, by understanding the visual context and generating relevant responses

The model leverages both visual and textual information to produce more accurate and contextually appropriate outputs compared to models that only use text or vision alone.

## What can I use it for?

`Qwen-VL-Chat` can be used in a variety of applications that involve understanding and reasoning about visual information, such as:

- Intelligent image search and retrieval: Allowing users to search for and retrieve relevant images using natural language queries.
- Automated image captioning and description generation: Generating descriptive captions for images to aid accessibility or summarize visual content.
- Visual question answering: Building AI assistants that can answer questions about the contents of images.
- Interactive visual dialogue systems: Creating chatbots that can engage in back-and-forth conversations about images, answering follow-up questions and providing additional information.
- Multimodal content creation and editing: Assisting users in creating and manipulating visual content by understanding both the image and textual context.

These capabilities can be leveraged in a wide range of industries, such as e-commerce, education, entertainment, and more.

## Things to try

One interesting aspect of `Qwen-VL-Chat` is its ability to ground language in visual context and generate responses that are tailored to the specific image being discussed. For example, you could try providing the model with an image and a question about the contents of the image, and see how it leverages the visual information to provide a detailed and relevant answer.

Another interesting area to explore is the model's capacity for interactive visual dialogue. You could try engaging the model in a back-and-forth conversation about an image, asking follow-up questions or providing additional context, and observe how it updates its understanding and generates appropriate responses.

Additionally, you could experiment with using `Qwen-VL-Chat` for tasks like image captioning or referring expression comprehension, and compare its performance to other vision-language models. This could help you better understand the model's strengths and limitations in different applications.

Qwen1.5-72B-Chat
====================================

Introduction
-----------------------------

Qwen1.5 is the beta version of Qwen2, a transformer-based decoder-only language model pretrained on a large amount of data. In comparison with the previously released Qwen, the improvements include:

*   8 model sizes, including 0.5B, 1.8B, 4B, 7B, 14B, 32B and 72B dense models, and an MoE model of 14B with 2.7B activated;
*   Significant performance improvement in human preference for chat models;
*   Multilingual support of both base and chat models;
*   Stable support of 32K context length for models of all sizes;
*   No need for `trust_remote_code`.

For more details, please refer to our [blog post](https://qwenlm.github.io/blog/qwen1.5/) and [GitHub repo](https://github.com/QwenLM/Qwen1.5).  

Model Details
-------------------------------

Qwen1.5 is a language model series including decoder language models of different model sizes. For each size, we release the base language model and the aligned chat model. It is based on the Transformer architecture with SwiGLU activation, attention QKV bias, group query attention, a mixture of sliding window attention and full attention, etc. Additionally, we have an improved tokenizer adaptive to multiple natural languages and code. For the beta version, we have temporarily not included GQA (except for 32B) or the mixture of SWA and full attention.
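Group query attention (GQA), listed above, lets several query heads share one key/value head, which shrinks the KV cache by the ratio of query heads to KV heads. The sketch below shows only the head-to-group mapping and the cache-size arithmetic; the helper names are hypothetical and the head counts are illustrative, not taken from any released Qwen1.5 configuration.

```python
def kv_head_for_query_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Index of the key/value head that query head `q_head` attends with under GQA."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads  # query heads sharing one K/V head
    return q_head // group_size


def kv_cache_elements(n_layers: int, n_kv_heads: int, head_dim: int, seq_len: int) -> int:
    """Number of cached values for one sequence: keys plus values in every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len


if __name__ == "__main__":
    # Hypothetical sizes, not the configuration of any released Qwen1.5 checkpoint.
    print([kv_head_for_query_head(h, n_q_heads=8, n_kv_heads=2) for h in range(8)])
    # [0, 0, 0, 0, 1, 1, 1, 1]
    mha = kv_cache_elements(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
    gqa = kv_cache_elements(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)
    print(mha // gqa)  # 4
```

Setting `n_kv_heads` equal to `n_q_heads` recovers standard multi-head attention, and `n_kv_heads=1` gives multi-query attention; GQA sits between the two.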

Training details
-------------------------------------

We pretrained the models with a large amount of data, and we post-trained the models with both supervised finetuning and direct preference optimization.

Requirements
-----------------------------

The code for Qwen1.5 is included in the latest Hugging Face transformers, and we advise you to install `transformers>=4.37.0`; otherwise you might encounter the following error:

    KeyError: 'qwen2'
    

Quickstart
-------------------------

The following code snippet uses `apply_chat_template` to show how to load the tokenizer and model and how to generate content.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    device = "cuda" # the device to load the model onto
    
    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen1.5-72B-Chat",
        torch_dtype="auto",
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-72B-Chat")
    
    prompt = "Give me a short introduction to large language model."
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)
    
    generated_ids = model.generate(
        model_inputs.input_ids,
        attention_mask=model_inputs.attention_mask,
        max_new_tokens=512
    )
    # Keep only the newly generated tokens by dropping the prompt prefix
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(response)
    

For quantized models, we advise you to use the GPTQ, AWQ, and GGUF counterparts, namely `Qwen1.5-72B-Chat-GPTQ-Int4`, `Qwen1.5-72B-Chat-GPTQ-Int8`, `Qwen1.5-72B-Chat-AWQ`, and `Qwen1.5-72B-Chat-GGUF`.
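With `tokenize=False`, `apply_chat_template` in the quickstart returns the prompt as a plain string. Qwen1.5 chat models use a ChatML-style layout with `<|im_start|>` and `<|im_end|>` markers; the hypothetical `render_chatml` helper below approximates that layout so you can see roughly what the model receives. Treat the tokenizer's own template as authoritative: it may differ in details, for example by inserting a default system message when none is given.

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Approximate the ChatML-style prompt string for a list of chat messages."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so generation continues as the assistant.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give me a short introduction to large language model."},
    ]
    print(render_chatml(messages))
```

Printing `text` from the quickstart next to this output is a quick way to spot template differences; if they disagree, trust the tokenizer.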

Tips
-------------

*   If you encounter code-switching or other undesirable outputs, we advise you to use the hyper-parameters we provide in `generation_config.json`.

Citation
---------------------

If you find our work helpful, please consider citing us.

    @article{qwen,
      title={Qwen Technical Report},
      author={Jinze Bai and Shuai Bai and Yunfei Chu and Zeyu Cui and Kai Dang and Xiaodong Deng and Yang Fan and Wenbin Ge and Yu Han and Fei Huang and Binyuan Hui and Luo Ji and Mei Li and Junyang Lin and Runji Lin and Dayiheng Liu and Gao Liu and Chengqiang Lu and Keming Lu and Jianxin Ma and Rui Men and Xingzhang Ren and Xuancheng Ren and Chuanqi Tan and Sinan Tan and Jianhong Tu and Peng Wang and Shijie Wang and Wei Wang and Shengguang Wu and Benfeng Xu and Jin Xu and An Yang and Hao Yang and Jian Yang and Shusheng Yang and Yang Yao and Bowen Yu and Hongyi Yuan and Zheng Yuan and Jianwei Zhang and Xingxuan Zhang and Yichang Zhang and Zhenru Zhang and Chang Zhou and Jingren Zhou and Xiaohuan Zhou and Tianhang Zhu},
      journal={arXiv preprint arXiv:2309.16609},
      year={2023}
    }

## Model Overview

`Qwen1.5-72B-Chat` is the beta version of the Qwen2 large language model, a transformer-based decoder-only model pretrained on a vast amount of data. Compared to the previous Qwen model, improvements include larger model sizes up to 72B parameters, significant performance gains in human preference for chat models, multilingual support, and stable support for 32K context length.

The [Qwen1.5-72B](https://aimodels.fyi/models/huggingFace/qwen15-72b-lucataco) model is another 72B-parameter model from the Qwen1.5 series, focused on general language modeling. In contrast, `Qwen1.5-72B-Chat` is specifically optimized for chat-style dialogue.

## Model Inputs and Outputs

### Inputs
- **Text prompts**: The model accepts natural language text prompts as input, which can be questions, statements, or open-ended requests.
- **Chat history**: The model can also take in previous dialog context to continue a multi-turn conversation.
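Multi-turn context is passed the same way as in the quickstart: as a growing `messages` list of `system`, `user`, and `assistant` entries. Because the context window is finite (32K tokens for Qwen1.5), long conversations eventually need trimming. The helper below is a minimal sketch that drops the oldest exchanges first; `trim_history` and the word-count stand-in for a tokenizer are hypothetical, and in practice you would count tokens with the model's own tokenizer.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def trim_history(
    messages: List[Message],
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> List[Message]:
    """Drop the oldest user/assistant pairs until the conversation fits max_tokens.

    Assumes an optional leading system message and a list that ends with the
    pending user message; both are always kept.
    """
    system = messages[:1] if messages and messages[0]["role"] == "system" else []
    turns = messages[len(system):]

    def total(msgs: List[Message]) -> int:
        return sum(count_tokens(m["content"]) for m in msgs)

    while len(turns) > 1 and total(system + turns) > max_tokens:
        turns = turns[2:]  # remove the oldest user + assistant pair
    return system + turns


def word_count(text: str) -> int:
    # Crude stand-in for a real tokenizer: one "token" per whitespace-separated word.
    return len(text.split())


if __name__ == "__main__":
    history = [
        {"role": "system", "content": "Be brief."},
        {"role": "user", "content": "Name a prime number."},
        {"role": "assistant", "content": "7"},
        {"role": "user", "content": "And another?"},
    ]
    print(trim_history(history, max_tokens=6, count_tokens=word_count))
```

The trimmed list can be passed straight to `apply_chat_template`; dropping whole pairs keeps the user/assistant alternation intact.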

### Outputs
- **Generated text**: The model's primary output is a coherent, contextually relevant response that continues the conversation.
- **Multilingual text**: The model can understand and generate text in multiple languages, including Chinese, English, and others.

## Capabilities

The `Qwen1.5-72B-Chat` model exhibits strong performance across a variety of benchmarks, outperforming similarly-sized open-source models. It demonstrates robust capabilities in language understanding, reasoning, and generation, as evidenced by its high scores on evaluations like MMLU, C-Eval, and GSM8K.

The model also shows impressive abilities in tasks like code generation, with a HumanEval zero-shot pass@1 score of 37.2%. Additionally, it exhibits strong long-context understanding, achieving a VCSUM Rouge-L score of 16.6 on a long-form summarization dataset.

## What Can I Use It For?

The `Qwen1.5-72B-Chat` model can be a powerful tool for building advanced conversational AI applications. Its multilingual capabilities and strong performance on dialog-oriented benchmarks make it well-suited for developing intelligent chatbots, virtual assistants, and other language-based interfaces.

Potential use cases include customer service automation, personal productivity assistants, educational tutors, and creative writing aides. The model's broad knowledge and reasoning skills also enable it to assist with research, analysis, and problem-solving tasks across various domains.

## Things to Try

One interesting aspect of the `Qwen1.5-72B-Chat` model is its ability to utilize external tools and APIs through "ReAct Prompting". This allows the model to dynamically call upon relevant plugins or APIs to enhance its capabilities, such as performing web searches, accessing databases, or invoking specialized computational engines.

Developers could experiment with integrating the model into a broader system architecture that leverages these external capabilities, enabling the chatbot to provide more comprehensive and actionable responses to user queries. The model's strong performance on the HuggingFace Agent benchmark suggests it is well-suited for this type of hybrid AI approach.
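In ReAct prompting, the application rather than the model executes tools: the model writes a thought and an action, the caller runs the tool, appends an `Observation:` line with the result, and asks the model to continue until it gives a final answer. The sketch below parses one model turn, assuming the common `Thought:` / `Action:` / `Action Input:` / `Final Answer:` layout; this card does not specify Qwen's exact prompt format, and the tool name `get_weather` is hypothetical.

```python
import json
import re
from typing import Any, Dict

# Assumed ReAct-style layout emitted by the model (tool names are hypothetical):
#   Thought: ...
#   Action: <tool name>
#   Action Input: <JSON arguments>
ACTION_RE = re.compile(
    r"Action:\s*(?P<tool>[^\n]+)\nAction Input:\s*(?P<args>.*?)(?:\nObservation:|\Z)",
    re.DOTALL,
)
FINAL_RE = re.compile(r"Final Answer:\s*(?P<answer>.*)", re.DOTALL)


def parse_react_step(text: str) -> Dict[str, Any]:
    """Classify one model turn as a tool call, a final answer, or unparsed text."""
    final = FINAL_RE.search(text)
    if final:
        return {"type": "final", "answer": final.group("answer").strip()}
    action = ACTION_RE.search(text)
    if action is None:
        return {"type": "unparsed", "text": text.strip()}
    raw_args = action.group("args").strip()
    try:
        args: Any = json.loads(raw_args)
    except json.JSONDecodeError:
        args = raw_args  # leave non-JSON arguments as a plain string
    return {"type": "action", "tool": action.group("tool").strip(), "input": args}


if __name__ == "__main__":
    turn = 'Thought: I need the weather.\nAction: get_weather\nAction Input: {"city": "Hangzhou"}\n'
    print(parse_react_step(turn))
    # {'type': 'action', 'tool': 'get_weather', 'input': {'city': 'Hangzhou'}}
```

In a full loop you would typically stop generation at `Observation:` so the model does not invent tool results, run the tool named in `tool`, then append `Observation: <result>` before calling the model again.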

Qwen-14B
=====================

![](https://qianwen-res.oss-cn-beijing.aliyuncs.com/logo_qwen.jpg)

  

[Hugging Face](https://huggingface.co/Qwen) | [ModelScope](https://modelscope.cn/organization/qwen) | [Paper](https://arxiv.org/abs/2309.16609) | [Demo](https://modelscope.cn/studios/qwen/Qwen-14B-Chat-Demo/summary)  
[WeChat](https://github.com/QwenLM/Qwen/blob/main/assets/wechat.png) | [Discord](https://discord.gg/z3GAxXZ9Ce) | [API](https://dashscope.aliyun.com)

  

Introduction
-------------------------------------

**Qwen-14B** is the 14B-parameter version of the large language model series Qwen (short for Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-14B is a Transformer-based large language model pretrained on a large volume of data, including web texts, books, code, etc. Additionally, based on the pretrained Qwen-14B, we release Qwen-14B-Chat, a large-model-based AI assistant trained with alignment techniques. This repository is for Qwen-14B.

The features of Qwen-14B include:

1.  **Large-scale high-quality training corpora**: It is pretrained on over 3 trillion tokens, including Chinese, English, multilingual texts, code, and mathematics, covering general and professional fields. The distribution of the pre-training corpus has been optimized through a large number of ablation experiments.
2.  **Competitive performance**: It significantly surpasses existing open-source models of similar scale on multiple Chinese and English downstream evaluation tasks (including commonsense, reasoning, code, mathematics, etc.), and even surpasses some larger-scale models in several benchmarks. See below for specific evaluation results.
3.  **More comprehensive vocabulary coverage**: Compared with other open-source models based on Chinese and English vocabularies, Qwen-14B uses a vocabulary of over 150K tokens. This vocabulary is friendlier to multiple languages, enabling users to further enhance capabilities in specific languages directly, without expanding the vocabulary.

For more details about the open-source model of Qwen-14B, please refer to the [GitHub](https://github.com/QwenLM/Qwen) code repository.  

Requirements
-------------------------------------

*   Python 3.8 and above
*   PyTorch 1.12 and above (2.0 and above recommended)
*   CUDA 11.4 and above recommended (for GPU users, flash-attention users, etc.)
    

Dependency
-----------------------------------

To run Qwen-14B, please make sure you meet the above requirements, and then execute the following pip commands to install the dependent libraries.

    pip install transformers==4.32.0 accelerate tiktoken einops scipy transformers_stream_generator==0.0.4 peft deepspeed
    

In addition, it is recommended to install the `flash-attention` library (**we support flash attention 2 now.**) for higher efficiency and lower memory usage.

    git clone https://github.com/Dao-AILab/flash-attention
    cd flash-attention && pip install .
    # Below are optional. Installing them might be slow.
    # pip install csrc/layer_norm
    # pip install csrc/rotary
    

  

Quickstart
-------------------------------------



You can easily call the model with the following code:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from transformers.generation import GenerationConfig
    
    # Note: The default behavior now has injection attack prevention off.
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-14B", trust_remote_code=True)
    
    # use bf16
    # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-14B", device_map="auto", trust_remote_code=True, bf16=True).eval()
    # use fp16
    # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-14B", device_map="auto", trust_remote_code=True, fp16=True).eval()
    # use cpu only
    # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-14B", device_map="cpu", trust_remote_code=True).eval()
    # use auto mode, automatically select precision based on the device.
    model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-14B", device_map="auto", trust_remote_code=True).eval()
    
    # Specify hyperparameters for generation. But if you use transformers>=4.32.0, there is no need to do this.
    # model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-14B", trust_remote_code=True)
    
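    inputs = tokenizer('The capital of Mongolia is Ulaanbaatar\nThe capital of Iceland is Reykjavik\nThe capital of Ethiopia is', return_tensors='pt')
    inputs = inputs.to(model.device)
    pred = model.generate(**inputs)
    print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
    # The capital of Mongolia is Ulaanbaatar\nThe capital of Iceland is Reykjavik\nThe capital of Ethiopia is Addis Ababa...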
    

For more information, please refer to our [GitHub repo](https://github.com/QwenLM/Qwen).  

Tokenizer
-----------------------

Our tokenizer, based on tiktoken, differs from other tokenizers such as sentencepiece-based ones. You need to pay attention to special tokens, especially during fine-tuning. For more details on the tokenizer and its use in fine-tuning, please refer to the [documentation](https://github.com/QwenLM/Qwen/blob/main/tokenization_note.md).  

Model
---------------------------

The details of the model architecture of Qwen-14B are listed as follows:

| Hyperparameter | Value |
|:--|--:|
| n\_layers | 40 |
| n\_heads | 40 |
| d\_model | 5120 |
| vocab size | 151851 |
| sequence length | 2048 |

![](/Qwen/Qwen-14B/resolve/main/assets/tokenizer.png)

For position encoding, FFN activation function, and normalization methods, we adopt the prevalent practices, i.e., RoPE relative position encoding, SwiGLU for activation function, and RMSNorm for normalization (optional installation of flash-attention for acceleration).
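To make these three components concrete, here is a dependency-free sketch on plain Python lists. It is illustrative only: real implementations operate on batched tensors, and this RoPE variant rotates adjacent pairs (x[0], x[1]), (x[2], x[3]), and so on, whereas many Hugging Face-style implementations pair dimension i with i + d/2, which is the same rotation up to a permutation of dimensions. The head size of 128 follows from the architecture table (d\_model 5120 divided by 40 heads); the vectors are arbitrary.

```python
import math
from typing import List


def rms_norm(x: List[float], weight: List[float], eps: float = 1e-6) -> List[float]:
    """RMSNorm: divide by the root-mean-square (no mean subtraction, no bias), then scale."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def silu(v: float) -> float:
    return v / (1.0 + math.exp(-v))


def swiglu(gate: List[float], up: List[float]) -> List[float]:
    """SwiGLU combination SiLU(gate) * up; gate = x W_gate and up = x W_up are computed beforehand."""
    return [silu(g) * u for g, u in zip(gate, up)]


def apply_rope(x: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotate each pair (x[i], x[i+1]) by position * base**(-i/d), for i = 0, 2, 4, ..."""
    d = len(x)
    out = list(x)
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out


if __name__ == "__main__":
    head_dim = 5120 // 40  # d_model / n_heads from the architecture table -> 128
    q = [math.sin(i) for i in range(head_dim)]
    key = [math.cos(i) for i in range(head_dim)]

    def dot(a: List[float], b: List[float]) -> float:
        return sum(p * r for p, r in zip(a, b))

    # Both scores use a relative offset of 2, so they should agree up to rounding.
    print(dot(apply_rope(q, 7), apply_rope(key, 5)))
    print(dot(apply_rope(q, 102), apply_rope(key, 100)))
```

The two printed scores matching is the property that makes RoPE a relative position encoding: the attention score depends on the distance between positions, not on where the pair sits in the sequence.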

For tokenization, compared to the current mainstream open-source models based on Chinese and English vocabularies, Qwen-14B uses a vocabulary of over 150K tokens. It prioritizes efficient encoding of Chinese, English, and code data, and is also friendlier to many other languages, enabling users to directly enhance the model's capability in some languages without expanding the vocabulary. It splits numbers into single digits and calls the [tiktoken](https://github.com/openai/tiktoken) tokenizer library for efficient tokenization.
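Splitting numbers into single digits happens before any BPE merges, so a number like 151851 becomes six digit tokens rather than arbitrary multi-digit chunks. The regex below is a simplified, hypothetical pre-tokenizer that only illustrates this idea; it is not the pattern Qwen's tiktoken-based tokenizer actually uses.

```python
import re
from typing import List

# Simplified, hypothetical pre-tokenization pattern (not Qwen's actual regex):
# every digit is its own piece, letter runs stay together, punctuation is split off.
PRETOKEN_RE = re.compile(r"\d|[^\W\d_]+|[^\w\s]|_")


def pretokenize(text: str) -> List[str]:
    """Split text into pieces before BPE merges would be applied; whitespace is dropped."""
    return PRETOKEN_RE.findall(text)


if __name__ == "__main__":
    print(pretokenize("Qwen-14B has 151851 tokens"))
    # ['Qwen', '-', '1', '4', 'B', 'has', '1', '5', '1', '8', '5', '1', 'tokens']
```

Because every number decomposes the same way, the model sees a consistent digit-level representation, which is generally considered helpful for arithmetic.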

We randomly selected 1 million documents in each language to test and compare the encoding compression rates of different models, using XLM-R, which supports 100 languages, as the baseline value of 1. The results are shown in the figure above.

As can be seen, while ensuring efficient encoding of Chinese, English, and code, Qwen-14B also achieves a high compression rate for many other languages (such as th, he, ar, ko, vi, ja, tr, id, pl, ru, nl, pt, it, de, es, and fr), which gives the model strong scalability as well as high training and inference efficiency in these languages.

For pre-training data, Qwen-14B uses part of the open-source general corpus on the one hand, and a massive amount of accumulated web corpus and high-quality text content on the other. After deduplication and filtering, the corpus exceeds 3T tokens, encompassing web text, encyclopedias, books, code, mathematics, and various other domains.  

Evaluation
-------------------------------------

We selected MMLU, C-Eval, GSM8K, MATH, HumanEval, MBPP, BBH, and CMMLU, which are currently popular benchmarks, to test the model's Chinese and English knowledge, translation, mathematical reasoning, coding, and other capabilities. The comprehensive evaluation results below show that Qwen models outperform similarly sized open-source models on all tasks.

| Model | MMLU (5-shot) | C-Eval (5-shot) | GSM8K (8-shot) | MATH (4-shot) | HumanEval (0-shot) | MBPP (3-shot) | BBH (3-shot) | CMMLU (5-shot) |
|:--|--:|--:|--:|--:|--:|--:|--:|--:|
| LLaMA2-7B | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 31.8 |
| LLaMA2-13B | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 38.4 |
| LLaMA2-34B | 62.6 | - | 42.2 | 6.2 | 22.6 | 33.0 | 44.1 | - |
| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 6.5 | - | - | 33.7 | - |
| InternLM-7B | 51.0 | 53.4 | 31.2 | 6.3 | 10.4 | 14.0 | 37.0 | 51.8 |
| InternLM-20B | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 |
| Baichuan2-7B | 54.7 | 56.3 | 24.6 | 5.6 | 18.3 | 24.2 | 41.6 | 57.1 |
| Baichuan2-13B | 59.5 | 59.0 | 52.8 | 10.1 | 17.1 | 30.2 | 49.0 | 62.0 |
| Qwen-7B (original) | 56.7 | 59.6 | 51.6 | - | 24.4 | 31.2 | 40.6 | 58.8 |
| **Qwen-7B** | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 |
| **Qwen-14B** | **66.3** | **72.1** | **61.3** | **24.8** | **32.3** | **40.8** | **53.4** | **71.0** |

### Long-Context Evaluation

We introduce NTK-aware interpolation, LogN attention scaling, window attention, etc. to extend the context length to over 8K tokens. We conduct language modeling experiments on the arXiv dataset and report perplexity (PPL). Results are shown in the table below.

**(To use NTK interpolation and LogN scaling, please set `use_dynamic_ntk` and `use_logn_attn` to true in config.json.)**
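Setting the two flags, `use_dynamic_ntk` and `use_logn_attn`, can be scripted with the standard library alone. The sketch below edits a local `config.json` in place; the helper name `enable_long_context` is hypothetical, and the path you pass in (for example, a locally downloaded checkpoint directory) is up to you.

```python
import json
from pathlib import Path


def enable_long_context(config_path: str) -> dict:
    """Turn on dynamic NTK interpolation and LogN attention scaling in a config.json file."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["use_dynamic_ntk"] = True
    config["use_logn_attn"] = True
    # Rewrite the file, keeping all other keys unchanged.
    path.write_text(json.dumps(config, indent=2, ensure_ascii=False) + "\n", encoding="utf-8")
    return config
```

For example, `enable_long_context("Qwen-14B/config.json")` (a hypothetical local path) before loading the model with `from_pretrained` on that directory.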

| Model / Sequence Length | 1024 | 2048 | 4096 | 8192 | 16384 | 32768 |
|:--|--:|--:|--:|--:|--:|--:|
| Qwen-7B (original) | 4.23 | 3.78 | 39.35 | 469.81 | 2645.09 | - |
| + dynamic\_ntk | 4.23 | 3.78 | 3.59 | 3.66 | 5.71 | - |
| + dynamic\_ntk + logn | 4.23 | 3.78 | 3.58 | 3.56 | 4.62 | - |
| + dynamic\_ntk + logn + window\_attn | 4.23 | 3.78 | 3.58 | 3.49 | 4.32 | - |
| Qwen-7B | **4.23** | **3.81** | **3.52** | **3.31** | 7.27 | 181.49 |
| + dynamic\_ntk + logn + window\_attn | **4.23** | **3.81** | **3.52** | **3.33** | **3.22** | **3.17** |
| Qwen-14B | **-** | **3.46** | 22.79 | 334.65 | 3168.35 | - |
| + dynamic\_ntk + logn + window\_attn | **-** | **3.46** | **3.29** | **3.18** | 3.42 | - |

Reproduction
-----------------------------------------

We have provided evaluation scripts to reproduce the performance of our model; see this [link](https://github.com/QwenLM/Qwen/tree/main/eval) for details.  

FAQ
-----------

If you run into problems, please consult the [FAQ](https://github.com/QwenLM/Qwen/blob/main/FAQ.md) and search existing issues for a solution before opening a new issue.  

Citation
-----------------------------



If you find our work helpful, please consider citing us.

    @article{qwen,
      title={Qwen Technical Report},
      author={Jinze Bai and Shuai Bai and Yunfei Chu and Zeyu Cui and Kai Dang and Xiaodong Deng and Yang Fan and Wenbin Ge and Yu Han and Fei Huang and Binyuan Hui and Luo Ji and Mei Li and Junyang Lin and Runji Lin and Dayiheng Liu and Gao Liu and Chengqiang Lu and Keming Lu and Jianxin Ma and Rui Men and Xingzhang Ren and Xuancheng Ren and Chuanqi Tan and Sinan Tan and Jianhong Tu and Peng Wang and Shijie Wang and Wei Wang and Shengguang Wu and Benfeng Xu and Jin Xu and An Yang and Hao Yang and Jian Yang and Shusheng Yang and Yang Yao and Bowen Yu and Hongyi Yuan and Zheng Yuan and Jianwei Zhang and Xingxuan Zhang and Yichang Zhang and Zhenru Zhang and Chang Zhou and Jingren Zhou and Xiaohuan Zhou and Tianhang Zhu},
      journal={arXiv preprint arXiv:2309.16609},
      year={2023}
    }
    

  

License Agreement
---------------------------------------------------

Our code and checkpoints are open for research purposes, and they are also allowed for commercial purposes. Check [LICENSE](https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT) for more details about the license. If you have requirements for commercial use, please fill out the [form](https://dashscope.console.aliyun.com/openModelApply/Qwen-14B-Chat) to apply.  

Contact Us
-------------------------------------

If you would like to leave a message for either our research team or product team, join our Discord or WeChat groups! Also, feel free to send an email to [qianwen\_opensource@alibabacloud.com](mailto:qianwen_opensource@alibabacloud.com).

## Model overview

`Qwen-14B` is the 14B-parameter version of the large language model series Qwen (short for Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-14B is a Transformer-based large language model pretrained on a large volume of data, including web texts, books, code, etc. Additionally, based on the pretrained Qwen-14B, [Qwen-14B-Chat](https://aimodels.fyi/models/huggingFace/qwen-14b-chat-qwen) was released: a large-model-based AI assistant trained with alignment techniques.

Qwen-14B features a large-scale high-quality training corpus of over 3 trillion tokens, covering Chinese, English, multilingual texts, code, and mathematics. It significantly surpasses existing open-source models of similar scale on multiple Chinese and English downstream evaluation tasks. Qwen-14B also uses a more comprehensive vocabulary of over 150K tokens, enabling users to directly enhance capabilities for certain languages without expanding the vocabulary.

## Model inputs and outputs

### Inputs
- **Text**: Qwen-14B accepts text input of up to 2048 tokens.

### Outputs
- **Text**: Qwen-14B generates text output in response to the input.

## Capabilities

Qwen-14B demonstrates competitive performance across a range of benchmarks. In the evaluation results reported above, it scores 66.3 on MMLU (5-shot) and 72.1 on C-Eval (5-shot), ahead of the similarly sized open-source models compared there. It also performs well on coding and mathematics, scoring 32.3 on HumanEval (0-shot), 40.8 on MBPP (3-shot), 61.3 on GSM8K (8-shot), and 24.8 on MATH (4-shot).

## What can I use it for?

The large scale and broad capabilities of Qwen-14B make it suitable for a variety of natural language processing tasks. Potential use cases include:

- **Content generation**: Qwen-14B can be used to generate high-quality text on a wide range of topics, from creative writing to technical documentation.
- **Conversational AI**: Building on the Qwen-14B-Chat model, developers can create advanced chatbots and virtual assistants.
- **Multilingual support**: The model's comprehensive vocabulary allows it to handle multiple languages, enabling cross-lingual applications.
- **Code generation and reasoning**: Qwen-14B's strong performance on coding and math tasks makes it useful for programming-related applications.

## Things to try

One interesting aspect of Qwen-14B is its ability to handle inputs longer than its 2,048-token sequence length. With NTK-aware interpolation, LogN attention scaling, and window attention enabled, its reported arXiv perplexity is 3.42 at 16,384 tokens, compared with 3168.35 without these techniques (32,768-token results are reported only for Qwen-7B). Developers could explore leveraging this capability for tasks like long-form summarization or knowledge-intensive QA.
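The two techniques named above can be sketched in a few lines. Dynamic NTK-aware interpolation enlarges the RoPE base once the input grows past the training length; LogN scaling multiplies queries by the logarithm of the position, in base equal to the training length, beyond that point. The alpha schedule and where the LogN factor is applied are assumptions for illustration, not a transcription of Qwen's modeling code, and window attention is not covered. The defaults used below (2048-token length, 128-dimensional heads) follow the Qwen-14B architecture table.

```python
import math


def dynamic_ntk_base(seq_len: int, train_len: int, head_dim: int, base: float = 10000.0) -> float:
    """RoPE base to use at seq_len; unchanged until the training length is exceeded."""
    if seq_len <= train_len:
        return base
    # Assumed schedule: alpha roughly tracks the length ratio, rounded up to 2**k - 1.
    context_value = math.log2(seq_len / train_len) + 1
    alpha = max(2 ** math.ceil(context_value) - 1, 1)
    # The d/(d-2) exponent stretches the lowest RoPE frequency's wavelength by alpha
    # while leaving the highest frequency unchanged.
    return base * alpha ** (head_dim / (head_dim - 2))


def logn_scale(position: int, train_len: int) -> float:
    """LogN factor multiplied into the query at a 1-indexed position."""
    return math.log(position, train_len) if position > train_len else 1.0


if __name__ == "__main__":
    for n in (2048, 4096, 8192, 16384):
        base = dynamic_ntk_base(n, train_len=2048, head_dim=128)
        print(n, round(base), round(logn_scale(n, train_len=2048), 3))
```

Because the base only changes once the input exceeds the training length, short prompts behave exactly as in training, which matches the identical 1024 and 2048 columns in the long-context table.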

Another intriguing area to experiment with is Qwen-14B's tool usage capabilities. The model supports ReAct prompting, allowing it to interact with external plugins and APIs. This could enable the development of intelligent assistants that can seamlessly integrate diverse functionalities.

CodeQwen1.5-7B-Chat
==========================================

Introduction
-----------------------------

CodeQwen1.5 is the code-specific version of Qwen1.5. It is a transformer-based decoder-only language model pretrained on a large amount of code data.

*   Strong code generation capabilities and competitive performance across a series of benchmarks;
*   Support for long-context understanding and generation with a context length of 64K tokens;
*   Support for 92 programming languages;
*   Excellent performance in text-to-SQL, bug fixing, etc.

For more details, please refer to our [blog post](https://qwenlm.github.io/blog/codeqwen1.5/) and [GitHub repo](https://github.com/QwenLM/Qwen1.5).

Model Details
-------------------------------

CodeQwen1.5 is based on Qwen1.5, a language model series including decoder language models of different sizes. It is trained on 3 trillion tokens of code data, and it uses group query attention (GQA) for efficient inference.

Requirements
-----------------------------

The code for Qwen1.5 is included in the latest Hugging Face transformers, and we advise you to install `transformers>=4.37.0`; otherwise you might encounter the following error:

    KeyError: 'qwen2'
    

Quickstart
-------------------------

The following code snippet uses `apply_chat_template` to show how to load the tokenizer and model and how to generate content.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    device = "cuda" # the device to load the model onto
    
    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/CodeQwen1.5-7B-Chat",
        torch_dtype="auto",
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained("Qwen/CodeQwen1.5-7B-Chat")
    
    prompt = "Write a quicksort algorithm in python."
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)
    
    generated_ids = model.generate(
        model_inputs.input_ids,
        attention_mask=model_inputs.attention_mask,
        max_new_tokens=512
    )
    # Keep only the newly generated tokens by dropping the prompt prefix
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(response)
    

Tips
-------------

*   If you encounter code-switching or other undesirable outputs, we advise you to use the hyper-parameters we provide in `generation_config.json`.
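Chat replies to coding prompts such as the quicksort request often wrap code in Markdown fences, so `response` from the quickstart may need post-processing before the code can be saved or run. A minimal sketch of such a helper follows; `extract_code_blocks` is hypothetical and only handles plain triple-backtick fences.

```python
import re
from typing import List, Optional

FENCE = "`" * 3  # built indirectly so this example can itself sit inside a Markdown fence
FENCE_RE = re.compile(FENCE + r"([\w+-]*)[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code_blocks(response: str, language: Optional[str] = None) -> List[str]:
    """Return the bodies of fenced code blocks, optionally filtered by language tag."""
    blocks = []
    for tag, body in FENCE_RE.findall(response):
        if language is None or tag.lower() == language.lower():
            blocks.append(body.rstrip("\n"))
    return blocks
```

For example, `extract_code_blocks(response, "python")` would return the Python snippets from the quickstart reply, or an empty list if the model answered without fences.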

Citation
---------------------

If you find our work helpful, please consider citing us.

    @article{qwen,
      title={Qwen Technical Report},
      author={Jinze Bai and Shuai Bai and Yunfei Chu and Zeyu Cui and Kai Dang and Xiaodong Deng and Yang Fan and Wenbin Ge and Yu Han and Fei Huang and Binyuan Hui and Luo Ji and Mei Li and Junyang Lin and Runji Lin and Dayiheng Liu and Gao Liu and Chengqiang Lu and Keming Lu and Jianxin Ma and Rui Men and Xingzhang Ren and Xuancheng Ren and Chuanqi Tan and Sinan Tan and Jianhong Tu and Peng Wang and Shijie Wang and Wei Wang and Shengguang Wu and Benfeng Xu and Jin Xu and An Yang and Hao Yang and Jian Yang and Shusheng Yang and Yang Yao and Bowen Yu and Hongyi Yuan and Zheng Yuan and Jianwei Zhang and Xingxuan Zhang and Yichang Zhang and Zhenru Zhang and Chang Zhou and Jingren Zhou and Xiaohuan Zhou and Tianhang Zhu},
      journal={arXiv preprint arXiv:2309.16609},
      year={2023}
    }

## Model overview

`CodeQwen1.5-7B-Chat` is a transformer-based language model developed by Qwen. It is a code-specific version of the larger Qwen1.5 model series, which includes language models of various sizes. `CodeQwen1.5-7B-Chat` is trained on a large amount of code data and excels at tasks like text-to-SQL, bug fixing, and more. Compared to the original Qwen1.5 model, `CodeQwen1.5-7B-Chat` has strong code generation capabilities and can handle long contexts of up to 64K tokens across 92 coding languages.

## Model inputs and outputs

### Inputs
- **Text**: `CodeQwen1.5-7B-Chat` can accept text inputs for various code-related tasks, such as prompts for code generation, text-to-SQL, and bug fixes.

### Outputs
- **Text**: The model generates text outputs, which can include code, SQL queries, or natural language responses related to the input.

## Capabilities

`CodeQwen1.5-7B-Chat` demonstrates impressive performance across a range of benchmarks, including text-to-SQL, bug fixing, and more. It can generate high-quality code and maintain coherence over long contexts of up to 64K tokens.

## What can I use it for?

`CodeQwen1.5-7B-Chat` can be a valuable tool for developers and data analysts who need assistance with code-related tasks. It can be used to generate code snippets, fix bugs, translate natural language to SQL queries, and more. The model's strong performance and ability to handle long contexts make it well-suited for complex, multi-step coding and data analysis projects.
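For text-to-SQL, one workable pattern is to put the schema and the question into the chat messages, then sanity-check the returned SQL before running it. The prompt wording below is a hypothetical example rather than a format recommended by the model authors, and `sql_compiles` only checks that SQLite can prepare the query against the schema; it says nothing about whether the query actually answers the question.

```python
import sqlite3
from typing import Dict, List


def build_text_to_sql_messages(schema_ddl: str, question: str) -> List[Dict[str, str]]:
    """Hypothetical prompt layout: schema first, then the question, asking for SQL only."""
    user = (
        "Given the following SQLite schema:\n"
        f"{schema_ddl.strip()}\n\n"
        f"Write one SQL query that answers: {question.strip()}\n"
        "Reply with the SQL only."
    )
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user},
    ]


def sql_compiles(schema_ddl: str, sql: str) -> bool:
    """Cheap sanity check: can SQLite prepare the query against an empty copy of the schema?"""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)
        conn.execute("EXPLAIN " + sql)
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()
```

The returned `messages` can go through `tokenizer.apply_chat_template` exactly as in the quickstart; the model's reply (after stripping any code fences) can then be passed to `sql_compiles` before it touches a real database.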

## Things to try

One interesting aspect of `CodeQwen1.5-7B-Chat` is its support for 92 programming languages. This can be particularly useful for developers working in less common programming languages, or for projects that mix several languages.