[](#qwen2-72b)Qwen2-72B
=======================

[](#introduction)Introduction
-----------------------------

Qwen2 is the new series of Qwen large language models. For Qwen2, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters, including a Mixture-of-Experts model. This repo contains the 72B Qwen2 base language model.

Compared with the state-of-the-art opensource language models, including the previous released Qwen1.5, Qwen2 has generally surpassed most opensource models and demonstrated competitiveness against proprietary models across a series of benchmarks targeting for language understanding, language generation, multilingual capability, coding, mathematics, reasoning, etc.

For more details, please refer to our [blog](https://qwenlm.github.io/blog/qwen2/), [GitHub](https://github.com/QwenLM/Qwen2), and [Documentation](https://qwen.readthedocs.io/en/latest/).  

[](#model-details)Model Details
-------------------------------

Qwen2 is a language model series including decoder language models of different model sizes. For each size, we release the base language model and the aligned chat model. It is based on the Transformer architecture with SwiGLU activation, attention QKV bias, group query attention, etc. Additionally, we have an improved tokenizer adaptive to multiple natural languages and codes.

[](#requirements)Requirements
-----------------------------

The code of Qwen2 has been in the latest Hugging face transformers and we advise you to install `transformers>=4.37.0`, or you might encounter the following error:

    KeyError: 'qwen2'
    

[](#usage)Usage
---------------

We do not advise you to use base language models for text generation. Instead, you can apply post-training, e.g., SFT, RLHF, continued pretraining, etc., on this model.

[](#performance)Performance
---------------------------

The evaluation of base models mainly focuses on the model performance of natural language understanding, general question answering, coding, mathematics, scientific knowledge, reasoning, multilingual capability, etc.

The datasets for evaluation include:

**English Tasks**: MMLU (5-shot), MMLU-Pro (5-shot), GPQA (5shot), Theorem QA (5-shot), BBH (3-shot), HellaSwag (10-shot), Winogrande (5-shot), TruthfulQA (0-shot), ARC-C (25-shot)

**Coding Tasks**: EvalPlus (0-shot) (HumanEval, MBPP, HumanEval+, MBPP+), MultiPL-E (0-shot) (Python, C++, JAVA, PHP, TypeScript, C#, Bash, JavaScript)

**Math Tasks**: GSM8K (4-shot), MATH (4-shot)

**Chinese Tasks**: C-Eval (5-shot), CMMLU (5-shot)

**Multilingual Tasks**: Multi-Exam (M3Exam 5-shot, IndoMMLU 3-shot, ruMMLU 5-shot, mMMLU 5-shot), Multi-Understanding (BELEBELE 5-shot, XCOPA 5-shot, XWinograd 5-shot, XStoryCloze 0-shot, PAWS-X 5-shot), Multi-Mathematics (MGSM 8-shot), Multi-Translation (Flores-101 5-shot)

#### [](#qwen2-72b-performance)Qwen2-72B performance

Datasets

DeepSeek-V2

Mixtral-8x22B

Llama-3-70B

Qwen1.5-72B

Qwen1.5-110B

**Qwen2-72B**

Architecture

MoE

MoE

Dense

Dense

Dense

Dense

#Activated Params

21B

39B

70B

72B

110B

72B

#Params

236B

140B

70B

72B

110B

72B

_**English**_

MMLU

78.5

77.8

79.5

77.5

80.4

**84.2**

MMLU-Pro

\-

49.5

52.8

45.8

49.4

**55.6**

GPQA

\-

34.3

36.3

36.3

35.9

**37.9**

Theorem QA

\-

35.9

32.3

29.3

34.9

**43.1**

BBH

78.9

78.9

81.0

65.5

74.8

**82.4**

HellaSwag

87.8

**88.7**

88.0

86.0

87.5

87.6

WindoGrande

84.8

85.0

**85.3**

83.0

83.5

85.1

ARC-C

70.0

**70.7**

68.8

65.9

69.6

68.9

TruthfulQA

42.2

51.0

45.6

**59.6**

49.6

54.8

_**Coding**_

HumanEval

45.7

46.3

48.2

46.3

54.3

**64.6**

MBPP

73.9

71.7

70.4

66.9

70.9

**76.9**

EvalPlus

55.0

54.1

54.8

52.9

57.7

**65.4**

MultiPL-E

44.4

46.7

46.3

41.8

52.7

**59.6**

_**Mathematics**_

GSM8K

79.2

83.7

83.0

79.5

85.4

**89.5**

MATH

43.6

41.7

42.5

34.1

49.6

**51.1**

_**Chinese**_

C-Eval

81.7

54.6

65.2

84.1

89.1

**91.0**

CMMLU

84.0

53.4

67.2

83.5

88.3

**90.1**

_**Multilingual**_

Mulit-Exam

67.5

63.5

70.0

66.4

75.6

**76.6**

Multi-Understanding

77.0

77.7

79.9

78.2

78.2

**80.7**

Multi-Mathematics

58.8

62.9

67.1

61.7

64.4

**76.0**

Multi-Translation

36.0

23.3

**38.0**

35.6

36.2

37.8

[](#citation)Citation
---------------------

If you find our work helpful, feel free to give us a cite.

    @article{qwen2,
      title={Qwen2 Technical Report},
      year={2024}
    }

## Model overview

The `Qwen2-72B` is a large-scale language model developed by Qwen, a team at Alibaba Cloud. It is part of the Qwen series of language models, which includes models ranging from 0.5 to 72 billion parameters. Compared to other open-source language models, Qwen2-72B has demonstrated strong performance across a variety of benchmarks targeting language understanding, generation, multilingual capability, coding, mathematics, and reasoning.

The model is based on the Transformer architecture and includes features like SwiGLU activation, attention QKV bias, group query attention, and an improved tokenizer that is adaptive to multiple natural languages and codes. Qwen2-72B has a large vocabulary of over 150,000 tokens, which enables efficient encoding of Chinese, English, and code data, as well as strong support for a wide range of other languages.

Similar to other models in the Qwen series, Qwen2-72B is a decoder-only language model that is not recommended for direct text generation. Instead, Qwen suggests applying techniques like supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), or continued pretraining to further enhance the model's capabilities.

## Model inputs and outputs

### Inputs
- The model takes in text input, which can be in a variety of languages including Chinese, English, and multilingual text.

### Outputs
- The model generates text output, which can be used for a variety of natural language processing tasks such as language understanding, generation, translation, and more.

## Capabilities

Qwen2-72B has demonstrated strong performance on a wide range of benchmarks, including commonsense reasoning, mathematical reasoning, coding, and multilingual tasks. For example, on the MMLU (Multi-Model Language Understanding) benchmark, Qwen2-72B achieved an average score of 77.4%, outperforming other large language models like [Qwen-72B](https://aimodels.fyi/models/huggingFace/qwen-72b-qwen) and [Qwen1.5-72B](https://aimodels.fyi/models/huggingFace/qwen15-72b-qwen). The model also showed impressive performance on coding tasks like HumanEval and MBPP, as well as mathematical reasoning tasks like GSM8K and MATH.

## What can I use it for?

The Qwen2-72B model can be used for a variety of natural language processing tasks, such as:

- **Text generation**: While the model is not recommended for direct text generation, it can be fine-tuned or used as a base for developing more specialized language models for tasks like content creation, dialogue systems, or summarization.
- **Language understanding**: The model's strong performance on benchmarks like MMLU suggests it can be useful for tasks like question answering, textual entailment, and other language understanding applications.
- **Multilingual applications**: The model's broad vocabulary and support for multiple languages make it well-suited for developing multilingual applications, such as translation systems or cross-lingual information retrieval.
- **Code-related tasks**: Given the model's strong performance on coding-related benchmarks, it could be leveraged for tasks like code generation, code summarization, or code understanding.

## Things to try

One interesting aspect of the Qwen2-72B model is its ability to handle long-context input. The model supports a context length of up to 32,768 tokens, which is significantly longer than many other language models. This makes it well-suited for tasks that require understanding and reasoning over long passages of text, such as summarization, question answering, or document-level language modeling.

Another interesting area to explore would be the model's performance on specialized domains or tasks, such as scientific or technical writing, legal reasoning, or financial analysis. By fine-tuning the model on domain-specific data, researchers and developers may be able to unlock additional capabilities and insights.