WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct (RLEIF)
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

 [Home Page](https://wizardlm.github.io/)

 [HF Repo](https://huggingface.co/WizardLM)  [Github Repo](https://github.com/nlpxucan/WizardLM)   [Twitter](https://twitter.com/WizardLM_AI)

 [\[WizardLM\]](https://arxiv.org/abs/2304.12244)   [\[WizardCoder\]](https://arxiv.org/abs/2306.08568)   [\[WizardMath\]](https://arxiv.org/abs/2308.09583)  

 Join our [Discord](https://discord.gg/VZjjHtWrKs)

News
-------------

\[12/19/2023\]  We released **WizardMath-7B-V1.1**, trained from Mistral-7B. It is the **SOTA 7B math LLM**, achieving **83.2 pass@1** on GSM8k and **33.0 pass@1** on MATH.

\[12/19/2023\]  **WizardMath-7B-V1.1** outperforms **ChatGPT 3.5**, **Gemini Pro**, **Mixtral MOE**, and **Claude Instant** on GSM8K pass@1.

\[12/19/2023\]  **WizardMath-7B-V1.1** is comparable with **ChatGPT 3.5** and **Gemini Pro**, and surpasses **Mixtral MOE**, on MATH pass@1.

| Model | Checkpoint | Paper | GSM8k | MATH |
| --- | --- | --- | --- | --- |
| **WizardMath-7B-V1.1** | [HF Link](https://huggingface.co/WizardLM/WizardMath-7B-V1.1) | [\[WizardMath\]](https://arxiv.org/abs/2308.09583) | **83.2** | **33.0** |
| WizardMath-70B-V1.0 | [HF Link](https://huggingface.co/WizardLM/WizardMath-70B-V1.0) | [\[WizardMath\]](https://arxiv.org/abs/2308.09583) | **81.6** | **22.7** |
| WizardMath-13B-V1.0 | [HF Link](https://huggingface.co/WizardLM/WizardMath-13B-V1.0) | [\[WizardMath\]](https://arxiv.org/abs/2308.09583) | **63.9** | **14.0** |
| WizardMath-7B-V1.0 | [HF Link](https://huggingface.co/WizardLM/WizardMath-7B-V1.0) | [\[WizardMath\]](https://arxiv.org/abs/2308.09583) | **54.9** | **10.7** |
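
The GSM8k and MATH scores reported above are pass@1 percentages. As a rough illustration, pass@1 can be read as the share of problems answered correctly when the model gets a single attempt per problem. The sketch below assumes answers are compared as exact strings after trimming whitespace; the function name and the matching rule are illustrative assumptions, not the evaluation code behind these numbers.

```python
from typing import Sequence


def pass_at_1(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of problems whose single predicted answer matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one problem is required")
    # Assumption: an answer counts as correct only on an exact string match.
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)
```

For example, three correct answers out of four problems give `pass_at_1(["18", "5", "7", "3"], ["18", "5", "8", "3"]) == 75.0`.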

\[12/19/2023\] Comparing WizardMath-7B-V1.1 with other open-source 7B-size math LLMs
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

| Model | GSM8k Pass@1 | MATH Pass@1 |
| --- | --- | --- |
| MPT-7B | 6.8 | 3.0 |
| Llama 1-7B | 11.0 | 2.9 |
| Llama 2-7B | 12.3 | 2.8 |
| Yi-6b | 32.6 | 5.8 |
| Mistral-7B | 37.8 | 9.1 |
| Qwen-7b | 47.8 | 9.3 |
| RFT-7B | 50.3 | \-- |
| MAmmoTH-7B (COT) | 50.5 | 10.4 |
| WizardMath-7B-V1.0 | 54.9 | 10.7 |
| Abel-7B-001 | 59.7 | 13 |
| MetaMath-7B | 66.5 | 19.8 |
| Arithmo-Mistral-7B | 74.7 | 25.3 |
| MetaMath-Mistral-7B | 77.7 | 28.2 |
| Abel-7B-002 | 80.4 | 29.5 |
| **WizardMath-7B-V1.1** | **83.2** | **33.0** |

\[12/19/2023\] Comparing WizardMath-7B-V1.1 with large open-source (30B~70B) LLMs
--------------------------------------------------------------------------------------------------------------------------------------------------------------

| Model | GSM8k Pass@1 | MATH Pass@1 |
| --- | --- | --- |
| Llemma-34B | 51.5 | 25.0 |
| Minerva-62B | 52.4 | 27.6 |
| Llama 2-70B | 56.8 | 13.5 |
| DeepSeek 67B | 63.4 | \-- |
| Grok 33B | 62.9 | 23.9 |
| MAmmoTH-70B | 72.4 | 21.1 |
| Yi-34B | 67.9 | 15.9 |
| Mixtral 8x7B | 74.4 | 28.4 |
| MetaMath-70B | 82.3 | 26.6 |
| **WizardMath-7B-V1.1** | **83.2** | **33.0** |

Data Contamination Check:
----------------------------------------------------------

Before model training, we carefully and rigorously checked all the training data, and used multiple deduplication methods to verify and prevent data leakage from the GSM8k and MATH test sets.
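
The deduplication methods themselves are not described here. As one common illustration of what such a check can look like, the sketch below flags training examples that share a long word n-gram with any test example. The function names, the normalization, and the 8-word window are all assumptions made for this example, not the procedure actually used for WizardMath.

```python
import re
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams of a lowercased, punctuation-stripped text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def find_contaminated(train: Iterable[str], test: Iterable[str], n: int = 8) -> List[str]:
    """Return training examples that share at least one word n-gram with any test example."""
    # Collect every n-gram seen in the test set once, then probe each training example.
    test_grams: Set[Tuple[str, ...]] = set()
    for item in test:
        test_grams |= ngrams(item, n)
    return [item for item in train if ngrams(item, n) & test_grams]
```

Texts shorter than `n` words produce no n-grams and are never flagged, so a check like this would need a separate exact-match pass for very short problems.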

| Model | Checkpoint | Paper | MT-Bench | AlpacaEval | GSM8k | HumanEval | License |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **WizardLM-70B-V1.0** | [HF Link](https://huggingface.co/WizardLM/WizardLM-70B-V1.0) | **Coming Soon** | **7.78** | **92.91%** | **77.6%** | **50.6 pass@1** | [Llama 2 License](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) |
| WizardLM-13B-V1.2 | [HF Link](https://huggingface.co/WizardLM/WizardLM-13B-V1.2) | | 7.06 | 89.17% | 55.3% | 36.6 pass@1 | [Llama 2 License](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) |
| WizardLM-13B-V1.1 | [HF Link](https://huggingface.co/WizardLM/WizardLM-13B-V1.1) | | 6.76 | 86.32% | | 25.0 pass@1 | Non-commercial |
| WizardLM-30B-V1.0 | [HF Link](https://huggingface.co/WizardLM/WizardLM-30B-V1.0) | | 7.01 | | | 37.8 pass@1 | Non-commercial |
| WizardLM-13B-V1.0 | [HF Link](https://huggingface.co/WizardLM/WizardLM-13B-V1.0) | | 6.35 | 75.31% | | 24.0 pass@1 | Non-commercial |
| WizardLM-7B-V1.0 | [HF Link](https://huggingface.co/WizardLM/WizardLM-7B-V1.0) | [\[WizardLM\]](https://arxiv.org/abs/2304.12244) | | | | 19.1 pass@1 | Non-commercial |

| Model | Checkpoint | Paper | HumanEval | MBPP | Demo | License |
| --- | --- | --- | --- | --- | --- | --- |
| WizardCoder-Python-34B-V1.0 | [HF Link](https://huggingface.co/WizardLM/WizardCoder-Python-34B-V1.0) | [\[WizardCoder\]](https://arxiv.org/abs/2306.08568) | 73.2 | 61.2 | [Demo](http://47.103.63.15:50085/) | [Llama2](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) |
| WizardCoder-15B-V1.0 | [HF Link](https://huggingface.co/WizardLM/WizardCoder-15B-V1.0) | [\[WizardCoder\]](https://arxiv.org/abs/2306.08568) | 59.8 | 50.6 | \-- | [OpenRAIL-M](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement) |
| WizardCoder-Python-13B-V1.0 | [HF Link](https://huggingface.co/WizardLM/WizardCoder-Python-13B-V1.0) | [\[WizardCoder\]](https://arxiv.org/abs/2306.08568) | 64.0 | 55.6 | \-- | [Llama2](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) |
| WizardCoder-Python-7B-V1.0 | [HF Link](https://huggingface.co/WizardLM/WizardCoder-Python-7B-V1.0) | [\[WizardCoder\]](https://arxiv.org/abs/2306.08568) | 55.5 | 51.6 | [Demo](http://47.103.63.15:50088/) | [Llama2](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) |
| WizardCoder-3B-V1.0 | [HF Link](https://huggingface.co/WizardLM/WizardCoder-3B-V1.0) | [\[WizardCoder\]](https://arxiv.org/abs/2306.08568) | 34.8 | 37.4 | \-- | [OpenRAIL-M](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement) |
| WizardCoder-1B-V1.0 | [HF Link](https://huggingface.co/WizardLM/WizardCoder-1B-V1.0) | [\[WizardCoder\]](https://arxiv.org/abs/2306.08568) | 23.8 | 28.6 | \-- | [OpenRAIL-M](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement) |

**Github Repo**: [https://github.com/nlpxucan/WizardLM/tree/main/WizardMath](https://github.com/nlpxucan/WizardLM/tree/main/WizardMath)

**Twitter**: [https://twitter.com/WizardLM\_AI/status/1689998428200112128](https://twitter.com/WizardLM_AI/status/1689998428200112128)

**Discord**: [https://discord.gg/VZjjHtWrKs](https://discord.gg/VZjjHtWrKs)

Comparing WizardMath-V1.0 with Other LLMs
---------------------------------------------------------------------------------------

The following figure shows that **WizardMath-70B-V1.0 ranks fifth on the GSM8k benchmark**, surpassing ChatGPT (81.6 vs. 80.8), Claude Instant (81.6 vs. 80.9), and PaLM 2 540B (81.6 vs. 80.7).

![WizardMath](https://raw.githubusercontent.com/nlpxucan/WizardLM/main/WizardMath/images/wizardmath_gsm8k.png)

**Note on system prompt usage:**

Please use **exactly the same system prompts** as we do. We do not guarantee the accuracy of **quantized versions**.

**Default version:**

    "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:"
    

**CoT Version:** For **simple** math questions, we do NOT recommend using the CoT prompt.

    "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response: Let's think step by step."
    
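
Since the prompt has to match exactly, it can help to build it from a single template rather than retyping it. The following is a minimal sketch: the two template strings are copied from the versions above, while the `build_prompt` helper and its `cot` flag are hypothetical names introduced for illustration.

```python
# Templates copied from the default and CoT versions above.
DEFAULT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)
COT_TEMPLATE = DEFAULT_TEMPLATE + " Let's think step by step."


def build_prompt(instruction: str, cot: bool = False) -> str:
    """Fill the default or the CoT template with a single instruction."""
    template = COT_TEMPLATE if cot else DEFAULT_TEMPLATE
    return template.format(instruction=instruction.strip())
```

Calling `build_prompt("What is 12 * 7?")` yields the default prompt; passing `cot=True` appends the step-by-step cue, which the note above advises against for simple questions.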

Inference WizardMath Demo Script
---------------------------------------------------------------------

We provide the WizardMath inference demo code [here](https://github.com/nlpxucan/WizardLM/tree/main/demo).
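
For readers who want a quick local test, here is a minimal sketch of inference with the Hugging Face `transformers` library. It is not the linked demo script and may differ from it; the helper names, the greedy decoding, the `float16` dtype, and the 512-token limit are assumptions chosen for this example. `device_map="auto"` additionally requires the `accelerate` package.

```python
from typing import Callable

# Default prompt from this README.
PROMPT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def make_hf_generate_fn(model_id: str, max_new_tokens: int = 512) -> Callable[[str], str]:
    """Load a checkpoint with transformers and return a prompt -> completion function."""
    # Imported lazily so the rest of this module works without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    def generate(prompt: str) -> str:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            # Assumption: greedy decoding (do_sample=False) for reproducible answers.
            output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        # Drop the prompt tokens so only the newly generated text is decoded.
        new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
        return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

    return generate


def answer(instruction: str, generate_fn: Callable[[str], str]) -> str:
    """Wrap the instruction in the default prompt and pass it to any text generator."""
    return generate_fn(PROMPT.format(instruction=instruction))
```

A typical call would be `answer("James has 3 boxes of 12 pencils. How many pencils does he have?", make_hf_generate_fn("WizardLM/WizardMath-7B-V1.1"))`. Keeping the generator as a plain callable makes it easy to swap in a different backend or a stub for testing.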

**On the common concern about the dataset:**

Recently, there have been clear changes in the open-source policies and regulations covering our organization's code, data, and models. Despite this, we worked hard to get the model weights released first. The data is subject to stricter auditing and is under review by our legal team. Our researchers have no authority to release it publicly without authorization. Thank you for your understanding.

Citation
---------------------

Please cite the repo if you use the data, method or code in this repo.

    @article{luo2023wizardmath,
      title={WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct},
      author={Luo, Haipeng and Sun, Qingfeng and Xu, Can and Zhao, Pu and Lou, Jianguang and Tao, Chongyang and Geng, Xiubo and Lin, Qingwei and Chen, Shifeng and Zhang, Dongmei},
      journal={arXiv preprint arXiv:2308.09583},
      year={2023}
    }

## Model overview

The `WizardMath-70B-V1.0` model is a large language model developed by the WizardLM team that is focused on empowering mathematical reasoning capabilities. It was trained using a novel method called Reinforced Evol-Instruct (RLEIF), which involves automatically generating a diverse set of math-related instructions to fine-tune the model.

The model is one of several in the WizardMath series, which also includes smaller 13B and 7B versions. WizardMath-70B-V1.0 achieves 81.6 pass@1 on GSM8k and 22.7 pass@1 on MATH. On GSM8k, this puts it slightly ahead of ChatGPT 3.5 (80.8), Claude Instant (80.9), and PaLM 2 540B (80.7). The newer WizardMath-7B-V1.1 scores higher still, at 83.2 on GSM8k and 33.0 on MATH.

## Model inputs and outputs

### Inputs
- **Natural language instructions**: The model takes in text-based instructions or prompts related to math problems or reasoning tasks.

### Outputs
- **Textual responses**: The model generates text-based responses that attempt to solve the given math problem or provide a reasoned answer.
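
Because the output is free text, downstream code that needs a numeric result has to parse it. The sketch below is one simple heuristic, assuming the final answer is the last number that appears in the response. That assumption and the `extract_final_answer` name are illustrative only, and the output format is not specified in this document.

```python
import re
from typing import Optional


def extract_final_answer(response: str) -> Optional[str]:
    """Return the last number in the response as a string, or None if there is none."""
    # Remove thousands separators so "1,250" is read as a single number.
    cleaned = re.sub(r"(?<=\d),(?=\d{3})", "", response)
    numbers = re.findall(r"-?\d+(?:\.\d+)?", cleaned)
    return numbers[-1] if numbers else None
```

A heuristic like this can misfire when the response keeps talking after stating the result, so it is worth spot-checking parsed answers against the full text.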

## Capabilities

The `WizardMath-70B-V1.0` model targets mathematical reasoning and problem-solving, producing step-by-step solutions that show its work. Its reported scores suggest it is strongest on grade-school word problems (81.6 pass@1 on GSM8k) and much less reliable on competition-level problems (22.7 pass@1 on MATH), so answers to harder algebra, geometry, or calculus questions should be checked.

## What can I use it for?

The `WizardMath-70B-V1.0` model could be useful for a variety of applications that require advanced mathematical skills, such as:

- Providing homework help and tutoring for students struggling with math
- Automating the generation of math practice problems and solutions
- Integrating math reasoning capabilities into educational apps and games
- Aiding in the development of math-focused AI assistants

## Things to try

One interesting aspect of the `WizardMath-70B-V1.0` model is its ability to handle multi-step math problems. Try providing it with complex word problems or story-based math questions and see how it breaks down the problem and arrives at the solution. You can also experiment with prompting the model to explain its reasoning in more detail or to explore alternative solution approaches.