The burgeoning interest in developing Large Language Models (LLMs) with up to trillions of parameters has been met with concerns regarding resource efficiency and practical expense, particularly given the immense cost of experimentation. This scenario underscores the importance of exploring the potential of Small Language Models (SLMs) as a resource-efficient alternative. In this context, we introduce MiniCPM, specifically the 1.2B and 2.4B non-embedding parameter variants, which not only excel in their respective categories but also demonstrate capabilities on par with 7B-13B LLMs. While focusing on SLMs, our approach exhibits scalability in both model and data dimensions for future LLM research. Regarding model scaling, we employ extensive model wind tunnel experiments for stable and optimal scaling. For data scaling, we introduce a Warmup-Stable-Decay (WSD) learning rate scheduler (LRS), which is conducive to continuous training and domain adaptation. We present an in-depth analysis of the intriguing training dynamics that occur under the WSD LRS. With the WSD LRS, we can efficiently study the data-model scaling law without extensive retraining experiments on both the model and data axes, from which we derive a much higher compute-optimal data-to-model ratio than Chinchilla Optimal. Additionally, we introduce the MiniCPM family, including MiniCPM-DPO, MiniCPM-MoE and MiniCPM-128K, whose excellent performance further cements MiniCPM's foundation in diverse SLM applications. MiniCPM models are publicly available at https://github.com/OpenBMB/MiniCPM.

## Overview

- This paper introduces MiniCPM, a family of small language models (1.2B and 2.4B non-embedding parameters), together with a scalable recipe for training them.
- The researchers developed scalable training strategies to efficiently train compact models without compromising performance.
- MiniCPM models demonstrate strong results on a variety of benchmarks, showcasing the viability of small, cost-effective language models.

## Plain English Explanation

The researchers behind this paper have developed a family of small language models, called MiniCPM, along with a recipe for training them. Language models are artificial intelligence systems that can understand and generate human-like text. The most capable ones are typically very large and expensive to train, which limits their accessibility.

The goal of this work was to show that small, compact language models can still perform well if trained effectively. The researchers developed special training strategies to efficiently train these smaller models without sacrificing their capabilities. Through extensive experiments, they demonstrated that MiniCPM models can achieve strong results on a range of benchmarks, rivaling the performance of much larger and more resource-intensive models.

This is an important advancement because it opens the door for more affordable and accessible language AI systems. Small models require less computing power and are cheaper to develop, allowing a wider range of organizations and individuals to take advantage of this technology. By unleashing the potential of small language models, this research could enable new applications and wider adoption of natural language AI.

## Technical Explanation

The core contribution of this paper is a scalable recipe for training small language models, applied to the 1.2B and 2.4B MiniCPM models. According to the abstract, the recipe has two main components. For model scaling, the researchers run extensive "model wind tunnel" experiments to find stable and optimal training settings. For data scaling, they introduce a Warmup-Stable-Decay (WSD) learning rate scheduler, which is conducive to continuous training and domain adaptation.

The team evaluated the resulting models on a diverse set of language understanding and generation benchmarks. The abstract reports that the MiniCPM models excel in their own size categories and reach capabilities on par with 7B-13B LLMs.
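The Warmup-Stable-Decay scheduler named in the abstract can be illustrated with a minimal sketch. The three-phase shape (warmup, constant plateau, final decay) follows from the name. The function name `wsd_lr`, its parameters, the linear warmup, and the exponential decay shape are all assumptions made for illustration and may differ from the exact form used in the paper.

```python
def wsd_lr(step, max_lr, warmup_steps, stable_steps, decay_steps, min_lr_ratio=0.1):
    """Learning rate at `step` under a Warmup-Stable-Decay schedule.

    Illustrative sketch only: the warmup and decay shapes are assumptions,
    not the paper's exact definition.
    """
    if decay_steps <= 0:
        raise ValueError("decay_steps must be positive")

    # Phase 1: linear warmup up to max_lr.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps

    # Phase 2: hold the learning rate constant at max_lr.
    if step < warmup_steps + stable_steps:
        return max_lr

    # Phase 3: decay from max_lr towards max_lr * min_lr_ratio.
    # Exponential interpolation is assumed here; progress is clamped to 1.
    progress = min(1.0, (step - warmup_steps - stable_steps) / decay_steps)
    return max_lr * (min_lr_ratio ** progress)


if __name__ == "__main__":
    # Print the learning rate at a few points of a toy 130-step schedule.
    for s in (0, 9, 50, 110, 120, 130):
        print(s, wsd_lr(s, max_lr=0.01, warmup_steps=10, stable_steps=100, decay_steps=20))
```

Because the learning rate is constant during the stable phase, a checkpoint taken there can in principle be resumed with more data or branched into a short decay phase, which is one way to read the abstract's remark that the scheduler suits continuous training.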

Notably, the WSD scheduler lets the researchers study the data-model scaling law without retraining from scratch for every combination of model size and data size. From this analysis they derive a compute-optimal data-to-model ratio that is much higher than the Chinchilla-optimal one, which suggests that small models can benefit from considerably more training data than earlier scaling laws recommend.
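The abstract's comparison with Chinchilla Optimal can be made concrete with a small calculation. The sketch below assumes the commonly used approximation that training compute is about 6 × parameters × tokens, and the often-cited rule of thumb of roughly 20 tokens per parameter for Chinchilla. The function name and the higher ratio of 100 are hypothetical placeholders, not figures taken from the paper.

```python
import math


def compute_optimal_split(flops, tokens_per_param):
    """Split a compute budget into (parameters, tokens) for a fixed data-to-model ratio.

    Assumes flops ~= 6 * n_params * n_tokens and n_tokens = tokens_per_param * n_params.
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    budget = 1.2e20  # example compute budget in FLOPs (arbitrary)
    # 20 is the common Chinchilla rule of thumb; 100 is a hypothetical higher ratio.
    for ratio in (20.0, 100.0):
        n, d = compute_optimal_split(budget, ratio)
        print(f"ratio={ratio:>5}: params={n:.3e}, tokens={d:.3e}")
```

For a fixed budget, a higher data-to-model ratio shifts the optimum towards a smaller model trained on more tokens, which is the direction the abstract describes.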

The paper also investigates the role of model depth and width, demonstrating that depth is a more critical factor than width for achieving high performance in compact language models. This provides valuable insights for designing efficient model architectures.

## Critical Analysis

The researchers acknowledge several limitations and areas for future work. For example, they note that MiniCPM models may struggle with tasks that require extensive world knowledge or reasoning abilities, as their compact nature inherently limits the information they can store.

Additionally, the paper does not explore the performance of MiniCPM models on real-world applications, such as dialogue systems or content generation. Further research is needed to understand how these small models would fare in practical, end-to-end deployments.

Another potential concern is the environmental impact of training numerous small models, as the cumulative energy consumption could still be significant. The paper does not address the carbon footprint or sustainability implications of this approach.

Despite these caveats, the MiniCPM framework represents an important step forward in making language AI more accessible and scalable. By unlocking the potential of small models, this work paves the way for more affordable and widespread adoption of natural language processing technologies.

## Conclusion

This paper introduces MiniCPM, a family of small language models that can rival the performance of much larger and more resource-intensive systems. By combining model wind tunnel experiments with the WSD learning rate scheduler, the researchers were able to get more capability out of compact model architectures, opening up new possibilities for cost-effective and accessible natural language AI.

The strong results demonstrated on a range of benchmarks suggest that MiniCPM could enable a new generation of language models that are more widely deployable and impactful. As the field of natural language processing continues to evolve, this research represents an important contribution towards making advanced language technologies more attainable and scalable.