# Chinchilla Scaling: A replication attempt

2404.10102

## Abstract

Hoffmann et al. (2022) propose three methods for estimating a compute-optimal scaling law. We attempt to replicate their third estimation procedure, which involves fitting a parametric loss function to a reconstruction of data from their plots. We find that the reported estimates are inconsistent with their first two estimation methods, fail at fitting the extracted data, and report implausibly narrow confidence intervals--intervals this narrow would require over 600,000 experiments, while they likely only ran fewer than 500. In contrast, our rederivation of the scaling law using the third approach yields results that are compatible with the findings from the first two estimation procedures described by Hoffmann et al.
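The sample-size argument in the abstract can be illustrated with a short calculation. The sketch below assumes that confidence-interval width shrinks like 1/sqrt(n), which is the usual behaviour for well-behaved estimators but is an assumption here, not a statement of the authors' exact method. The function name `width_ratio` is hypothetical; the two counts are the ones quoted in the abstract.

```python
import math


def width_ratio(n_required, n_available):
    """How many times narrower intervals from n_required samples are than
    intervals from n_available samples, ASSUMING width scales as 1/sqrt(n)."""
    return math.sqrt(n_required / n_available)


# Counts quoted in the abstract: over 600,000 needed, fewer than 500 likely run.
ratio = width_ratio(600_000, 500)
print(f"Implied narrowing factor: about {ratio:.1f}x")
```

Under that assumption, the reported intervals would be roughly 35 times narrower than a few hundred experiments could plausibly support. Because the abstract says "fewer than 500", this factor is best read as a lower bound.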

## Overview

- This paper is a replication attempt of the "Chinchilla" scaling analysis from "Training Compute-Optimal Large Language Models" by Hoffmann et al. (2022).
- The authors reconstruct data from the plots in the original paper (Figure 4) and re-fit the parametric loss function used in its third estimation procedure ("Approach 3").
- They report that the originally published Approach 3 estimates fit the reconstructed data poorly and carry implausibly narrow confidence intervals, while their own re-fit is compatible with the original paper's first two estimation procedures.

## Plain English Explanation

The paper focuses on replicating part of an earlier study that established the "Chinchilla" scaling law. That law describes how the loss of a large language model falls as the model is given more parameters and more training data, and therefore how a fixed compute budget is best split between the two. The researchers wanted to see whether they could reproduce one of the earlier study's estimates by extracting data from one of its figures and repeating the analysis.

Replicating previous research is important in science to verify the reliability and consistency of the results. If the authors of this paper are able to closely match the findings from the original study, it would lend more credibility to the Chinchilla Scaling phenomenon and suggest that it is a robust relationship that holds true across different experiments and datasets. On the other hand, if they struggle to replicate the results, it could indicate issues with the original study or limitations in the generalizability of the Chinchilla Scaling observations.

## Technical Explanation

The paper begins by extracting data points from Figure 4 of "Training Compute-Optimal Large Language Models" by Hoffmann et al. (2022). This figure relates model loss to parameter count and training data size for large language models.

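The authors then attempt to replicate "Approach 3" from the Hoffmann et al. paper, which fits a parametric loss function to the data. The function has the form $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$, where $N$ is the parameter count, $D$ is the number of training tokens, $E$ is an irreducible loss term, and the two power-law terms capture the penalty for a finite model and finite data. The fitted parameters determine how a compute budget should be divided between model size and training data, which is the core of the Chinchilla scaling law.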
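To make the idea of fitting a parametric loss function concrete, here is a minimal sketch that fits $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$ to synthetic data. It is not the procedure used in either paper: it swaps in ordinary least squares for $(E, A, B)$ combined with a grid search over the exponents, and it runs on noiseless data generated from illustrative parameter values that are not taken from Hoffmann et al. or from the replication. All function names are hypothetical.

```python
def parametric_loss(n_params, n_tokens, E, A, B, alpha, beta):
    """Parametric form L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params ** alpha + B / n_tokens ** beta


def solve_linear(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_linear_part(data, alpha, beta):
    """With the exponents fixed, the model is linear in (E, A, B).
    Returns (sum of squared errors, E, A, B)."""
    feats = [(1.0, n ** -alpha, d ** -beta) for n, d, _ in data]
    # Scale each column to a maximum of 1 so the normal equations are well conditioned.
    scale = [max(f[j] for f in feats) for j in range(3)]
    feats = [[f[j] / scale[j] for j in range(3)] for f in feats]
    targets = [loss for _, _, loss in data]
    gram = [[sum(f[i] * f[j] for f in feats) for j in range(3)] for i in range(3)]
    moment = [sum(f[i] * y for f, y in zip(feats, targets)) for i in range(3)]
    coef = solve_linear(gram, moment)
    sse = sum((sum(c * x for c, x in zip(coef, f)) - y) ** 2
              for f, y in zip(feats, targets))
    E, A, B = (coef[j] / scale[j] for j in range(3))  # undo the column scaling
    return sse, E, A, B


def fit_parametric(data, exponent_grid):
    """Grid search over (alpha, beta); least squares for (E, A, B) at each point."""
    best = None
    for alpha in exponent_grid:
        for beta in exponent_grid:
            sse, E, A, B = fit_linear_part(data, alpha, beta)
            if best is None or sse < best[0]:
                best = (sse, E, A, B, alpha, beta)
    return best


# Synthetic, noiseless data from ILLUSTRATIVE parameters (not values from either paper).
TRUE = dict(E=1.8, A=400.0, B=1500.0, alpha=0.34, beta=0.30)
model_sizes = [1e8, 3e8, 1e9, 3e9, 1e10]
token_counts = [2e9, 1e10, 5e10, 2e11]
data = [(n, d, parametric_loss(n, d, **TRUE))
        for n in model_sizes for d in token_counts]

exponent_grid = [round(0.20 + 0.02 * i, 2) for i in range(16)]  # 0.20 to 0.50
sse, E_hat, A_hat, B_hat, alpha_hat, beta_hat = fit_parametric(data, exponent_grid)
print(alpha_hat, beta_hat, round(E_hat, 3), round(A_hat, 1), round(B_hat, 1))
```

On real, noisy data reconstructed from a figure, the choice of objective and optimizer matters far more than it does here, and a grid this coarse would only be a starting point for a finer search.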

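The re-fit is then compared with the estimates reported in the original paper. According to the abstract, the reported Approach 3 estimates are inconsistent with the first two estimation methods, fail to fit the extracted data, and carry implausibly narrow confidence intervals: intervals that narrow would require over 600,000 experiments, whereas the original authors likely ran fewer than 500. The authors' own rederivation using the same approach yields a scaling law that is compatible with the first two methods.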
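Comparing an Approach 3 fit with the other approaches means turning the fitted parameters into a compute-optimal allocation. The sketch below does this in closed form, assuming the common approximation that training compute is about $6ND$ FLOPs and minimising $A/N^{\alpha} + B/D^{\beta}$ under that constraint. The parameter values are illustrative placeholders, not estimates from either paper, and the function names are hypothetical.

```python
def compute_optimal(compute_flops, A, B, alpha, beta):
    """Minimise A/N**alpha + B/D**beta subject to compute_flops = 6 * N * D.
    Returns (N_opt, D_opt, a, b) where N_opt ~ C**a and D_opt ~ C**b."""
    a = beta / (alpha + beta)   # exponent of optimal parameter count in compute
    b = alpha / (alpha + beta)  # exponent of optimal token count in compute
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    n_opt = G * (compute_flops / 6.0) ** a
    d_opt = (compute_flops / 6.0) ** b / G
    return n_opt, d_opt, a, b


def loss_on_budget(n_params, compute_flops, E, A, B, alpha, beta):
    """Loss when the whole budget is spent on a model of size n_params."""
    n_tokens = compute_flops / (6.0 * n_params)
    return E + A / n_params ** alpha + B / n_tokens ** beta


# ILLUSTRATIVE parameters only (not values from either paper).
params = dict(E=1.8, A=400.0, B=1500.0, alpha=0.34, beta=0.30)
budget = 1e21  # hypothetical FLOP budget

n_opt, d_opt, a, b = compute_optimal(
    budget, params["A"], params["B"], params["alpha"], params["beta"]
)
print(f"a = {a:.3f}, b = {b:.3f}")
print(f"N_opt = {n_opt:.3e} parameters, D_opt = {d_opt:.3e} tokens")
```

Exponents `a` and `b` close to 0.5 mean parameters and tokens should be scaled up at roughly the same rate. Since `a` depends only on the ratio of the two fitted exponents, small changes in $\alpha$ and $\beta$ can shift the implied allocation noticeably.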

## Critical Analysis

The paper acknowledges several limitations and caveats in its replication attempt. For example, the authors note that they were not able to perfectly reproduce the data points from the original figure, which may have introduced some error into their analysis. Additionally, the replication was limited to a single "Approach" from the Hoffmann et al. paper, and the authors suggest that further replication efforts across the different approaches would be valuable.

Another limitation is that the replication works from data reconstructed from the original study's plots rather than from independently trained models. It can therefore check whether the reported fit is consistent with the original data, but it cannot show how far the Chinchilla scaling observations generalize beyond the specific context of that research.

The authors maintain an objective and respectful tone throughout the critical analysis, acknowledging the importance of the original work and the challenges inherent in replication efforts. They encourage readers to thoughtfully consider the findings and limitations, and to continue exploring the reliability and generalizability of the Chinchilla Scaling phenomenon.

## Conclusion

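This paper revisits the Chinchilla scaling analysis presented in Hoffmann et al. (2022). The authors could not reproduce the originally reported Approach 3 estimates, which fit the reconstructed data poorly and come with implausibly narrow confidence intervals. Their own re-fit using the same approach, however, agrees with the original paper's first two estimation procedures, which suggests that the overall Chinchilla scaling relationship holds even though one of its published estimates does not.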

However, the authors also identify several limitations and areas for further research, highlighting the importance of rigorously validating and generalizing important findings in the field of AI and machine learning. Continued efforts to replicate and extend this work will help to solidify our understanding of the fundamental scaling principles that govern the performance of large-scale models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

## Related Papers

### Unraveling the Mystery of Scaling Laws: Part I

Hui Su, Zhi Tian, Xiaoyu Shen, Xunliang Cai

Scaling law principles indicate a power-law correlation between loss and variables such as model size, dataset size, and computational resources utilized during training. These principles play a vital role in optimizing various aspects of model pre-training, ultimately contributing to the success of large language models such as GPT-4, Llama and Gemini. However, the original scaling law paper by OpenAI did not disclose the complete details necessary to derive the precise scaling law formulas, and their conclusions are only based on models containing up to 1.5 billion parameters. Though some subsequent works attempt to unveil these details and scale to larger models, they often neglect the training dependency of important factors such as the learning rate, context length and batch size, leading to their failure to establish a reliable formula for predicting the test loss trajectory. In this technical report, we confirm that the scaling law formulations proposed in the original OpenAI paper remain valid when scaling the model size up to 33 billion, but the constant coefficients in these formulas vary significantly with the experiment setup. We meticulously identify influential factors and provide transparent, step-by-step instructions to estimate all constant terms in scaling-law formulas by training on models with only 1M~60M parameters. Using these estimated formulas, we showcase the capability to accurately predict various attributes for models with up to 33B parameters before their training, including (1) the minimum possible test loss; (2) the minimum required training steps and processed tokens to achieve a specific loss; (3) the critical batch size with an optimal time/computation trade-off at any loss value; and (4) the complete test loss trajectory with arbitrary batch size.

4/8/2024

### Language models scale reliably with over-training and on downstream tasks

Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Luca Soldaini, Alexandros G. Dimakis, Gabriel Ilharco, Pang Wei Koh, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, Ludwig Schmidt

Scaling laws are useful guides for derisking expensive training runs, as they predict performance of large models using cheaper, small-scale experiments. However, there remain gaps between current scaling studies and how language models are ultimately trained and evaluated. For instance, scaling is usually studied in the compute-optimal training regime (i.e., Chinchilla optimal regime). In contrast, models are often over-trained to reduce inference costs. Moreover, scaling laws mostly predict loss on next-token prediction, but models are usually compared on downstream task performance. To address both shortcomings, we create a testbed of 104 models with 0.011B to 6.9B parameters trained with various numbers of tokens on three data distributions. First, we fit scaling laws that extrapolate in both the amount of over-training and the number of model parameters. This enables us to predict the validation loss of a 1.4B parameter, 900B token run (i.e., 32$\times$ over-trained) and a 6.9B parameter, 138B token run (i.e., a compute-optimal run), each from experiments that take 300$\times$ less compute. Second, we relate the perplexity of a language model to its downstream task performance by proposing a power law. We use this law to predict top-1 error averaged over downstream tasks for the two aforementioned models, using experiments that take 20$\times$ less compute. Our experiments are available at https://github.com/mlfoundations/scaling.

6/18/2024

### 4+3 Phases of Compute-Optimal Neural Scaling Laws

Elliot Paquette, Courtney Paquette, Lechao Xiao, Jeffrey Pennington

We consider the three parameter solvable neural scaling model introduced by Maloney, Roberts, and Sully. The model has three parameters: data complexity, target complexity, and model-parameter-count. We use this neural scaling model to derive new predictions about the compute-limited, infinite-data scaling law regime. To train the neural scaling model, we run one-pass stochastic gradient descent on a mean-squared loss. We derive a representation of the loss curves which holds over all iteration counts and improves in accuracy as the model parameter count grows. We then analyze the compute-optimal model-parameter-count, and identify 4 phases (+3 subphases) in the data-complexity/target-complexity phase-plane. The phase boundaries are determined by the relative importance of model capacity, optimizer noise, and embedding of the features. We furthermore derive, with mathematical proof and extensive numerical evidence, the scaling-law exponents in all of these phases, in particular computing the optimal model-parameter-count as a function of floating point operation budget.

5/27/2024

### Scaling Laws in Linear Regression: Compute, Parameters, and Data

Licong Lin, Jingfeng Wu, Sham M. Kakade, Peter L. Bartlett, Jason D. Lee

Empirically, large-scale deep learning models often satisfy a neural scaling law: the test error of the trained model improves polynomially as the model size and data size grow. However, conventional wisdom suggests the test error consists of approximation, bias, and variance errors, where the variance error increases with model size. This disagrees with the general form of neural scaling laws, which predict that increasing model size monotonically improves performance. We study the theory of scaling laws in an infinite dimensional linear regression setup. Specifically, we consider a model with $M$ parameters as a linear function of sketched covariates. The model is trained by one-pass stochastic gradient descent (SGD) using $N$ data. Assuming the optimal parameter satisfies a Gaussian prior and the data covariance matrix has a power-law spectrum of degree $a>1$, we show that the reducible part of the test error is $\Theta(M^{-(a-1)} + N^{-(a-1)/a})$. The variance error, which increases with $M$, is dominated by the other errors due to the implicit regularization of SGD, thus disappearing from the bound. Our theory is consistent with the empirical neural scaling laws and verified by numerical simulation.

6/13/2024