# More Compute Is What You Need

## Overview

- Researchers propose a new scaling law that suggests model performance depends mostly on the amount of compute spent, rather than the specific allocation to model size and dataset size.
- This scaling law suggests that for inference efficiency, training should prioritize smaller model sizes and larger training datasets.
- Assuming the exhaustion of available web datasets, the researchers predict that scaling the model size might be the only way to further improve model performance.

## Plain English Explanation

Pre-training large language models has become increasingly expensive. To manage these costs, most practitioners rely on scaling laws, commonly the Compute-Optimal (or Chinchilla-optimal) recipe, to decide how to split a compute budget between model size and the number of training tokens.

The researchers in this paper propose a new scaling law that suggests a different approach. They found that model performance depends mostly on the total amount of computing power used, rather than the specific balance between model size and dataset size.

This means that, when inference efficiency (the cost of running the model in real-world use) matters, the best strategy is to train smaller models on larger datasets. The researchers also predict that, once all the available web data has been used up, the only way to further improve model performance will be to increase the size of the models themselves.

## Technical Explanation

The researchers hypothesized a new scaling law that suggests the performance of transformer-based language models depends primarily on the total amount of compute used during training, rather than the specific allocation between model size and dataset size.
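For orientation, the contrast with the Chinchilla-style formulation can be written out. The first line is the parametric loss used by Hoffmann et al.; the function $g$ and the approximation $C \approx 6ND$ in the second line are assumptions introduced here for illustration and are not formulas taken from this paper.

$$
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
$$

$$
L(N, D) \approx g(C), \qquad C \approx 6\,N\,D
$$

Here $N$ is the parameter count, $D$ the number of training tokens, and $C$ the training compute in FLOPs. Under the hypothesis, any two $(N, D)$ pairs with the same product would be expected to reach similar loss.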

To test this, they trained a series of models with varying combinations of model size and dataset size, while keeping the total compute constant. They found that model performance scaled similarly regardless of the specific allocation, supporting their proposed scaling law.
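What "varying the allocation at constant compute" means can be sketched in a few lines. This is a minimal illustration, assuming the widely used approximation that training costs about 6 FLOPs per parameter per token; the function names and the example budget are hypothetical and do not come from the paper.

```python
def tokens_for_budget(compute_flops: float, n_params: float) -> float:
    """Training tokens affordable at a fixed budget, assuming C ~= 6 * N * D."""
    return compute_flops / (6.0 * n_params)


def iso_compute_allocations(compute_flops: float, param_counts: list) -> list:
    """Return (n_params, n_tokens) pairs that all spend the same training compute."""
    return [(n, tokens_for_budget(compute_flops, n)) for n in param_counts]


if __name__ == "__main__":
    budget = 6e21  # hypothetical training budget in FLOPs
    for n, d in iso_compute_allocations(budget, [1e8, 1e9, 1e10]):
        # Every row costs the same to train; only the model/data split differs.
        print(f"N={n:.0e} params  D={d:.2e} tokens  tokens/param={d / n:.1f}")
```

Each row trades a tenfold larger model for a tenfold smaller dataset. The hypothesis summarized above is that such rows end up with similar performance.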

Based on this finding, the researchers make two key predictions:

- For inference efficiency (the cost of running the model in real-world use), training should prioritize smaller model sizes and larger training datasets.
- Assuming the exhaustion of available web datasets, scaling up the model size might be the only way to further improve model performance, since increasing dataset size would no longer be an option.
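Both predictions can be illustrated with a rough sketch. It assumes the common approximations of about 6 FLOPs per parameter per training token and about 2 FLOPs per parameter per generated token, and it assumes a single pass over the data with no repeated epochs. The function names and the token cap are hypothetical, not taken from the paper.

```python
def inference_flops_per_token(n_params: float) -> float:
    """Approximate forward-pass cost per generated token (assumed ~2 * N)."""
    return 2.0 * n_params


def allocate(compute_flops: float, n_params: float, max_tokens=None):
    """Split a training budget into (n_params, n_tokens), assuming C ~= 6 * N * D.

    If the requested model size would need more tokens than exist
    (max_tokens), cap the tokens and grow the model to spend the budget.
    """
    tokens = compute_flops / (6.0 * n_params)
    if max_tokens is not None and tokens > max_tokens:
        tokens = max_tokens
        n_params = compute_flops / (6.0 * max_tokens)
    return n_params, tokens


if __name__ == "__main__":
    budget = 6e21  # hypothetical training budget in FLOPs
    # Prediction (a): same training compute, but the smaller model is cheaper to serve.
    for n in (1e9, 1e10):
        n_used, d = allocate(budget, n)
        print(f"N={n_used:.0e}  D={d:.1e}  inference FLOPs/token={inference_flops_per_token(n_used):.1e}")
    # Prediction (b): with data capped, extra compute can only go into model size.
    print(allocate(budget, 1e9, max_tokens=1e11))
```

With the cap in place, the requested 1e9-parameter model cannot absorb the budget, so the sketch returns a larger model trained on all available tokens.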

The researchers' proposed scaling law builds on previous work on scaling laws for language models and data filtering/curation.

## Critical Analysis

The researchers acknowledge that their proposed scaling law may not hold for all types of transformer-based models or tasks. The experiments were focused on a particular class of language models, and the conclusions may not generalize to other domains like speech recognition or multimodal applications.

Additionally, the researchers note that their findings assume the availability of high-quality web-scraped datasets. If the remaining available web data is of lower quality or relevance, then scaling the model size may not be as effective as they predict.

Further research is needed to validate the researchers' scaling law across a wider range of model architectures, tasks, and data regimes. It will also be important to understand the underlying mechanisms that lead to their proposed scaling behavior.

## Conclusion

This paper presents a novel scaling law that suggests the performance of transformer-based language models depends primarily on the total compute used during training, rather than the specific allocation between model size and dataset size.

The researchers' findings have important implications for the efficient training of large language models: if performance is set mainly by total training compute, then spending that compute on a smaller model and a larger dataset yields a model that is cheaper to run at inference time. However, the proposed scaling law may have limitations, and additional research is needed to fully understand its applicability and implications for the field of natural language processing.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

## Related Papers

### More Compute Is What You Need

Zhen Guo

Large language model pre-training has become increasingly expensive, with most practitioners relying on scaling laws to allocate compute budgets for model size and training tokens, commonly referred to as Compute-Optimal or Chinchilla Optimal. In this paper, we hypothesize a new scaling law that suggests model performance depends mostly on the amount of compute spent for transformer-based models, independent of the specific allocation to model size and dataset size. Using this unified scaling law, we predict that (a) for inference efficiency, training should prioritize smaller model sizes and larger training datasets, and (b) assuming the exhaustion of available web datasets, scaling the model size might be the only way to further improve model performance.

5/3/2024

### Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws

Nikhil Sardana, Jacob Portes, Sasha Doubov, Jonathan Frankle

Large language model (LLM) scaling laws are empirical formulas that estimate changes in model quality as a result of increasing parameter count and training data. However, these formulas, including the popular Deepmind Chinchilla scaling laws, neglect to include the cost of inference. We modify the Chinchilla scaling laws to calculate the optimal LLM parameter count and pre-training data size to train and deploy a model of a given quality and inference demand. We conduct our analysis both in terms of a compute budget and real-world costs and find that LLM researchers expecting reasonably large inference demand (~1B requests) should train models smaller and longer than Chinchilla-optimal. Furthermore, we train 47 models of varying sizes and parameter counts to validate our formula and find that model quality continues to improve as we scale tokens per parameter to extreme ranges (up to 10,000). Finally, we ablate the procedure used to fit the Chinchilla scaling law coefficients and find that developing scaling laws only from data collected at typical token/parameter ratios overestimates the impact of additional tokens at these extreme ranges.

7/19/2024

### Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations

Georgy Tyukin

Large Language Models are growing in size, and we expect them to continue to do so, as larger models train quicker. However, this increase in size will severely impact inference costs. Therefore model compression is important, to retain the performance of larger models, but with a reduced cost of running them. In this thesis we explore the methods of model compression, and we empirically demonstrate that the simple method of skipping latter attention sublayers in Transformer LLMs is an effective method of model compression, as these layers prove to be redundant, whilst also being incredibly computationally expensive. We observed a 21% speed increase in one-token generation for Llama 2 7B, whilst surprisingly and unexpectedly improving performance over several common benchmarks.

4/10/2024

### Resolving Discrepancies in Compute-Optimal Scaling of Language Models

Tomer Porian, Mitchell Wortsman, Jenia Jitsev, Ludwig Schmidt, Yair Carmon

Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model size as a function of the compute budget, but these laws yield substantially different predictions. We explain the discrepancy by reproducing the Kaplan scaling law on two datasets (OpenWebText2 and RefinedWeb) and identifying three factors causing the difference: last layer computational cost, warmup duration, and scale-dependent optimizer tuning. With these factors corrected, we obtain excellent agreement with the Hoffmann et al. (i.e., Chinchilla) scaling law. Counter to a hypothesis of Hoffmann et al., we find that careful learning rate decay is not essential for the validity of their scaling law. As a secondary result, we derive scaling laws for the optimal learning rate and batch size, finding that tuning the AdamW $\beta_2$ parameter is essential at lower batch sizes.

10/29/2024