Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Overview
 This paper explores techniques for scaling inference compute with repeated sampling in large language models.
 It investigates how coverage, the fraction of problems solved by any attempt, grows as more samples are generated per problem.
 A key finding is that sampling a cheaper model several times can solve more problems, at lower cost, than a single sample from a more expensive frontier model.
Plain English Explanation
The paper is about getting better results from large language models, which are AI systems that can generate human-like text, by letting them attempt each problem many times. In typical use, a model gets only one attempt per problem.
The researchers treat the compute spent on "inference" (the process of generating output from a trained model) as something to scale up. With repeated sampling, the model generates many candidate answers to the same problem, and the fraction of problems solved by at least one candidate keeps rising as the number of samples grows.
This helps most in domains like coding and formal proofs, where every answer can be checked automatically, so a correct candidate can be picked out reliably. Where no automatic checker exists, choosing the right answer from many candidates remains an open problem.
Technical Explanation
The paper treats inference compute as another axis for scaling in large language models. Rather than limiting a trained model to one attempt per problem, the authors increase the number of samples generated per problem and measure how performance changes.
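The abstract of the original paper defines coverage as the fraction of problems solved by any attempt, and says its growth with the number of samples can be modelled with an exponentiated power law. The block below is a sketch of what that could look like. The symbols $k$, $P$, $c(k)$, $a$, and $b$ are notation assumed here for illustration, and the exact parameterisation used by the authors may differ.

```latex
% c(k): coverage after k samples per problem, over a set of P problems
c(k) = \frac{1}{P} \sum_{i=1}^{P}
  \mathbf{1}\left[\text{at least one of the } k \text{ samples solves problem } i\right]

% Assumed exponentiated power law with fitted constants a and b
\log c(k) \approx a\,k^{b}
\quad\Longleftrightarrow\quad
c(k) \approx \exp\left(a\,k^{b}\right)
```

Under this assumed form, choosing $a < 0$ and $b < 0$ makes $c(k)$ rise toward 1 as $k$ grows.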
The key contributions include:

Repeated Sampling: The authors generate many samples per problem from the language model, rather than a single attempt. Across multiple tasks and models, coverage (the fraction of problems solved by any attempt) scales with the number of samples over four orders of magnitude.

Adaptive Compute Allocation: They develop strategies to dynamically adjust the amount of compute used during inference based on the difficulty of the generation task. This allows for more efficient use of resources.

Ensemble-based Approaches: The paper investigates combining the outputs of multiple language models or sampling approaches to further improve the efficiency and quality of the generated text.
Through a series of experiments, the researchers show that gains in coverage translate directly into improved performance in domains where answers can be verified automatically. On SWE-bench Lite, for example, the fraction of issues solved with DeepSeek-V2-Coder-Instruct rises from 15.9% with one sample to 56% with 250 samples, compared with a single-attempt state of the art of 43%.
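Coverage at a given sample budget can be estimated from a larger pool of samples. The sketch below uses the widely used unbiased pass@k estimator, and assuming it matches the authors' evaluation is a guess on our part. The function names `pass_at_k` and `coverage`, and the per-problem counts in the example, are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples is correct).

    n: samples drawn for the problem, c: how many were correct,
    k: sample budget being evaluated (k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset has a hit.
        return 1.0
    # 1 - P(all k samples come from the n - c incorrect ones)
    return 1.0 - comb(n - c, k) / comb(n, k)


def coverage(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems: the estimated fraction solved by any attempt."""
    if not correct_counts:
        raise ValueError("need at least one problem")
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    # Hypothetical data: 4 problems, 10 samples each, with these correct counts.
    counts = [0, 1, 5, 10]
    for k in (1, 5, 10):
        print(k, coverage(counts, n=10, k=k))
```

The estimate never decreases as `k` grows, which is why coverage curves can only rise with a larger sample budget.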
Critical Analysis
The paper provides a thoughtful and thorough exploration of methods to scale inference compute for large language models. The focus on improving the efficiency of the text generation process is well-motivated, as compute requirements have been a key limitation in the broader adoption of these powerful AI systems.
However, the paper does acknowledge some potential caveats and areas for future work. For example, the adaptive compute allocation approach may not be as effective for generation tasks with high variance in complexity. Additionally, the ensemble-based methods could introduce additional latency or overhead that may limit their practical applicability in certain scenarios.
Further research would be valuable to better understand the tradeoffs between inference efficiency, text quality, and other practical considerations. Exploring the generalization of these techniques to a wider range of language models and use cases would also be an important next step.
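One limitation stated in the paper's abstract is that common ways of picking a correct solution from a sample collection, such as majority voting, plateau beyond several hundred samples. The sketch below illustrates the gap between a problem being covered and the correct answer being selected. It assumes final answers have already been normalised to comparable strings, and the names `majority_vote` and `is_covered` are hypothetical.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer (ties go to the one seen first)."""
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]


def is_covered(answers: list[str], reference: str) -> bool:
    """True if any sampled answer matches the reference answer."""
    return reference in answers


if __name__ == "__main__":
    # Hypothetical samples for one math problem whose reference answer is "42".
    samples = ["41", "42", "41", "7"]
    print("covered:", is_covered(samples, "42"))
    print("majority vote:", majority_vote(samples))
```

In this made-up case the problem counts as covered, yet majority voting selects a wrong answer because the correct one is in the minority.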
Overall, this paper represents a significant contribution to the field, providing novel strategies to make large language models more accessible and usable in real-world applications. The insights and methods presented could have a substantial impact on the future development and deployment of these transformative AI technologies.
Conclusion
This paper tackles the challenge of scaling inference compute for large language models. By introducing methods like repeated sampling, adaptive compute allocation, and ensemble-based approaches, the researchers show that spending more compute at inference time can substantially increase the fraction of problems solved, and that repeatedly sampling a cheaper model can be more cost-effective than a single sample from a premium model.
These advancements could unlock new use cases and applications for large language models, empowering a wider range of users and organizations to leverage these powerful AI systems. As language models continue to grow in scale and capability, the insights from this work will be instrumental in ensuring these technologies can be deployed more broadly and responsibly in the real world.
This summary was produced with help from an AI and may contain inaccuracies. Check out the links to read the original source documents!
Related Papers
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, Azalia Mirhoseini
Scaling the amount of compute used to train language models has dramatically improved their capabilities. However, when it comes to inference, we often limit the amount of compute to only one attempt per problem. Here, we explore inference compute as another axis for scaling by increasing the number of generated samples. Across multiple tasks and models, we observe that coverage, the fraction of problems solved by any attempt, scales with the number of samples over four orders of magnitude. In domains like coding and formal proofs, where all answers can be automatically verified, these increases in coverage directly translate into improved performance. When we apply repeated sampling to SWE-bench Lite, the fraction of issues solved with DeepSeek-V2-Coder-Instruct increases from 15.9% with one sample to 56% with 250 samples, outperforming the single-attempt state-of-the-art of 43% which uses more capable frontier models. Moreover, using current API pricing, amplifying the cheaper DeepSeek model with five samples is more cost-effective and solves more issues than paying a premium for one sample from GPT-4o or Claude 3.5 Sonnet. Interestingly, the relationship between coverage and the number of samples is often log-linear and can be modelled with an exponentiated power law, suggesting the existence of inference-time scaling laws. Finally, we find that identifying correct samples out of many generations remains an important direction for future research in domains without automatic verifiers. When solving math word problems from GSM8K and MATH, coverage with Llama-3 models grows to over 95% with 10,000 samples. However, common methods to pick correct solutions from a sample collection, such as majority voting or reward models, plateau beyond several hundred samples and fail to fully scale with the sample budget.
8/1/2024
Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws
Nikhil Sardana, Jacob Portes, Sasha Doubov, Jonathan Frankle
Large language model (LLM) scaling laws are empirical formulas that estimate changes in model quality as a result of increasing parameter count and training data. However, these formulas, including the popular DeepMind Chinchilla scaling laws, neglect to include the cost of inference. We modify the Chinchilla scaling laws to calculate the optimal LLM parameter count and pretraining data size to train and deploy a model of a given quality and inference demand. We conduct our analysis both in terms of a compute budget and real-world costs and find that LLM researchers expecting reasonably large inference demand (~1B requests) should train models smaller and longer than Chinchilla-optimal. Furthermore, we train 47 models of varying sizes and parameter counts to validate our formula and find that model quality continues to improve as we scale tokens per parameter to extreme ranges (up to 10,000). Finally, we ablate the procedure used to fit the Chinchilla scaling law coefficients and find that developing scaling laws only from data collected at typical token/parameter ratios overestimates the impact of additional tokens at these extreme ranges.
7/19/2024
An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, Yiming Yang
The optimal training configurations of large language models (LLMs) with respect to model sizes and compute budgets have been extensively studied. But how to optimally configure LLMs during inference has not been explored in sufficient depth. We study compute-optimal inference: designing models and inference strategies that optimally trade off additional inference-time compute for improved performance. As a first step towards understanding and designing compute-optimal inference methods, we assessed the effectiveness and computational efficiency of multiple inference strategies such as Greedy Search, Majority Voting, Best-of-N, Weighted Voting, and their variants on two different Tree Search algorithms, involving different model sizes and computational budgets. We found that a smaller language model with a novel tree search algorithm typically achieves a Pareto-optimal tradeoff. These results highlight the potential benefits of deploying smaller models equipped with more sophisticated decoding algorithms in budget-constrained scenarios, e.g., on end-devices, to enhance problem-solving accuracy. For instance, we show that the Llemma-7B model can achieve competitive accuracy to a Llemma-34B model on MATH500 while using $2\times$ less FLOPs. Our findings could potentially apply to any generation task with a well-defined measure of success.
8/2/2024
More Compute Is What You Need
Zhen Guo
Large language model pretraining has become increasingly expensive, with most practitioners relying on scaling laws to allocate compute budgets for model size and training tokens, commonly referred to as Compute-Optimal or Chinchilla Optimal. In this paper, we hypothesize a new scaling law that suggests model performance depends mostly on the amount of compute spent for transformer-based models, independent of the specific allocation to model size and dataset size. Using this unified scaling law, we predict that (a) for inference efficiency, training should prioritize smaller model sizes and larger training datasets, and (b) assuming the exhaustion of available web datasets, scaling the model size might be the only way to further improve model performance.
5/3/2024