# A Law of Next-Token Prediction in Large Language Models

## Overview

- The paper proposes a law that describes the next-token prediction behavior of large language models (LLMs).
- The law states that the probability of predicting the next token follows a power-law distribution, with a small number of tokens accounting for a large fraction of the total probability.
- The authors investigate when this law emerges and how it is affected by factors like model size and training dataset.

## Plain English Explanation

The paper explores a fascinating pattern observed in how large language models (LLMs) predict the next word in a sequence of text. The researchers discovered that the probability of predicting each possible next word follows a specific mathematical distribution, known as a power-law distribution.

In a power-law distribution, a small number of elements (in this case, words) account for a disproportionately large fraction of the total probability. This means that LLMs are much more confident in predicting a few common words, while the vast majority of words have a much lower probability of being predicted.

The researchers found that this power-law behavior emerges as the models become larger and are trained on more data. They also explored how factors like the size of the model and the diversity of the training dataset can influence the strength of this "law of next-token prediction."

Understanding this pattern in how LLMs make predictions has important implications for how these models learn and represent language. It suggests that LLMs may not just be simple next-token predictors, but that they are capturing deeper structural principles of language.

## Technical Explanation

The paper investigates a "law of next-token prediction" in large language models (LLMs), where the probability of predicting each possible next token follows a power-law distribution. This means that a small fraction of tokens account for a disproportionately large share of the total prediction probability.

The authors analyze this phenomenon across different LLM architectures, model sizes, and training datasets. They find that the power-law behavior emerges as the models become larger and are trained on more diverse datasets. Specifically, they observe that the exponent of the power-law distribution decreases (the distribution becomes more skewed) as the model size and dataset size increase.

The researchers also explore how factors like the amount of model training and the diversity of the training data affect the strength of the power-law behavior. They find that increasing the training dataset size leads to a more pronounced power-law distribution, while increasing the model size has a more complex effect, with the power-law exponent first decreasing and then increasing again.

Additionally, the authors investigate the implications of this law for understanding the inner workings of LLMs. They suggest that the power-law behavior may reflect deeper structural principles in how these models learn and represent language, going beyond simple next-token prediction.

## Critical Analysis

The paper presents a compelling observation about the next-token prediction behavior of large language models, and the authors do a thorough job of investigating the emergence and characteristics of this "law of next-token prediction." However, the paper does not delve into the deeper reasons or mechanisms behind this phenomenon.

One potential limitation is that the analysis is primarily focused on statistical properties of the token probability distributions, without exploring the cognitive or linguistic principles that might underlie this behavior. While the authors suggest that the power-law distribution may reflect deeper structural principles in how LLMs represent language, they do not provide a clear explanation for why this particular distribution arises.

Additionally, the paper does not consider potential implications or applications of this law beyond its theoretical significance. For instance, it would be interesting to explore how this knowledge could be leveraged to improve the performance or interpretability of LLMs, or to gain insights into the cognitive processes involved in human language processing.

Future research could build upon this work by investigating the underlying mechanisms that give rise to the power-law behavior, as well as exploring the practical applications and broader implications of this "law of next-token prediction."

## Conclusion

This paper presents a remarkable discovery about the predictive behavior of large language models: the probability of predicting each possible next token follows a power-law distribution, with a small number of tokens accounting for a disproportionately large fraction of the total probability.

The authors' detailed investigation of this phenomenon provides valuable insights into how these models learn and represent language. While the paper does not fully explain the reasons behind this power-law behavior, it lays the groundwork for further research into the deeper structural principles that govern the functioning of large language models.

Understanding this "law of next-token prediction" could have significant implications for the development and application of LLMs, potentially leading to improved model architectures, more accurate and interpretable predictions, and a deeper understanding of the cognitive processes involved in human language use.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

0

## Related Papers

0

### A Law of Next-Token Prediction in Large Language Models

Hangfeng He, Weijie J. Su

Large language models (LLMs) have been widely employed across various application domains, yet their black-box nature poses significant challenges to understanding how these models process input data internally to make predictions. In this paper, we introduce a precise and quantitative law that governs the learning of contextualized token embeddings through intermediate layers in pre-trained LLMs for next-token prediction. Our findings reveal that each layer contributes equally to enhancing prediction accuracy, from the lowest to the highest layer -- a universal phenomenon observed across a diverse array of open-source LLMs, built on architectures such as Transformer, RWKV, and Mamba. We demonstrate that this law offers new perspectives and insights to inform and guide practices in LLM development and applications, including model scaling, pre-training tasks, and information flow. Overall, our law enables more fine-grained approaches to the design, training, and interpretation of LLMs through scrutinizing their internal data processing mechanisms.

Read more8/27/2024

šļø

0

### LLMs are Not Just Next Token Predictors

Stephen M. Downes, Patrick Forber, Alex Grzankowski

LLMs are statistical models of language learning through stochastic gradient descent with a next token prediction objective. Prompting a popular view among AI modelers: LLMs are just next token predictors. While LLMs are engineered using next token prediction, and trained based on their success at this task, our view is that a reduction to just next token predictor sells LLMs short. Moreover, there are important explanations of LLM behavior and capabilities that are lost when we engage in this kind of reduction. In order to draw this out, we will make an analogy with a once prominent research program in biology explaining evolution and development from the gene's eye view.

Read more8/12/2024

š

97

### Beyond the Black Box: A Statistical Model for LLM Reasoning and Inference

Siddhartha Dalal, Vishal Misra

This paper introduces a novel Bayesian learning model to explain the behavior of Large Language Models (LLMs), focusing on their core optimization metric of next token prediction. We develop a theoretical framework based on an ideal generative text model represented by a multinomial transition probability matrix with a prior, and examine how LLMs approximate this matrix. Key contributions include: (i) a continuity theorem relating embeddings to multinomial distributions, (ii) a demonstration that LLM text generation aligns with Bayesian learning principles, (iii) an explanation for the emergence of in-context learning in larger models, (iv) empirical validation using visualizations of next token probabilities from an instrumented Llama model Our findings provide new insights into LLM functioning, offering a statistical foundation for understanding their capabilities and limitations. This framework has implications for LLM design, training, and application, potentially guiding future developments in the field.

Read more9/25/2024

0

### Performance Law of Large Language Models

Chuhan Wu, Ruiming Tang

Guided by the belief of the scaling law, large language models (LLMs) have achieved impressive performance in recent years. However, scaling law only gives a qualitative estimation of loss, which is influenced by various factors such as model architectures, data distributions, tokenizers, and computation precision. Thus, estimating the real performance of LLMs with different training settings rather than loss may be quite useful in practical development. In this article, we present an empirical equation named Performance Law to directly predict the MMLU score of an LLM, which is a widely used metric to indicate the general capability of LLMs in real-world conversations and applications. Based on only a few key hyperparameters of the LLM architecture and the size of training data, we obtain a quite accurate MMLU prediction of various LLMs with diverse sizes and architectures developed by different organizations in different years. Performance law can be used to guide the choice of LLM architecture and the effective allocation of computational resources without extensive experiments.

Read more9/16/2024