Recent theoretical work has identified surprisingly simple reasoning problems, such as checking if two nodes in a graph are connected or simulating finite-state machines, that are provably unsolvable by standard transformers that answer immediately after reading their input. However, in practice, transformers' reasoning can be improved by allowing them to use a chain of thought or scratchpad, i.e., generate and condition on a sequence of intermediate tokens before answering. Motivated by this, we ask: Does such intermediate generation fundamentally extend the computational power of a decoder-only transformer? We show that the answer is yes, but the amount of increase depends crucially on the amount of intermediate generation. For instance, we find that transformer decoders with a logarithmic number of decoding steps (w.r.t. the input length) push the limits of standard transformers only slightly, while a linear number of decoding steps, assuming projected pre-norm (a slight generalization of standard pre-norm), adds a clear new ability (under standard complexity conjectures): recognizing all regular languages. Our results also imply that linear steps keep transformer decoders within context-sensitive languages, and polynomial steps with generalized pre-norm make them recognize exactly the class of polynomial-time solvable problems -- the first exact characterization of a type of transformers in terms of standard complexity classes. Together, this provides a nuanced framework for understanding how the length of a transformer's chain of thought or scratchpad impacts its reasoning power.

## Overview

- Researchers have found that standard transformer models, which provide immediate outputs, are limited in their ability to solve certain simple reasoning problems.
- However, transformers can improve their reasoning by generating and conditioning on a sequence of intermediate tokens before answering, known as a "chain of thought" or "scratchpad."
- This paper explores whether this intermediate generation fundamentally extends the computational power of a decoder-only transformer.

## Plain English Explanation

Transformers are a type of [artificial intelligence](https://aimodels.fyi/papers/arxiv/symbolic-framework-evaluating-mathematical-reasoning-generalisation-transformers) model that have become widely used for tasks like language processing and generation. These models typically provide an immediate output after reading their input.

However, recent research has shown that there are some surprisingly simple reasoning problems, such as [checking if two nodes in a graph are connected](https://aimodels.fyi/papers/arxiv/transformers-as-transducers) or [simulating finite-state machines](https://aimodels.fyi/papers/arxiv/theoretical-expressive-power-design-space-higher-order), that standard transformers cannot solve effectively. 

To address this limitation, the researchers in this paper explored whether allowing transformers to use a "chain of thought" or "scratchpad" could fundamentally increase their computational power. This means the transformer would generate and condition on a sequence of intermediate tokens before providing a final answer.

The key finding is that the answer is yes, but the amount of increase in computational power depends heavily on the length of the intermediate generation. For example, a transformer with a logarithmic number of decoding steps (relative to the input length) only slightly improves on standard transformers, while a linear number of decoding steps can allow the transformer to recognize all regular languages, a clear new ability.

The researchers also show that linear decoding steps keep transformers within the class of [context-sensitive languages](https://aimodels.fyi/papers/arxiv/demystifying-chains-trees-graphs-thoughts), while polynomial steps with a certain generalization can make them recognize exactly the class of problems solvable in polynomial time. This provides a nuanced framework for understanding how the length of a transformer's "chain of thought" impacts its reasoning power.

## Technical Explanation

This paper examines whether allowing transformer decoders to generate and condition on a sequence of intermediate tokens, rather than providing a single immediate output, can fundamentally extend their computational power.

The researchers first establish that standard transformer decoders are provably limited in their ability to solve certain simple reasoning problems, such as [checking graph connectivity](https://aimodels.fyi/papers/arxiv/transformers-as-transducers) and [simulating finite-state machines](https://aimodels.fyi/papers/arxiv/theoretical-expressive-power-design-space-higher-order). This is due to their inability to maintain and update an internal state or "scratchpad" as they process the input.

To overcome this limitation, the authors consider transformer decoders that are allowed to generate and condition on a sequence of intermediate tokens before providing a final output. This mimics the "chain of thought" or "scratchpad" that humans often use when solving complex reasoning problems.

The key results are:

- A logarithmic number of decoding steps (relative to input length) only slightly extends the power of standard transformers.
- A linear number of decoding steps, with a slight generalization of standard "pre-norm" layers, allows transformers to recognize all regular languages, a clear new ability.
- Linear decoding steps keep transformers within the class of [context-sensitive languages](https://aimodels.fyi/papers/arxiv/demystifying-chains-trees-graphs-thoughts).
- Polynomial decoding steps with a further generalization make transformers recognize exactly the class of problems solvable in polynomial time.

These findings provide a nuanced framework for understanding how the length of a transformer's "chain of thought" impacts its reasoning capabilities, ranging from slight improvements to the ability to solve more complex problems.

## Critical Analysis

The paper provides a rigorous theoretical analysis of how the ability to generate and condition on intermediate tokens can extend the computational power of transformer decoders. The insights offered are valuable for understanding the fundamental limitations and capabilities of these widely used models.

One potential limitation of the research is that it focuses solely on the theoretical computational power of transformers, without considering practical aspects such as training dynamics, sample efficiency, and real-world performance. While the theoretical results are important, it would be valuable to see how these insights translate to actual transformer-based systems and their performance on relevant tasks.

Additionally, the paper does not explore the implications of these findings for the development of more capable reasoning systems. Further research could investigate how the insights from this work could be leveraged to design transformer architectures or training approaches that better support robust reasoning and problem-solving abilities.

Overall, this paper offers a [significant contribution](https://aimodels.fyi/papers/arxiv/topos-transformer-networks) to the understanding of transformer models and their computational limitations and capabilities. The findings provide a solid foundation for future research in this area.

## Conclusion

This paper investigates whether allowing transformer decoders to generate and condition on a sequence of intermediate tokens, rather than providing a single immediate output, can fundamentally extend their computational power. The key finding is that it can, but the extent of the increase depends heavily on the length of the intermediate generation process.

The researchers show that a logarithmic number of decoding steps only slightly improves on standard transformers, while a linear number of steps can enable transformers to recognize all regular languages. They also characterize the computational classes that transformers with different step counts can represent, providing a nuanced framework for understanding the impact of "chain of thought" or "scratchpad" generation on reasoning abilities.

These insights contribute to a deeper understanding of the fundamental limitations and capabilities of transformer models, which are widely used in various artificial intelligence applications. The work also suggests promising directions for future research on developing more capable reasoning systems, by leveraging the potential of intermediate generation to push the boundaries of what transformers can achieve.