# How Far Can Transformers Reason? The Locality Barrier and Inductive Scratchpad

## Overview

- This paper explores the reasoning capabilities of transformers, a type of machine learning model that has achieved impressive performance on a variety of tasks.
- The authors investigate the limits of transformer's ability to reason about logical relationships, specifically looking at their performance on syllogistic reasoning tasks.
- They introduce the concept of the "locality barrier," which suggests that transformers may struggle to capture long-range dependencies and abstract reasoning required for tasks like syllogisms.
- The researchers also propose a novel architecture called the "Inductive Scratchpad" that aims to improve transformer's reasoning abilities by providing an explicit memory component.

## Plain English Explanation

Transformers are a powerful type of machine learning model that have achieved great success in many areas, such as language processing and generation. However, it's not clear how well they can handle more abstract, logical reasoning tasks.

In this paper, the authors looked at how well transformers can solve syllogisms - a type of logical reasoning problem that involves making deductions based on two premises. For example, if we know that "all birds are animals" and "all ducks are birds," we can logically conclude that "all ducks are animals."

The researchers found that transformers struggle with these types of logical reasoning tasks. They propose that this is due to the "locality barrier" - the idea that transformers have difficulty capturing long-range dependencies and abstract concepts that are crucial for solving syllogisms.

To address this limitation, the authors developed a new transformer-based architecture called the "Inductive Scratchpad." This model includes an explicit memory component that helps the transformer better represent and reason about the abstract logical relationships required for syllogistic reasoning.

The key idea is to give the transformer an "external scratchpad" where it can store and manipulate the logical premises and intermediate reasoning steps, rather than trying to encode all of that information solely in its internal parameters.

## Technical Explanation

The paper begins by examining transformer models' performance on syllogistic reasoning tasks, which require abstracting and composing logical rules. The authors find that standard transformer architectures struggle with these tasks, exhibiting a "locality barrier" - an inability to effectively capture long-range dependencies and logical abstractions.

To address this limitation, the researchers propose a novel "Inductive Scratchpad" transformer architecture. This model includes an explicit memory component that serves as a "scratchpad" for storing and manipulating the logical premises and intermediate reasoning steps. This allows the transformer to more effectively represent and reason about the abstract logical relationships required for syllogistic tasks.

The Inductive Scratchpad architecture consists of a standard transformer encoder, along with an additional scratchpad module. The scratchpad is a learned, content-addressable memory that the transformer can use to store and retrieve relevant logical information during the reasoning process.

Through experiments on a range of syllogistic reasoning benchmarks, the authors demonstrate that the Inductive Scratchpad transformer significantly outperforms standard transformer models, closing the "locality barrier" and achieving strong performance on these abstract logical reasoning tasks.

## Critical Analysis

The paper provides a valuable contribution by clearly identifying a key limitation in transformer models' reasoning capabilities and proposing a novel architectural approach to address it. The "locality barrier" concept is a useful framework for understanding transformers' struggles with tasks that require long-range dependencies and abstract logical reasoning.

However, the paper also acknowledges several caveats and limitations of the research. For example, the syllogistic reasoning tasks used in the experiments may not fully capture the breadth of logical reasoning required in real-world applications. Additionally, the Inductive Scratchpad architecture, while effective on the tested benchmarks, may not generalize as well to more complex or open-ended reasoning problems.

Further research is needed to more comprehensively evaluate transformers' logical reasoning abilities and the generalizability of the Inductive Scratchpad approach. Exploring how this architecture performs on a wider range of reasoning tasks, as well as investigating its scalability and robustness, would be valuable next steps.

## Conclusion

This paper makes an important contribution to our understanding of transformer models' reasoning capabilities. By identifying the "locality barrier" and proposing the Inductive Scratchpad architecture, the authors have taken a significant step towards improving transformers' ability to engage in abstract, logical reasoning.

The findings have implications for the development of more advanced AI systems that can seamlessly integrate language understanding with logical inference and reasoning. As transformer-based models continue to be widely adopted, addressing their limitations in tasks like syllogistic reasoning will be crucial for unlocking their full potential in real-world applications that require robust, flexible intelligence.

The paper serves as a valuable resource for AI researchers and developers, highlighting the importance of carefully examining the underlying reasoning mechanisms of powerful machine learning models like transformers. By better understanding their strengths and weaknesses, we can work towards more capable and trustworthy AI systems that can tackle increasingly complex problems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

6

## Related Papers

6

### How Far Can Transformers Reason? The Locality Barrier and Inductive Scratchpad

Emmanuel Abbe, Samy Bengio, Aryo Lotfi, Colin Sandon, Omid Saremi

Can Transformers predict new syllogisms by composing established ones? More generally, what type of targets can be learned by such models from scratch? Recent works show that Transformers can be Turing-complete in terms of expressivity, but this does not address the learnability objective. This paper puts forward the notion of 'globality degree' of a target distribution to capture when weak learning is efficiently achievable by regular Transformers. This measure shows a contrast with the expressivity results of Transformers captured by $TC^0/TC^1$ classes (further studied here), since the globality relates to correlations with the more limited $NC^0$ class. We show here experimentally and theoretically under additional assumptions that distributions with high globality cannot be learned efficiently. In particular, syllogisms cannot be composed on long chains. Further, we develop scratchpad techniques and show that: (i) agnostic scratchpads cannot break the globality barrier, (ii) educated scratchpads can break the globality with intermediate steps, although not all such scratchpads can generalize out-of-distribution (OOD), (iii) a notion of 'inductive scratchpad', that composes the prior information more efficiently, can both break the globality barrier and improve the OOD generalization. In particular, some of our inductive scratchpads can achieve length generalizations of up to $6times$ for some arithmetic tasks depending on the input formatting.

Read more11/5/2024

↗️

14

### The Expressive Power of Transformers with Chain of Thought

William Merrill, Ashish Sabharwal

Recent theoretical work has identified surprisingly simple reasoning problems, such as checking if two nodes in a graph are connected or simulating finite-state machines, that are provably unsolvable by standard transformers that answer immediately after reading their input. However, in practice, transformers' reasoning can be improved by allowing them to use a chain of thought or scratchpad, i.e., generate and condition on a sequence of intermediate tokens before answering. Motivated by this, we ask: Does such intermediate generation fundamentally extend the computational power of a decoder-only transformer? We show that the answer is yes, but the amount of increase depends crucially on the amount of intermediate generation. For instance, we find that transformer decoders with a logarithmic number of decoding steps (w.r.t. the input length) push the limits of standard transformers only slightly, while a linear number of decoding steps, assuming projected pre-norm (a slight generalization of standard pre-norm), adds a clear new ability (under standard complexity conjectures): recognizing all regular languages. Our results also imply that linear steps keep transformer decoders within context-sensitive languages, and polynomial steps with generalized pre-norm make them recognize exactly the class of polynomial-time solvable problems -- the first exact characterization of a type of transformers in terms of standard complexity classes. Together, this provides a nuanced framework for understanding how the length of a transformer's chain of thought or scratchpad impacts its reasoning power.

Read more4/15/2024

0

### How Transformers Solve Propositional Logic Problems: A Mechanistic Analysis

Guan Zhe Hong, Nishanth Dikkala, Enming Luo, Cyrus Rashtchian, Xin Wang, Rina Panigrahy

Large language models (LLMs) have shown amazing performance on tasks that require planning and reasoning. Motivated by this, we investigate the internal mechanisms that underpin a network's ability to perform complex logical reasoning. We first construct a synthetic propositional logic problem that serves as a concrete test-bed for network training and evaluation. Crucially, this problem demands nontrivial planning to solve, but we can train a small transformer to achieve perfect accuracy. Building on our set-up, we then pursue an understanding of precisely how a three-layer transformer, trained from scratch, solves this problem. We are able to identify certain planning and reasoning circuits in the network that necessitate cooperation between the attention blocks to implement the desired logic. To expand our findings, we then study a larger model, Mistral 7B. Using activation patching, we characterize internal components that are critical in solving our logic problem. Overall, our work systemically uncovers novel aspects of small and large transformers, and continues the study of how they plan and reason.

Read more11/8/2024

↗️

0

### When can transformers reason with abstract symbols?

Enric Boix-Adsera, Omid Saremi, Emmanuel Abbe, Samy Bengio, Etai Littwin, Joshua Susskind

We investigate the capabilities of transformer models on relational reasoning tasks. In these tasks, models are trained on a set of strings encoding abstract relations, and are then tested out-of-distribution on data that contains symbols that did not appear in the training dataset. We prove that for any relational reasoning task in a large family of tasks, transformers learn the abstract relations and generalize to the test set when trained by gradient descent on sufficiently large quantities of training data. This is in contrast to classical fully-connected networks, which we prove fail to learn to reason. Our results inspire modifications of the transformer architecture that add only two trainable parameters per head, and that we empirically demonstrate improve data efficiency for learning to reason.

Read more4/17/2024