# Transformers Can Do Arithmetic with the Right Embeddings

arXiv: 2405.17399

## Abstract

The poor performance of transformers on arithmetic tasks seems to stem in large part from their inability to keep track of the exact position of each digit inside of a large span of digits. We mend this problem by adding an embedding to each digit that encodes its position relative to the start of the number. In addition to the boost these embeddings provide on their own, we show that this fix enables architectural modifications such as input injection and recurrent layers to improve performance even further. With positions resolved, we can study the logical extrapolation ability of transformers. Can they solve arithmetic problems that are larger and more complex than those in their training data? We find that training on only 20 digit numbers with a single GPU for one day, we can reach state-of-the-art performance, achieving up to 99% accuracy on 100 digit addition problems. Finally, we show that these gains in numeracy also unlock improvements on other multi-step reasoning tasks including sorting and multiplication.

## Overview

- This paper investigates why Transformers perform poorly on arithmetic and traces much of the problem to their inability to track the exact position of each digit within a long span of digits.
- The researchers add an embedding to each digit that encodes its position relative to the start of the number, and show that this fix also lets architectural changes such as input injection and recurrent layers improve performance further.
- Trained only on 20-digit numbers with a single GPU for one day, the models reach up to 99% accuracy on 100-digit addition, and the gains carry over to other multi-step reasoning tasks such as sorting and multiplication.

## Plain English Explanation

The researchers in this paper wanted to see whether Transformer-based language models can do simple math on the numbers they encounter in text. Language models are AI systems trained on huge amounts of text data to understand and generate human language.

The key question the researchers explored is: why do Transformers get arithmetic wrong, and can a better input representation fix it? Their answer is that the models lose track of where each digit sits inside a long number. The fix is to attach to every digit an embedding that encodes its position relative to the start of the number.

With these embeddings, the Transformers learned addition well enough to extrapolate. Trained only on numbers of up to 20 digits, they reached up to 99% accuracy on 100-digit addition problems, and the same change also improved results on sorting and multiplication. The paper provides insights into what Transformers need in order to work reliably with numerical information in text.

## Technical Explanation

The researchers investigated the numeric reasoning capabilities of Transformer language models by designing a suite of arithmetic tasks. They explored how the choice of numerical embedding - the way the model represents numbers in its internal computations - impacts the model's ability to perform basic arithmetic operations.
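To make the abstract's idea of "position relative to the start of the number" concrete, here is a minimal sketch. It assumes character-level tokens and assigns index 0 to non-digit tokens; the function name `digit_position_ids` and both of those choices are illustrative assumptions, not details taken from the paper.

```python
def digit_position_ids(tokens):
    """For each token, return its 1-based offset inside the run of digits
    it belongs to, or 0 if the token is not a digit (an assumed convention)."""
    ids = []
    run = 0  # number of digits seen so far in the current number
    for tok in tokens:
        if tok.isdigit():
            run += 1
            ids.append(run)
        else:
            run = 0  # a non-digit token ends the current number
            ids.append(0)
    return ids


tokens = list("123+4567=")
print(digit_position_ids(tokens))  # [1, 2, 3, 0, 1, 2, 3, 4, 0]
```

In a model, indices like these would typically select rows of a learned embedding table that are added to the token embeddings, so that every digit carries its within-number position regardless of where the number appears in the sequence.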

According to the abstract, the embedding that matters is positional: each digit receives an embedding that encodes its position relative to the start of the number, which gives the model the exact digit position it otherwise fails to track. Beyond the boost these embeddings provide on their own, the authors report that the fix enables architectural modifications such as input injection and recurrent layers to improve performance even further.

With digit positions resolved, the authors study logical extrapolation: whether the models can solve arithmetic problems larger and more complex than those in their training data. Training on only 20-digit numbers with a single GPU for one day, they report state-of-the-art performance, with up to 99% accuracy on 100-digit addition problems. The gains in numeracy also carry over to other multi-step reasoning tasks, including sorting and multiplication.

The paper provides valuable insights into the inner workings of Transformer language models and their ability to reason about numerical information. The results suggest that these models can be trained to exhibit basic "number sense", but significant challenges remain in developing their full arithmetic and mathematical reasoning capabilities.

## Critical Analysis

The paper makes a valuable contribution by systematically exploring the numeric reasoning abilities of Transformer language models. The experimental setup and analysis are rigorous, and the findings offer important insights into the strengths and limitations of these models when it comes to working with numerical information.

That said, the paper acknowledges several caveats and areas for further research. For example, the arithmetic tasks examined in the study are relatively simple, and it remains to be seen whether Transformers can handle more complex mathematical operations or reasoning. Additionally, the paper does not address the practical implications of these findings for real-world applications of language models.

One potential concern is the reliance on specific numerical embedding schemes. While the researchers demonstrate the importance of this design choice, it's unclear how these embedding strategies would scale or generalize to more diverse numerical data encountered in real-world settings. Further work is needed to develop more robust and flexible numerical representations for Transformer models.

Additionally, the paper does not explore the potential role of pretraining or fine-tuning in enhancing the numeric reasoning capabilities of Transformers. A related study, "Exploring Internal Numeracy: A Case Study of Language Models", has shown that some degree of numeric reasoning can emerge during standard language model pretraining, suggesting that more targeted training approaches may lead to further improvements.

Overall, this paper provides a valuable foundation for understanding the numeric reasoning abilities of Transformer language models. The findings highlight the importance of considering numerical representations and the limitations of current approaches, paving the way for future research to address these challenges and unlock the full mathematical potential of these powerful language models.

## Conclusion

This paper investigates the numeric reasoning capabilities of Transformer language models, showing that an embedding encoding each digit's position relative to the start of its number largely resolves their difficulty with arithmetic. With this change, models trained only on 20-digit numbers reach up to 99% accuracy on 100-digit addition, and the improvement extends to sorting and multiplication.

The results offer important insights into the inner workings of these language models and the critical role of numerical representations in enabling numeric reasoning. While the findings suggest that Transformers can exhibit a basic "number sense", significant challenges remain in developing their full mathematical reasoning capabilities.

Future research should explore more advanced numerical representations and training approaches to further enhance the Transformers' ability to work with numerical information in practical applications. By addressing these challenges, the field can unlock the full potential of large language models to engage in more sophisticated mathematical reasoning and problem-solving.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

## Related Papers

### Position Coupling: Leveraging Task Structure for Improved Length Generalization of Transformers

Hanseul Cho, Jaeyoung Cha, Pranjal Awasthi, Srinadh Bhojanapalli, Anupam Gupta, Chulhee Yun

Even for simple arithmetic tasks like integer addition, it is challenging for Transformers to generalize to longer sequences than those encountered during training. To tackle this problem, we propose position coupling, a simple yet effective method that directly embeds the structure of the tasks into the positional encoding of a (decoder-only) Transformer. Taking a departure from the vanilla absolute position mechanism assigning unique position IDs to each of the tokens, we assign the same position IDs to two or more relevant tokens; for integer addition tasks, we regard digits of the same significance as in the same position. On the empirical side, we show that with the proposed position coupling, a small (1-layer) Transformer trained on 1 to 30-digit additions can generalize up to 200-digit additions (6.67x of the trained length). On the theoretical side, we prove that a 1-layer Transformer with coupled positions can solve the addition task involving exponentially many digits, whereas any 1-layer Transformer without positional information cannot entirely solve it. We also demonstrate that position coupling can be applied to other algorithmic tasks such as addition with multiple summands, Nx2 multiplication, copy/reverse, and a two-dimensional task.

6/3/2024
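The position coupling idea in the entry above, giving digits of the same significance the same position ID, can be sketched as follows. This is a simplified illustration only: it assumes character-level tokens, gives the ones digit ID 1, and gives operators ID 0. The function name `coupled_position_ids` and those conventions are assumptions and may differ from the scheme used in the paper.

```python
def coupled_position_ids(tokens):
    """Assign each digit an ID equal to its significance within its number
    (ones digit -> 1, tens digit -> 2, ...). Non-digit tokens get 0."""
    ids = [0] * len(tokens)
    i = 0
    while i < len(tokens):
        if tokens[i].isdigit():
            # Find the end of the current run of digits (one number).
            j = i
            while j < len(tokens) and tokens[j].isdigit():
                j += 1
            # tokens[i:j] is one number; count significance from the right.
            for k in range(i, j):
                ids[k] = j - k
            i = j
        else:
            i += 1
    return ids


print(coupled_position_ids(list("653+49=702")))
# [3, 2, 1, 0, 2, 1, 0, 3, 2, 1]
```

Here the ones digits of both operands and of the answer all share ID 1, the tens digits share ID 2, and so on, which is the kind of task structure the abstract says is embedded into the positional encoding.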

### Explicitly Encoding Structural Symmetry is Key to Length Generalization in Arithmetic Tasks

Mahdi Sabbaghi, George Pappas, Hamed Hassani, Surbhi Goel

Despite the success of Transformers on language understanding, code generation, and logical reasoning, they still fail to generalize over length on basic arithmetic tasks such as addition and multiplication. A major reason behind this failure is the vast difference in structure between numbers and text; for example, the numbers are typically parsed from right to left, and there is a correspondence between digits at the same position across different numbers. In contrast, for text, such symmetries are quite unnatural. In this work, we propose to encode these semantics explicitly into the model via modified number formatting and custom positional encodings. Empirically, our method allows a Transformer trained on numbers with at most 5 digits for addition and multiplication to generalize up to 50-digit numbers, without using additional data for longer sequences. We further demonstrate that traditional absolute positional encodings (APE) fail to generalize to longer sequences, even when trained with augmented data that captures task symmetries. To elucidate the importance of explicitly encoding structure, we prove that explicit incorporation of structure via positional encodings is necessary for out-of-distribution generalization. Finally, we pinpoint other challenges inherent to length generalization beyond capturing symmetries, in particular complexity of the underlying task, and propose changes in the training distribution to address them.

6/5/2024
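The right-to-left structure mentioned in the entry above is the structure of ordinary schoolbook addition. The sketch below implements that textbook algorithm on digit strings to show the digit-to-digit correspondence; it is not the paper's method, and the function name `add_digit_strings` is hypothetical.

```python
def add_digit_strings(a, b):
    """Schoolbook addition of two non-negative integers given as digit strings."""
    # Reverse both numbers so index i always means "10**i place".
    ra, rb = a[::-1], b[::-1]
    out = []
    carry = 0
    for i in range(max(len(ra), len(rb))):
        da = int(ra[i]) if i < len(ra) else 0  # pad the shorter number with 0
        db = int(rb[i]) if i < len(rb) else 0
        carry, digit = divmod(da + db + carry, 10)
        out.append(str(digit))
    if carry:
        out.append(str(carry))
    # Reverse back to the usual most-significant-first order.
    return "".join(out)[::-1]


print(add_digit_strings("653", "49"))  # 702
```

Each output digit depends only on the two input digits at the same place value plus a carry from the place to its right, which is why knowing the exact position of every digit matters so much for this task.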

### Language Models Do Hard Arithmetic Tasks Easily and Hardly Do Easy Arithmetic Tasks

Andrew Gambardella, Yusuke Iwasawa, Yutaka Matsuo

The ability (and inability) of large language models (LLMs) to perform arithmetic tasks has been the subject of much theoretical and practical debate. We show that LLMs are frequently able to correctly and confidently predict the first digit of n-digit by m-digit multiplication tasks without using chain of thought reasoning, despite these tasks requiring compounding operations to solve. Simultaneously, LLMs in practice often fail to correctly or confidently predict the last digit of an n-digit by m-digit multiplication, a task equivalent to 1-digit by 1-digit multiplication which can be easily learned or memorized. We show that the latter task can be solved more robustly when the LLM is conditioned on all of the correct higher-order digits, which on average increases the confidence of the correct last digit on 5-digit by 5-digit multiplication tasks using Llama 2-13B by over 230% (0.13 to 0.43) and Mistral-7B by 150% (0.22 to 0.55).

6/5/2024
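Two pieces of arithmetic in the entry above are easy to check directly: the last digit of a product depends only on the last digits of its factors, and the quoted percentage gains follow from the before and after confidence values. The helper names below are illustrative, and the example operands are arbitrary.

```python
def last_digit_of_product(a, b):
    """Last digit of a * b, computed from the last digits of a and b only."""
    return ((a % 10) * (b % 10)) % 10


def relative_increase(before, after):
    """Relative change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before


# The shortcut agrees with full multiplication of two 5-digit numbers.
assert last_digit_of_product(12345, 67899) == (12345 * 67899) % 10

# Confidence values quoted in the abstract above.
print(round(relative_increase(0.13, 0.43) * 100))  # 231, i.e. "over 230%"
print(round(relative_increase(0.22, 0.55) * 100))  # 150
```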

### From Interpolation to Extrapolation: Complete Length Generalization for Arithmetic Transformers

Shaoxiong Duan, Yining Shi, Wei Xu

In this paper, we investigate the inherent capabilities of transformer models in learning arithmetic algorithms, such as addition and parity. Through experiments and attention analysis, we identify a number of crucial factors for achieving optimal length generalization. We show that transformer models are able to generalize to long lengths with the help of targeted attention biasing. In particular, our solution solves the Parity task, a well-known and theoretically proven failure mode for Transformers. We then introduce Attention Bias Calibration (ABC), a calibration stage that enables the model to automatically learn the proper attention biases, which we show to be connected to mechanisms in relative position encoding. We demonstrate that using ABC, the transformer model can achieve unprecedented near-perfect length generalization on certain arithmetic tasks. In addition, we show that ABC bears remarkable similarities to RPE and LoRA, which may indicate the potential for applications to more complex tasks.

5/13/2024