SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling with Backtracking

arXiv:2306.05426

Published 5/7/2024 by Chris Cundy, Stefano Ermon

Abstract

In many domains, autoregressive models can attain high likelihood on the task of predicting the next observation. However, this maximum-likelihood (MLE) objective does not necessarily match a downstream use-case of autoregressively generating high-quality sequences. The MLE objective weights sequences proportionally to their frequency under the data distribution, with no guidance for the model's behaviour out of distribution (OOD), leading to compounding error during autoregressive generation. In order to address this compounding error problem, we formulate sequence generation as an imitation learning (IL) problem. This allows us to minimize a variety of divergences between the distribution of sequences generated by an autoregressive model and sequences from a dataset, including divergences with weight on OOD generated sequences. The IL framework also allows us to incorporate backtracking by introducing a backspace action into the generation process. This further mitigates the compounding error problem by allowing the model to revert a sampled token if it takes the sequence OOD. Our resulting method, SequenceMatch, can be implemented without adversarial training or architectural changes. We identify the SequenceMatch-$\chi^2$ divergence as a more suitable training objective for autoregressive models which are used for generation. We show that empirically, SequenceMatch training leads to improvements over MLE on text generation with language models and arithmetic.

Overview

  • Autoregressive models trained with the maximum-likelihood (MLE) objective can predict the next observation well, but that objective does not necessarily lead to high-quality sequence generation.
  • The MLE objective weights sequences by their frequency in the data and gives no guidance for behavior outside the training distribution, leading to compounding errors during generation.
  • To address this, the paper formulates sequence generation as an imitation learning (IL) problem, minimizing divergences between the generated and training distributions, including for out-of-distribution (OOD) sequences.
  • The IL framework also allows incorporating backtracking, where the model can revert a sampled token if it takes the sequence OOD.
  • The resulting method, SequenceMatch, can be implemented without adversarial training or architectural changes.
  • The SequenceMatch-$\chi^2$ divergence is identified as a more suitable training objective for autoregressive generation models.

Plain English Explanation

Autoregressive models are good at predicting the next piece of a sequence, like the next word in a sentence. However, this doesn't necessarily mean they can generate high-quality, coherent sequences. The standard training objective, called maximum-likelihood estimation (MLE), weights each sequence by how often it appears in the training data. It gives the model no guidance once a generated sequence drifts away from anything seen in training, so a small early mistake can push later predictions further off course, and the errors compound over time.
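To make the compounding concrete, here is a small illustrative calculation that is not taken from the paper. It assumes, purely for illustration, that each generated token independently has a fixed probability `eps` of pushing the sequence out of distribution and that the model never recovers afterwards. Under that simplified assumption, the chance that a whole sequence of `length` tokens stays in distribution is `(1 - eps) ** length`. The function name and parameters are hypothetical.

```python
def prob_in_distribution(eps: float, length: int) -> float:
    """Chance that every one of `length` tokens stays in distribution.

    Assumes (for illustration only) an independent per-token error
    probability `eps` and no recovery once an error has occurred.
    """
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps must be a probability in [0, 1]")
    if length < 0:
        raise ValueError("length must be non-negative")
    return (1.0 - eps) ** length


if __name__ == "__main__":
    # Even a 1% per-token error rate erodes quickly with sequence length.
    for length in (10, 100, 1000):
        print(length, prob_in_distribution(0.01, length))
```

With a 1% per-token error rate, this toy model gives roughly a 37% chance of a 100-token sequence staying in distribution, which is why long generations are far more fragile than next-token accuracy suggests.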

To address this, the paper proposes formulating sequence generation as an "imitation learning" problem. This means training the model so that the sequences it generates match the distribution of sequences in the training data, including penalizing generated sequences that fall outside that distribution. The imitation learning framework also allows the model to "backtrack" and undo a previously sampled token if it starts generating poor sequences.

The resulting method, called SequenceMatch, can be implemented without complex changes to the model architecture or training process. The authors identify a specific type of divergence measure, called the SequenceMatch-$\chi^2$ divergence, as particularly well-suited for training autoregressive models for generation tasks.

Technical Explanation

The paper proposes addressing the compounding error problem in autoregressive generation by formulating the task as an "imitation learning" problem. This allows minimizing a variety of divergences between the distribution of sequences generated by the autoregressive model and the distribution of sequences in the training data.

Importantly, this includes divergences that place weight on out-of-distribution (OOD) generated sequences, which the standard maximum-likelihood estimation (MLE) objective does not. The imitation learning framework also enables incorporating a "backspace" action, where the model can revert a previously sampled token if it takes the sequence OOD.
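The backspace mechanism can be pictured as an extra action in the generation loop. The following is a minimal sketch of that state transition only, not the authors' implementation: the token name `<bksp>`, the helper names, and the scripted toy policy are all hypothetical, and how the action is represented inside a real model is not described here.

```python
from typing import Callable, Iterable, List

BACKSPACE = "<bksp>"  # hypothetical name for the backspace action
EOS = "<eos>"         # hypothetical end-of-sequence token


def apply_action(tokens: List[str], action: str) -> List[str]:
    """Return the sequence that results from taking one action."""
    if action == BACKSPACE:
        # Revert the most recently sampled token (no-op on an empty sequence).
        return tokens[:-1]
    return tokens + [action]


def generate(policy: Callable[[List[str]], str], max_steps: int) -> List[str]:
    """Roll out a policy that may emit ordinary tokens or BACKSPACE."""
    tokens: List[str] = []
    for _ in range(max_steps):
        action = policy(tokens)
        tokens = apply_action(tokens, action)
        if action == EOS:
            break
    return tokens


def scripted_policy(actions: Iterable[str]) -> Callable[[List[str]], str]:
    """Toy stand-in for a model: replays a fixed list of actions."""
    it = iter(actions)
    return lambda tokens: next(it)


if __name__ == "__main__":
    # The policy writes "3", deletes it, and writes "2" instead.
    policy = scripted_policy(["1", "+", "3", BACKSPACE, "2", EOS])
    print(generate(policy, max_steps=10))
```

In this toy rollout the mistaken token `3` is removed before generation continues, so the final sequence is `1 + 2 <eos>`. A real model would choose the backspace action itself, based on the current prefix.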

The resulting method, called SequenceMatch, can be implemented without adversarial training or architectural changes to the autoregressive model. The authors identify the SequenceMatch-$\chi^2$ divergence as a more suitable training objective for models used for generation, since, unlike the MLE objective, it places weight on out-of-distribution sequences that the model itself generates.
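For reference, the contrast between the two kinds of objective can be written with standard textbook definitions. The notation ($p_{\text{data}}$ for the data distribution, $p_\theta$ for the model) is assumed here for illustration, and the second line is the generic Pearson $\chi^2$-divergence; the exact variant used by SequenceMatch-$\chi^2$ is defined in the paper and may differ.

```latex
% Maximizing likelihood is equivalent to minimizing the forward KL divergence,
% up to a constant that does not depend on the model parameters \theta.
\begin{align*}
\mathrm{KL}\!\left(p_{\text{data}} \,\|\, p_\theta\right)
  &= \mathbb{E}_{x \sim p_{\text{data}}}
     \left[ \log \frac{p_{\text{data}}(x)}{p_\theta(x)} \right] \\
% Generic Pearson chi-squared divergence between distributions p and q.
\chi^2\!\left(p \,\|\, q\right)
  &= \mathbb{E}_{x \sim q}
     \left[ \left( \frac{p(x)}{q(x)} - 1 \right)^{2} \right]
\end{align*}
```

The expectation in the KL term runs only over sequences drawn from the data, so it never directly evaluates the model on prefixes that the data does not contain. A divergence whose expectation involves samples from the model can assign a cost to those sequences as well.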

The paper demonstrates empirical improvements of SequenceMatch over MLE training on text generation tasks using language models, as well as on an arithmetic task.

Critical Analysis

The paper presents a novel approach to training autoregressive models for high-quality sequence generation, addressing a key limitation of the standard MLE objective. The imitation learning framework and incorporation of backtracking are interesting technical contributions.

However, the paper does not deeply explore the limitations of the SequenceMatch approach. For example, it is not clear how the method would scale to very large or diverse datasets, or how sensitive the performance is to hyperparameter choices. Additionally, the relationship between the internal language model and sequence-discriminative objectives could be further investigated.

The robustness of the SequenceMatch objectives to distributional shift or adversarial perturbations is also an open question. Lastly, the paper does not situate the SequenceMatch approach within the broader context of sequence-to-sequence generation methods, which could provide additional insight.

Overall, the paper presents a promising direction for improving autoregressive generation, but more research is needed to fully understand the strengths, weaknesses, and scope of applicability of the SequenceMatch approach.

Conclusion

This paper proposes a novel approach to training autoregressive models for high-quality sequence generation, formulating the task as an imitation learning problem. By minimizing divergences between the generated and training distributions, including for out-of-distribution sequences, and incorporating a backtracking mechanism, the SequenceMatch method can outperform standard maximum-likelihood training.

The key insight is that the MLE objective, while effective for predicting the next observation, does not necessarily align with generating coherent, high-quality sequences. The imitation learning framework provides a principled way to address this mismatch, with the potential for broader applicability in other generative modeling domains.

While the paper demonstrates promising empirical results, further research is needed to fully understand the strengths, limitations, and best practices for applying the SequenceMatch approach. Nonetheless, this work represents an important step forward in improving the sequence generation capabilities of autoregressive models.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Recasting Continual Learning as Sequence Modeling

Soochan Lee, Jaehyeon Son, Gunhee Kim

In this work, we aim to establish a strong connection between two significant bodies of machine learning research: continual learning and sequence modeling. That is, we propose to formulate continual learning as a sequence modeling problem, allowing advanced sequence models to be utilized for continual learning. Under this formulation, the continual learning process becomes the forward pass of a sequence model. By adopting the meta-continual learning (MCL) framework, we can train the sequence model at the meta-level, on multiple continual learning episodes. As a specific example of our new formulation, we demonstrate the application of Transformers and their efficient variants as MCL methods. Our experiments on seven benchmarks, covering both classification and regression, show that sequence models can be an attractive solution for general MCL.

5/31/2024

Sequence-to-sequence models in peer-to-peer learning: A practical application

Robert Šajina, Ivo Ipšić

This paper explores the applicability of sequence-to-sequence (Seq2Seq) models based on LSTM units for Automatic Speech Recognition (ASR) task within peer-to-peer learning environments. Leveraging two distinct peer-to-peer learning methods, the study simulates the learning process of agents and evaluates their performance in ASR task using two different ASR datasets. In a centralized training setting, utilizing a scaled-down variant of the Deep Speech 2 model, a single model achieved a Word Error Rate (WER) of 84% when trained on the UserLibri dataset, and 38% when trained on the LJ Speech dataset. Conversely, in a peer-to-peer learning scenario involving 55 agents, the WER ranged from 87% to 92% for the UserLibri dataset, and from 52% to 56% for the LJ Speech dataset. The findings demonstrate the feasibility of employing Seq2Seq models in decentralized settings, albeit with slightly higher Word Error Rates (WER) compared to centralized training methods.

6/6/2024

CALRec: Contrastive Alignment of Generative LLMs For Sequential Recommendation

Yaoyiran Li, Xiang Zhai, Moustafa Alzantot, Keyi Yu, Ivan Vulić, Anna Korhonen, Mohamed Hammad

Traditional recommender systems such as matrix factorization methods rely on learning a shared dense embedding space to represent both items and user preferences. Sequence models such as RNN, GRUs, and, recently, Transformers have also excelled in the task of sequential recommendation. This task requires understanding the sequential structure present in users' historical interactions to predict the next item they may like. Building upon the success of Large Language Models (LLMs) in a variety of tasks, researchers have recently explored using LLMs that are pretrained on vast corpora of text for sequential recommendation. To use LLMs in sequential recommendations, both the history of user interactions and the model's prediction of the next item are expressed in text form. We propose CALRec, a two-stage LLM finetuning framework that finetunes a pretrained LLM in a two-tower fashion using a mixture of two contrastive losses and a language modeling loss: the LLM is first finetuned on a data mixture from multiple domains followed by another round of target domain finetuning. Our model significantly outperforms many state-of-the-art baselines (+37% in Recall@1 and +24% in NDCG@10) and systematic ablation studies reveal that (i) both stages of finetuning are crucial, and, when combined, we achieve improved performance, and (ii) contrastive alignment is effective among the target domains explored in our experiments.

5/7/2024

Non-autoregressive Generative Models for Reranking Recommendation

Yuxin Ren, Qiya Yang, Yichun Wu, Wei Xu, Yalong Wang, Zhiqiang Zhang

Contemporary recommendation systems are designed to meet users' needs by delivering tailored lists of items that align with their specific demands or interests. In a multi-stage recommendation system, reranking plays a crucial role by modeling the intra-list correlations among items. The key challenge of reranking lies in the exploration of optimal sequences within the combinatorial space of permutations. Recent research proposes a generator-evaluator learning paradigm, where the generator generates multiple feasible sequences and the evaluator picks out the best sequence based on the estimated listwise score. The generator is of vital importance, and generative models are well-suited for the generator function. Current generative models employ an autoregressive strategy for sequence generation. However, deploying autoregressive models in real-time industrial systems is challenging. To address these issues, we propose a Non-AutoRegressive generative model for reranking Recommendation (NAR4Rec) designed to enhance efficiency and effectiveness. To tackle challenges such as sparse training samples and dynamic candidates, we introduce a matching model. Considering the diverse nature of user feedback, we employ a sequence-level unlikelihood training objective to differentiate feasible sequences from unfeasible ones. Additionally, to overcome the lack of dependency modeling in non-autoregressive models regarding target items, we introduce contrastive decoding to capture correlations among these items. Extensive offline experiments validate the superior performance of NAR4Rec over state-of-the-art reranking methods. Online A/B tests reveal that NAR4Rec significantly enhances the user experience. Furthermore, NAR4Rec has been fully deployed in a popular video app Kuaishou with over 300 million daily active users.

5/21/2024