Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences

2404.03715

Published 4/8/2024 by Corby Rosset, Ching-An Cheng, Arindam Mitra, Michael Santacroce, Ahmed Awadallah, Tengyang Xie
Abstract

This paper studies post-training large language models (LLMs) using preference feedback from a powerful oracle to help a model iteratively improve over itself. The typical approach for post-training LLMs involves Reinforcement Learning from Human Feedback (RLHF), which traditionally separates reward learning and subsequent policy optimization. However, such a reward maximization approach is limited by the nature of point-wise rewards (such as Bradley-Terry model), which fails to express complex intransitive or cyclic preference relations. While advances on RLHF show reward learning and policy optimization can be merged into a single contrastive objective for stability, they yet still remain tethered to the reward maximization framework. Recently, a new wave of research sidesteps the reward maximization presumptions in favor of directly optimizing over pair-wise or general preferences. In this paper, we introduce Direct Nash Optimization (DNO), a provable and scalable algorithm that marries the simplicity and stability of contrastive learning with theoretical generality from optimizing general preferences. Because DNO is a batched on-policy algorithm using a regression-based objective, its implementation is straightforward and efficient. Moreover, DNO enjoys monotonic improvement across iterations that help it improve even over a strong teacher (such as GPT-4). In our experiments, a resulting 7B parameter Orca-2.5 model aligned by DNO achieves the state-of-the-art win-rate against GPT-4-Turbo of 33% on AlpacaEval 2.0 (even after controlling for response length), an absolute gain of 26% (7% to 33%) over the initializing model. It outperforms models with far more parameters, including Mistral Large, Self-Rewarding LM (70B parameters), and older versions of GPT-4.

Overview

  • This paper introduces a novel approach called "Direct Nash Optimization" (DNO) for teaching language models to self-improve with general preferences.
  • The key idea is to treat post-training as a two-player game defined by a general preference function: a powerful oracle (such as GPT-4) judges which of two responses is better, and the model is trained toward a policy whose responses are preferred over those of any competing policy, i.e. a Nash equilibrium.
  • Unlike standard Reinforcement Learning from Human Feedback (RLHF), which first learns a point-wise reward (such as a Bradley-Terry model) and then maximizes it, DNO optimizes pair-wise preferences directly, so it can express intransitive or cyclic preference relations.

Plain English Explanation

The researchers have developed a new method called "Direct Nash Optimization" (DNO) that helps a language model get better over successive rounds of training, using preference feedback from a stronger model such as GPT-4.

The basic idea is to set up a "game" in which responses compete against each other. In each round, the current model generates responses to a batch of prompts, and a preference oracle judges which response in a pair is better. The model is then trained to favor the preferred responses, and the next round starts from the improved model.

This is different from standard Reinforcement Learning from Human Feedback (RLHF), which first compresses preferences into a single numerical score per response and then trains the model to maximize that score. A single score cannot represent preferences that go in a circle (A beats B, B beats C, C beats A), whereas comparing responses directly can.

The key advantage of this approach is that the language model keeps improving from one iteration to the next. According to the abstract, this monotonic improvement lets the model improve even over a strong teacher such as GPT-4.

Technical Explanation

The paper introduces a new training algorithm called "Direct Nash Optimization" (DNO) that aims to teach language models to self-improve according to general preferences, rather than by maximizing a point-wise reward (such as one fit with a Bradley-Terry model), which cannot express intransitive or cyclic preference relations.
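The abstract's point about point-wise rewards can be checked on a toy case. The sketch below is illustrative only and is not taken from the paper: the three responses `A`, `B`, `C`, the preference probabilities, and the helper `find_bt_scores` are all hypothetical. It searches a small grid of scalar rewards for an assignment whose Bradley-Terry predictions agree with a given set of pair-wise preferences.

```python
import itertools
import math


def bt_prob(r_a, r_b):
    """Bradley-Terry probability that a is preferred to b, given scalar rewards."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))


def find_bt_scores(prefs, grid):
    """Search a small grid for rewards whose Bradley-Terry predictions pick the
    same winner as every entry of prefs, where prefs[(a, b)] = P(a preferred to b).
    Returns a dict of rewards, or None if no grid point works."""
    items = sorted({x for pair in prefs for x in pair})
    for values in itertools.product(grid, repeat=len(items)):
        scores = dict(zip(items, values))
        if all((bt_prob(scores[a], scores[b]) > 0.5) == (p > 0.5)
               for (a, b), p in prefs.items()):
            return scores
    return None


GRID = [-2, -1, 0, 1, 2]  # hypothetical candidate reward values

# Transitive preferences: A beats B, B beats C, A beats C.
transitive = {("A", "B"): 0.9, ("B", "C"): 0.9, ("A", "C"): 0.9}
# Cyclic preferences: A beats B, B beats C, C beats A.
cyclic = {("A", "B"): 0.9, ("B", "C"): 0.9, ("C", "A"): 0.9}

transitive_scores = find_bt_scores(transitive, GRID)
cyclic_scores = find_bt_scores(cyclic, GRID)
print("transitive:", transitive_scores)
print("cyclic:", cyclic_scores)
```

The grid search is only a demonstration, but the failure on the cyclic case is not an artifact of the grid: a Bradley-Terry fit would need rewards with r_A > r_B, r_B > r_C, and r_C > r_A at the same time, which no real numbers satisfy.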

The core idea is to formulate post-training as a two-player game over a general preference function, which gives the probability that one response is preferred to another for a given prompt. The target is the Nash equilibrium of this game: a policy whose responses are preferred at least half the time against those of any other policy. DNO approaches this equilibrium iteratively, using preference feedback from a powerful oracle on the current model's own outputs.
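To make the notion of a Nash equilibrium over preferences concrete, here is a toy sketch. It is not the DNO algorithm: it uses a textbook multiplicative-weights self-play update on a hypothetical 3x3 preference matrix `P`, where `P[i][j]` is the probability that response `i` is preferred to response `j`. The step size, step count, and starting policy are arbitrary illustrative choices.

```python
import math

# Hypothetical cyclic preferences: 0 beats 1, 1 beats 2, 2 beats 0.
# P[i][j] + P[j][i] = 1, so the game is symmetric and constant-sum.
P = [
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
]


def win_rates(P, policy):
    """Expected probability that each response beats a draw from policy."""
    n = len(policy)
    return [sum(P[i][j] * policy[j] for j in range(n)) for i in range(n)]


def self_play_nash(P, init, steps=20000, eta=0.02):
    """Multiplicative-weights self-play; returns the time-averaged policy,
    which approximates a Nash equilibrium of the preference game."""
    policy = list(init)
    avg = [0.0] * len(policy)
    for _ in range(steps):
        for i, p in enumerate(policy):
            avg[i] += p / steps
        # Reweight each response by its win rate against the current policy.
        rates = win_rates(P, policy)
        weights = [p * math.exp(eta * r) for p, r in zip(policy, rates)]
        total = sum(weights)
        policy = [w / total for w in weights]
    return avg


nash = self_play_nash(P, init=[0.5, 0.3, 0.2])
# How much better than a 50% win rate the best fixed response does against nash.
exploitability = max(win_rates(P, nash)) - 0.5
print("approximate Nash policy:", [round(p, 3) for p in nash])
print("exploitability:", round(exploitability, 4))
```

For this cyclic example no single response is best, so the equilibrium is a mixture that plays all three responses roughly equally; a scalar reward has no way to single out that mixture, while the game formulation does.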

According to the abstract, this approach has several advantages over standard Reinforcement Learning from Human Feedback (RLHF) pipelines. First, it optimizes general preferences directly rather than remaining tethered to reward maximization. Second, because DNO is a batched on-policy algorithm with a regression-based objective that keeps the simplicity and stability of contrastive learning, its implementation is straightforward and efficient. Third, it improves monotonically across iterations, which helps the model improve even over a strong teacher such as GPT-4.
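The abstract describes the objective only as regression-based and contrastive, so the following is a stand-in rather than DNO's exact loss: a minimal sketch of the well-known DPO-style pair-wise logistic loss, which is one common contrastive objective over a preferred and a dis-preferred response. The function name, the `beta` value, and the log-probabilities in the usage lines are hypothetical.

```python
import math


def contrastive_pair_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """DPO-style loss for one preference pair.

    logp_*     : log-probability of the response under the policy being trained
    ref_logp_* : log-probability of the same response under a frozen reference policy
    Returns -log(sigmoid(margin)), computed in a numerically stable way.
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: zero margin, so the loss is log(2).
untrained = contrastive_pair_loss(-12.0, -12.0, -12.0, -12.0)
# Policy raises the preferred response and lowers the dis-preferred one.
improved = contrastive_pair_loss(-10.0, -14.0, -12.0, -12.0)
# Policy moves in the wrong direction.
worse = contrastive_pair_loss(-14.0, -10.0, -12.0, -12.0)
print(untrained, improved, worse)
```

In a batched on-policy setting, pairs like these would be rebuilt each iteration from the current model's own samples and the oracle's judgments, with the loss averaged over the batch.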

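Theoretically, the authors describe DNO as provable and state that it enjoys monotonic improvement across iterations. Empirically, a 7B parameter Orca-2.5 model aligned by DNO reaches a 33% win-rate against GPT-4-Turbo on AlpacaEval 2.0 (after controlling for response length), an absolute gain of 26% over the initializing model's 7%.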

Critical Analysis

The paper presents a compelling and well-grounded framework for teaching language models to self-improve according to general preferences. The theoretical analysis and empirical results suggest that DNO has the potential to be a more flexible and scalable alternative to RLHF.

However, several limitations and areas for further research remain. For example, the approach depends on a fixed preference oracle (such as GPT-4), whereas in practice the preferences being targeted may need to adapt and evolve over time. Additionally, the performance of DNO may depend on the specific architecture and training procedures used for the language model.

Another potential concern is the risk of oracle misspecification, where the preferences expressed by the oracle may not fully align with the desired outcomes. This could lead to unintended consequences or behaviors from the language model. Careful monitoring and evaluation of the oracle's judgments would be crucial in such cases.

Finally, the computational and resource requirements of the DNO approach, which repeatedly samples responses from the current model and queries an oracle for preferences, could be a practical concern for deploying these systems at scale. Techniques like Online Control and Adaptive Large Neighborhood Search may be helpful in addressing these challenges.

Conclusion

The "Direct Nash Optimization" framework introduced in this paper represents a significant step forward in teaching language models to self-improve according to general preferences. By formulating post-training as a game over general preferences and solving it with a simple contrastive objective, the authors have developed a more flexible and scalable approach than reward-maximization techniques like RLHF.

While the paper highlights several promising theoretical and empirical results, important limitations and areas for further research remain. Careful attention to the choice of preference oracle, computational efficiency, and potential unintended consequences will be crucial as this approach is further developed and deployed in real-world applications.

Overall, the DNO framework is a valuable contribution to the field of language model optimization, and it will be exciting to see how it evolves and is applied to address the growing demand for highly capable and aligned AI systems.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Robust Preference Optimization through Reward Model Distillation

Adam Fisch, Jacob Eisenstein, Vicky Zayats, Alekh Agarwal, Ahmad Beirami, Chirag Nagpal, Pete Shaw, Jonathan Berant

Language model (LM) post-training (or alignment) involves maximizing a reward function that is derived from preference annotations. Direct Preference Optimization (DPO) is a popular offline alignment method that trains a policy directly on preference data without the need to train a reward model or apply reinforcement learning. However, typical preference datasets have only a single, or at most a few, annotation per preference pair, which causes DPO to overconfidently assign rewards that trend towards infinite magnitude. This frequently leads to degenerate policies, sometimes causing even the probabilities of the preferred generations to go to zero. In this work, we analyze this phenomenon and propose distillation to get a better proxy for the true preference distribution over generation pairs: we train the LM to produce probabilities that match the distribution induced by a reward model trained on the preference data. Moreover, to account for uncertainty in the reward model we are distilling from, we optimize against a family of reward models that, as a whole, is likely to include at least one reasonable proxy for the preference distribution. Our results show that distilling from such a family of reward models leads to improved robustness to distribution shift in preference annotations, while preserving the simple supervised nature of DPO.

5/30/2024

Would I Lie To You? Inference Time Alignment of Language Models using Direct Preference Heads

Avelina Asada Hadji-Kyriacou, Ognjen Arandjelovic

Pre-trained Language Models (LMs) exhibit strong zero-shot and in-context learning capabilities; however, their behaviors are often difficult to control. By utilizing Reinforcement Learning from Human Feedback (RLHF), it is possible to fine-tune unsupervised LMs to follow instructions and produce outputs that reflect human preferences. Despite its benefits, RLHF has been shown to potentially harm a language model's reasoning capabilities and introduce artifacts such as hallucinations where the model may fabricate facts. To address this issue we introduce Direct Preference Heads (DPH), a fine-tuning framework that enables LMs to learn human preference signals through an auxiliary reward head without directly affecting the output distribution of the language modeling head. We perform a theoretical analysis of our objective function and find strong ties to Conservative Direct Preference Optimization (cDPO). Finally we evaluate our models on GLUE, RACE, and the GPT4All evaluation suite and demonstrate that our method produces models which achieve higher scores than those fine-tuned with Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO) alone.

5/31/2024

Offline Regularised Reinforcement Learning for Large Language Models Alignment

Pierre Harvey Richemond, Yunhao Tang, Daniel Guo, Daniele Calandriello, Mohammad Gheshlaghi Azar, Rafael Rafailov, Bernardo Avila Pires, Eugene Tarassov, Lucas Spangher, Will Ellsworth, Aliaksei Severyn, Jonathan Mallinson, Lior Shani, Gil Shamir, Rishabh Joshi, Tianqi Liu, Remi Munos, Bilal Piot

The dominant framework for alignment of large language models (LLM), whether through reinforcement learning from human feedback or direct preference optimisation, is to learn from preference data. This involves building datasets where each element is a quadruplet composed of a prompt, two independent responses (completions of the prompt) and a human preference between the two independent responses, yielding a preferred and a dis-preferred response. Such data is typically scarce and expensive to collect. On the other hand, single-trajectory datasets where each element is a triplet composed of a prompt, a response and a human feedback is naturally more abundant. The canonical element of such datasets is for instance an LLM's response to a user's prompt followed by a user's feedback such as a thumbs-up/down. Consequently, in this work, we propose DRO, or Direct Reward Optimisation, as a framework and associated algorithms that do not require pairwise preferences. DRO uses a simple mean-squared objective that can be implemented in various ways. We validate our findings empirically, using T5 encoder-decoder language models, and show DRO's performance over selected baselines such as Kahneman-Tversky Optimization (KTO). Thus, we confirm that DRO is a simple and empirically compelling method for single-trajectory policy optimisation.

5/30/2024

Hybrid Preference Optimization: Augmenting Direct Preference Optimization with Auxiliary Objectives

Anirudhan Badrinath, Prabhat Agarwal, Jiajing Xu

For aligning large language models (LLMs), prior work has leveraged reinforcement learning via human feedback (RLHF) or variations of direct preference optimization (DPO). While DPO offers a simpler framework based on maximum likelihood estimation, it compromises on the ability to tune language models to easily maximize non-differentiable and non-binary objectives according to the LLM designer's preferences (e.g., using simpler language or minimizing specific kinds of harmful content). These may neither align with user preferences nor even be able to be captured tractably by binary preference data. To leverage the simplicity and performance of DPO with the generalizability of RL, we propose a hybrid approach between DPO and RLHF. With a simple augmentation to the implicit reward decomposition of DPO, we allow for tuning LLMs to maximize a set of arbitrary auxiliary rewards using offline RL. The proposed method, Hybrid Preference Optimization (HPO), shows the ability to effectively generalize to both user preferences and auxiliary designer objectives, while preserving alignment performance across a range of challenging benchmarks and model sizes.

5/31/2024