DoRA: Weight-Decomposed Low-Rank Adaptation

arXiv:2402.09353

Published 6/4/2024 by Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Min-Hung Chen

Abstract

Among the widely used parameter-efficient fine-tuning (PEFT) methods, LoRA and its variants have gained considerable popularity because of avoiding additional inference costs. However, there still often exists an accuracy gap between these methods and full fine-tuning (FT). In this work, we first introduce a novel weight decomposition analysis to investigate the inherent differences between FT and LoRA. Aiming to resemble the learning capacity of FT from the findings, we propose Weight-Decomposed Low-Rank Adaptation (DoRA). DoRA decomposes the pre-trained weight into two components, magnitude and direction, for fine-tuning, specifically employing LoRA for directional updates to efficiently minimize the number of trainable parameters. By employing DoRA, we enhance both the learning capacity and training stability of LoRA while avoiding any additional inference overhead. DoRA consistently outperforms LoRA on fine-tuning LLaMA, LLaVA, and VL-BART on various downstream tasks, such as commonsense reasoning, visual instruction tuning, and image/video-text understanding. Code is available at https://github.com/NVlabs/DoRA.

Overview

  • Introduces a novel weight decomposition analysis to investigate the differences between full fine-tuning (FT) and Low-Rank Adaptation (LoRA)
  • Proposes a new method called Weight-Decomposed Low-Rank Adaptation (DoRA) to enhance the learning capacity and training stability of LoRA
  • DoRA decomposes the pre-trained weight into two components, magnitude and direction, fine-tunes both, and uses LoRA for efficient directional updates
  • DoRA consistently outperforms LoRA on fine-tuning large language models like LLaMA, LLaVA, and VL-BART on various downstream tasks

Plain English Explanation

Among the popular parameter-efficient fine-tuning (PEFT) methods, LoRA and its variants have become widely used because they don't add extra costs during model inference. However, these methods often still have an accuracy gap compared to fully fine-tuning (FT) the entire model.

This research aims to address this gap by first taking a close look at the differences between FT and LoRA. Based on their findings, the researchers propose a new method called Weight-Decomposed Low-Rank Adaptation (DoRA). DoRA splits the pre-trained model weights into two parts: the magnitude (or scale) and the direction. Both parts are fine-tuned: the magnitude is trained directly, while LoRA is used to update the directional component efficiently, which keeps the number of trainable parameters small.

By handling the weights in this way, DoRA aims to come closer to the learning capacity of full fine-tuning while keeping the efficiency advantages of LoRA. The researchers show that DoRA consistently outperforms standard LoRA when fine-tuning large models like LLaMA, LLaVA, and VL-BART on a variety of tasks, such as commonsense reasoning, visual instruction tuning, and joint image/video and text understanding.

Technical Explanation

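The researchers first conduct a novel weight decomposition analysis to investigate the inherent differences between full fine-tuning (FT) and Low-Rank Adaptation (LoRA). Each weight matrix is split into a magnitude and a direction, and the changes in both are tracked during training. They find that the two methods follow distinct update patterns: LoRA tends to change magnitude and direction proportionally, whereas FT shows a more varied pattern in which a large change in one component can accompany a small change in the other.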
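As a rough illustration of this kind of analysis, the NumPy sketch below splits a weight matrix into a per-column magnitude and a unit-norm direction, then measures how far a fine-tuned matrix has moved in each. The function names and the random matrices are hypothetical, and averaging over columns is an assumption about how such a comparison could be summarized. It is not the authors' code.

```python
import numpy as np

def decompose(W):
    """Split W (d_out x d_in) into column-wise magnitude and unit direction."""
    m = np.linalg.norm(W, axis=0, keepdims=True)  # shape (1, d_in)
    V = W / m                                     # each column has unit norm
    return m, V

def delta_magnitude_direction(W0, W_ft):
    """Average magnitude change and average (1 - cosine) direction change."""
    m0, V0 = decompose(W0)
    m1, V1 = decompose(W_ft)
    delta_m = float(np.mean(np.abs(m1 - m0)))
    cos = np.sum(V0 * V1, axis=0)                 # per-column cosine similarity
    delta_d = float(np.mean(1.0 - cos))
    return delta_m, delta_d

# Hypothetical stand-ins for a pre-trained and a fine-tuned weight matrix.
rng = np.random.default_rng(0)
W0 = rng.normal(size=(6, 4))
W_ft = W0 + 0.1 * rng.normal(size=(6, 4))
print(delta_magnitude_direction(W0, W_ft))
```

Plotting the two numbers against each other across layers and training checkpoints is one way to compare how different fine-tuning methods move the weights.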

Aiming to bridge this gap and resemble the learning capacity of FT, the researchers propose Weight-Decomposed Low-Rank Adaptation (DoRA). DoRA decomposes the pre-trained weights into two components - magnitude and direction. It then employs LoRA specifically for the directional updates, in order to efficiently minimize the number of trainable parameters.
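The decomposition described above can be written as W' = m * (W0 + BA) / ||W0 + BA||_c, where W0 is the frozen pre-trained weight, B and A are the low-rank LoRA factors, m is a trainable magnitude vector, and ||.||_c is the norm of each column. The following is a minimal NumPy sketch of that reparameterization, not the authors' implementation. The function names, the rank, and the initialization scale for A are assumptions chosen for illustration.

```python
import numpy as np

def init_dora(W0, rank, rng):
    """Create the trainable DoRA parameters for a frozen weight W0 (d_out x d_in)."""
    d_out, d_in = W0.shape
    m = np.linalg.norm(W0, axis=0, keepdims=True)  # magnitude, shape (1, d_in)
    B = np.zeros((d_out, rank))                    # zero init, so B @ A = 0 at the start
    A = rng.normal(scale=0.01, size=(rank, d_in))  # assumed init, for illustration only
    return m, B, A

def dora_weight(W0, m, B, A):
    """Recombine magnitude and LoRA-updated direction into a full weight matrix."""
    V = W0 + B @ A                                 # directional update via LoRA
    return m * V / np.linalg.norm(V, axis=0, keepdims=True)

rng = np.random.default_rng(0)
W0 = rng.normal(size=(8, 5))
m, B, A = init_dora(W0, rank=2, rng=rng)

# Trainable parameters (m, B, A) versus the size of the frozen matrix.
print(m.size + B.size + A.size, "trainable vs", W0.size, "frozen")
```

Because B starts at zero, the recombined weight equals W0 before any training step, so fine-tuning starts from the pre-trained model. After training, the column norms of the result are set by m alone, while B and A only influence direction.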

By handling the weights in this way, DoRA is able to enhance both the learning capacity and training stability of LoRA, without any additional inference overhead. The researchers evaluate DoRA on fine-tuning large language models like LLaMA, LLaVA, and VL-BART on various downstream tasks. DoRA consistently outperforms standard LoRA across these experiments.
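The claim about inference overhead follows from the fact that the magnitude and the LoRA factors can be folded into a single matrix of the same shape as the original weight. The sketch below checks this numerically under assumed shapes and random values. `merge_dora` and `forward_unmerged` are hypothetical helper names, not part of any library.

```python
import numpy as np

def merge_dora(W0, m, B, A):
    """Fold magnitude and low-rank update into one matrix shaped like W0."""
    V = W0 + B @ A
    return m * V / np.linalg.norm(V, axis=0, keepdims=True)

def forward_unmerged(x, W0, m, B, A):
    """Same output computed with the separate pieces kept apart."""
    V = W0 + B @ A
    scale = m / np.linalg.norm(V, axis=0, keepdims=True)  # shape (1, d_in)
    xs = x * scale                                        # rescale input features
    return xs @ W0.T + (xs @ A.T) @ B.T

rng = np.random.default_rng(1)
W0 = rng.normal(size=(6, 4))                 # frozen weight, d_out x d_in
m = np.abs(rng.normal(size=(1, 4))) + 0.5    # stand-in for a trained magnitude
B = rng.normal(size=(6, 2))                  # stand-in for trained LoRA factors
A = rng.normal(size=(2, 4))
x = rng.normal(size=(3, 4))                  # batch of 3 inputs

W_merged = merge_dora(W0, m, B, A)           # computed once, before deployment
y = x @ W_merged.T                           # inference is a single matmul
print(np.allclose(y, forward_unmerged(x, W0, m, B, A)))
```

After merging, the deployed layer has the same shape and cost as the original one, which is the same property that makes plain LoRA attractive.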

Critical Analysis

The paper provides a thorough analysis of the differences between full fine-tuning (FT) and Low-Rank Adaptation (LoRA), and introduces a novel method (DoRA) to bridge the accuracy gap between these approaches. The weight decomposition analysis offers valuable insights into how these methods update the pre-trained weights.

However, the paper does not discuss potential limitations or caveats of the DoRA method. For example, it is unclear how DoRA would perform on smaller or more challenging datasets, or how it compares to other PEFT methods like Batched Low-Rank Adaptation, mT-LoRA, or AdaFLORA.

Additionally, while the results are promising, the paper does not provide much insight into the underlying reasons for DoRA's improved performance. Further analysis of the learned weights or the optimization dynamics could help explain the sources of these gains.

Conclusion

This research introduces a novel weight decomposition approach called DoRA that enhances the learning capacity and training stability of the popular Low-Rank Adaptation (LoRA) method, while maintaining its efficiency advantages. By decomposing the pre-trained weights into magnitude and direction components, and using LoRA only for the directional updates, DoRA is able to consistently outperform standard LoRA on fine-tuning large language models across a variety of downstream tasks.

This work represents an important step forward in parameter-efficient fine-tuning, offering a more effective way to adapt pre-trained models to specific applications without incurring significant computational overhead. The insights from the weight decomposition analysis could also inform the development of other PEFT techniques in the future.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

DoRA: Enhancing Parameter-Efficient Fine-Tuning with Dynamic Rank Distribution

Yulong Mao, Kaiyu Huang, Changhao Guan, Ganglin Bao, Fengran Mo, Jinan Xu

Fine-tuning large-scale pre-trained models is inherently a resource-intensive task. While it can enhance the capabilities of the model, it also incurs substantial computational costs, posing challenges to the practical application of downstream tasks. Existing parameter-efficient fine-tuning (PEFT) methods such as Low-Rank Adaptation (LoRA) rely on a bypass framework that ignores the differential parameter budget requirements across weight matrices, which may lead to suboptimal fine-tuning outcomes. To address this issue, we introduce the Dynamic Low-Rank Adaptation (DoRA) method. DoRA decomposes high-rank LoRA layers into structured single-rank components, allowing for dynamic pruning of parameter budget based on their importance to specific tasks during training, which makes the most of the limited parameter budget. Experimental results demonstrate that DoRA can achieve competitive performance compared with LoRA and full model fine-tuning, and outperform various strong baselines with the same storage parameter budget. Our code is available at https://github.com/MIkumikumi0116/DoRA

5/29/2024

ALoRA: Allocating Low-Rank Adaptation for Fine-tuning Large Language Models

Zequan Liu, Jiawen Lyn, Wei Zhu, Xing Tian, Yvette Graham

Parameter-efficient fine-tuning (PEFT) is widely studied for its effectiveness and efficiency in the era of large language models. Low-rank adaptation (LoRA) has demonstrated commendable performance as a popular and representative method. However, it is implemented with a fixed intrinsic rank that might not be the ideal setting for the downstream tasks. Recognizing the need for more flexible downstream task adaptation, we extend the methodology of LoRA to an innovative approach we call allocating low-rank adaptation (ALoRA) that enables dynamic adjustments to the intrinsic rank during the adaptation process. First, we propose a novel method, AB-LoRA, that can effectively estimate the importance score of each LoRA rank. Second, guided by AB-LoRA, we gradually prune abundant and negatively impacting LoRA ranks and allocate the pruned LoRA budgets to important Transformer modules needing higher ranks. We have conducted experiments on various tasks, and the experimental results demonstrate that our ALoRA method can outperform the recent baselines with comparable tunable parameters.

4/16/2024

Computational Limits of Low-Rank Adaptation (LoRA) for Transformer-Based Models

Jerry Yao-Chieh Hu, Maojiang Su, En-Jui Kuo, Zhao Song, Han Liu

We study the computational limits of Low-Rank Adaptation (LoRA) update for finetuning transformer-based models using fine-grained complexity theory. Our key observation is that the existence of low-rank decompositions within the gradient computation of LoRA adaptation leads to possible algorithmic speedup. This allows us to (i) identify a phase transition behavior and (ii) prove the existence of nearly linear algorithms by controlling the LoRA update computation term by term, assuming the Strong Exponential Time Hypothesis (SETH). For the former, we identify a sharp transition in the efficiency of all possible rank-$r$ LoRA update algorithms for transformers, based on specific norms resulting from the multiplications of the input sequence $\mathbf{X}$, pretrained weights $\mathbf{W}^\star$, and adapter matrices $\alpha \mathbf{B} \mathbf{A} / r$. Specifically, we derive a shared upper bound threshold for such norms and show that efficient (sub-quadratic) approximation algorithms of LoRA exist only below this threshold. For the latter, we prove the existence of nearly linear approximation algorithms for LoRA adaptation by utilizing the hierarchical low-rank structures of LoRA gradients and approximating the gradients with a series of chained low-rank approximations. To showcase our theory, we consider two practical scenarios: partial (e.g., only $\mathbf{W}_V$ and $\mathbf{W}_Q$) and full adaptations (e.g., $\mathbf{W}_Q$, $\mathbf{W}_V$, and $\mathbf{W}_K$) of weights in attention heads.

6/6/2024

LoRA Learns Less and Forgets Less

Dan Biderman, Jose Gonzalez Ortiz, Jacob Portes, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, John P. Cunningham

Low-Rank Adaptation (LoRA) is a widely-used parameter-efficient finetuning method for large language models. LoRA saves memory by training only low rank perturbations to selected weight matrices. In this work, we compare the performance of LoRA and full finetuning on two target domains, programming and mathematics. We consider both the instruction finetuning ($\approx$100K prompt-response pairs) and continued pretraining ($\approx$10B unstructured tokens) data regimes. Our results show that, in most settings, LoRA substantially underperforms full finetuning. Nevertheless, LoRA exhibits a desirable form of regularization: it better maintains the base model's performance on tasks outside the target domain. We show that LoRA provides stronger regularization compared to common techniques such as weight decay and dropout; it also helps maintain more diverse generations. We show that full finetuning learns perturbations with a rank that is 10-100X greater than typical LoRA configurations, possibly explaining some of the reported gaps. We conclude by proposing best practices for finetuning with LoRA.

5/17/2024