Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods
Overview
- The paper explores techniques for "pruning" or compressing modern large language models (LLMs) to reduce their high computational needs.
- It compares two main approaches: "width pruning" (reducing the size of projection weight matrices) and "depth pruning" (removing entire layers or blocks).
- The key finding is that simple depth pruning can effectively compress LLMs while achieving comparable or superior performance to recent width pruning methods.
Plain English Explanation
Large language models (LLMs) such as LLaMA and GPT-3 are powerful, but they require a lot of computing power to run. This makes them difficult to deploy in resource-constrained environments like mobile devices. To address this, researchers have been exploring ways to "prune" or compress these models without losing too much performance.
The paper looks at two main approaches to pruning:
- Width pruning: This involves reducing the size of the weight matrices used in the model, for example by removing some of the attention heads. This can shrink the model's overall size, but the number of layers remains the same.
- Depth pruning: This method removes entire layers or blocks from the model, while keeping the remaining weights unchanged. This can achieve more dramatic model compression.
The key finding is that simple depth pruning can be just as effective as more complex width pruning techniques, and in some cases even better. Depth pruning was especially helpful for improving inference speed (how quickly the model can make predictions) when memory is limited and the model has to run with small batch sizes.
When retraining the pruned models to recover performance, the researchers found that continued pretraining on a large corpus was much more effective than LoRA (low-rank adaptation, a lightweight fine-tuning technique that trains only small added matrices), particularly for models that were heavily pruned.
Overall, this work suggests that depth pruning could be a simpler and more effective way to build compact yet capable language models, especially for deployment on resource-constrained devices.
Technical Explanation
The paper explores two main approaches to structured pruning of large language models (LLMs):
- Width pruning: This involves reducing the size of the projection weight matrices, such as by removing attention heads. The number of layers in the model remains unchanged.
- Depth pruning: This method removes entire layers or blocks from the model, while keeping the size of the remaining weights unchanged. This can achieve more dramatic model compression.
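To make the difference between the two approaches concrete, here is a minimal sketch that counts transformer-block parameters for a toy decoder-only model. The ToyConfig class, its default sizes, the gated feed-forward layout, and the helper names are all illustrative assumptions, not taken from the paper or any library. Embeddings and normalization layers are ignored.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyConfig:
    """Hypothetical decoder-only configuration; the sizes are illustrative only."""
    n_layers: int = 32
    d_model: int = 4096
    n_heads: int = 32
    head_dim: int = 128
    d_ff: int = 11008


def block_params(cfg: ToyConfig) -> int:
    """Weights in one transformer block (biases and norms ignored)."""
    # Attention: Q, K, V and output projections, each d_model x (n_heads * head_dim).
    attn = 4 * cfg.d_model * cfg.n_heads * cfg.head_dim
    # Feed-forward: assumed to be a gated MLP with three d_model x d_ff matrices.
    ffn = 3 * cfg.d_model * cfg.d_ff
    return attn + ffn


def total_block_params(cfg: ToyConfig) -> int:
    """Weights across all blocks."""
    return cfg.n_layers * block_params(cfg)


def width_prune(cfg: ToyConfig, heads_to_remove: int, ff_to_remove: int) -> ToyConfig:
    """Width pruning: every block gets narrower, the layer count is unchanged."""
    return replace(cfg,
                   n_heads=cfg.n_heads - heads_to_remove,
                   d_ff=cfg.d_ff - ff_to_remove)


def depth_prune(cfg: ToyConfig, blocks_to_remove: int) -> ToyConfig:
    """Depth pruning: whole blocks are dropped, remaining blocks keep their size."""
    return replace(cfg, n_layers=cfg.n_layers - blocks_to_remove)


if __name__ == "__main__":
    base = ToyConfig()
    narrow = width_prune(base, heads_to_remove=8, ff_to_remove=2752)
    shallow = depth_prune(base, blocks_to_remove=8)
    for name, cfg in [("base", base), ("width-pruned", narrow), ("depth-pruned", shallow)]:
        print(name, cfg.n_layers, "layers,", total_block_params(cfg), "block parameters")
```

With these assumed sizes, removing a quarter of the heads and feed-forward units leaves the same number of block parameters as removing a quarter of the blocks. The two pruned models differ only in shape: one keeps all 32 layers but makes each narrower, the other keeps 24 full-size layers.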
Most prior research has focused on width-only pruning or a blend of width and depth pruning, with little comparative analysis between the two.
In this work, the authors show that simple depth pruning can effectively compress LLMs while achieving comparable or superior performance to recent width pruning studies. Their depth pruning method boosts inference speeds, especially under memory-constrained conditions that require limited batch sizes, where width pruning is less effective.
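One possible intuition for the batch-size effect can be sketched with a toy latency model. This is an illustrative assumption, not the paper's analysis or measurements. The function name, the cost constants, and the cost structure (a fixed per-block overhead plus the larger of a weight-read cost and a compute cost) are all hypothetical.

```python
def decode_step_latency_ms(n_layers: int,
                           width_fraction: float,
                           batch_size: int,
                           fixed_overhead_ms: float = 0.20,
                           weight_read_ms: float = 0.10,
                           compute_ms_per_sample: float = 0.02) -> float:
    """Toy per-token latency: blocks run one after another, each with a fixed cost.

    All constants are made-up values chosen only to illustrate the trend.
    """
    # Reading the block's weights scales with block width, not with batch size.
    memory = weight_read_ms * width_fraction
    # Arithmetic scales with both block width and batch size.
    compute = compute_ms_per_sample * batch_size * width_fraction
    # The fixed overhead (assumed: kernel launches, synchronization) is paid
    # once per block and does not shrink when the block gets narrower.
    per_block = fixed_overhead_ms + max(memory, compute)
    return n_layers * per_block


if __name__ == "__main__":
    for batch in (1, 64):
        base = decode_step_latency_ms(32, 1.0, batch)
        width = decode_step_latency_ms(32, 0.75, batch)   # 25% narrower blocks
        depth = decode_step_latency_ms(24, 1.0, batch)    # 25% fewer blocks
        print(f"batch={batch}: width/base={width / base:.3f}, depth/base={depth / base:.3f}")
```

In this toy model, removing a quarter of the blocks removes a quarter of the latency at any batch size. Narrowing every block by a quarter helps less when the batch is small, because the fixed per-block cost then makes up a larger share of the total.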
When retraining the pruned models to recover performance, the authors found that continued pretraining on a large corpus markedly outperforms LoRA-based tuning, particularly at severe pruning ratios.
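For a sense of scale between the two retraining options, the sketch below compares trainable parameters for one weight matrix under full training (as in continued pretraining, which updates every weight) and under LoRA, which freezes the matrix and trains a rank-r pair of smaller matrices. The function names and the 4096 x 4096, rank-8 example are illustrative assumptions, not settings reported in the source.

```python
def full_trainable_params(d_in: int, d_out: int) -> int:
    """Full training updates every entry of the d_out x d_in weight matrix."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA keeps W frozen and trains B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    d_in = d_out = 4096   # assumed projection size
    rank = 8              # assumed LoRA rank
    full = full_trainable_params(d_in, d_out)
    lora = lora_trainable_params(d_in, d_out, rank)
    print(f"full: {full} trainable, LoRA: {lora} trainable, ratio: {full // lora}x")
```

The low-rank update has far fewer trainable parameters. That makes LoRA cheap to run, but it also limits how much a single matrix can change during retraining.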
Critical Analysis
The paper provides a thorough evaluation of depth pruning as an alternative to more commonly studied width pruning techniques. The authors acknowledge that their depth pruning method is a relatively simple approach, but argue that it can be just as effective as more complex methods.
One potential limitation of the research is that it primarily focuses on comparing depth and width pruning, without exploring more advanced pruning strategies that combine the two techniques, as mentioned in related work. It would be interesting to see how a hybrid approach might perform.
Additionally, the paper does not delve into the theoretical reasons why depth pruning may be more effective than width pruning in certain scenarios, such as memory-constrained inference. Further analysis of the underlying mechanisms could provide valuable insights.
Overall, the research makes a compelling case for depth pruning as a simple yet powerful technique for compressing large language models. The findings could have important implications for deploying these models in resource-constrained environments, and the authors' code and models provide a useful starting point for further exploration.
Conclusion
This paper demonstrates that simple depth pruning can be an effective and efficient way to compress large language models, often outperforming more complex width pruning techniques. The depth pruning method boosts inference speeds, especially in memory-constrained conditions, and continued pretraining is shown to be more effective than LoRA-based tuning for recovering model performance after pruning.
These findings suggest that depth pruning could be a promising approach for building compact yet capable language models, particularly for deployment on resource-constrained devices. The work contributes to the ongoing efforts to make large language models more accessible and practical for a wider range of applications.
This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
Related Papers
A deeper look at depth pruning of LLMs
Shoaib Ahmed Siddiqui, Xin Dong, Greg Heinrich, Thomas Breuel, Jan Kautz, David Krueger, Pavlo Molchanov
Large Language Models (LLMs) are not only resource-intensive to train but even more costly to deploy in production. Therefore, recent work has attempted to prune blocks of LLMs based on cheap proxies for estimating block importance, effectively removing 10% of blocks in well-trained LLaMa-2 and Mistral 7b models without any significant degradation of downstream metrics. In this paper, we explore different block importance metrics by considering adaptive metrics such as Shapley value in addition to static ones explored in prior work. We show that adaptive metrics exhibit a trade-off in performance between tasks, i.e., improvement on one task may degrade performance on the other due to differences in the computed block influences. Furthermore, we extend this analysis from a complete block to individual self-attention and feed-forward layers, highlighting the propensity of the self-attention layers to be more amenable to pruning, even allowing removal of up to 33% of the self-attention layers without incurring any performance degradation on MMLU for Mistral 7b (significant reduction in costly maintenance of KV-cache). Finally, we look at simple performance recovery techniques to emulate the pruned layers by training lightweight additive bias or low-rank linear adapters. Performance recovery using emulated updates avoids performance degradation for the initial blocks (up to 5% absolute improvement on MMLU), which is either competitive or superior to the learning-based technique.
7/24/2024
Pruning Foundation Models for High Accuracy without Retraining
Pu Zhao, Fei Sun, Xuan Shen, Pinrui Yu, Zhenglun Kong, Yanzhi Wang, Xue Lin
Despite the superior performance, it is challenging to deploy foundation models or large language models (LLMs) due to their massive parameters and computations. While pruning is a promising technique to reduce model size and accelerate the inference, the traditional pruning techniques can hardly be applied for LLMs as they need to finetune the model on the full dataset with multiple epochs consuming massive data and hardware resources. To deal with this problem, post-training pruning methods are proposed to prune LLMs in one-shot without retraining. However, their accuracy after pruning may suffer from certain performance degradation due to the lack of retraining with massive data. To address this issue, in this paper, we first formulate the post-training problem for layer-wise LLM compression to simultaneously prune multiple weights in LLMs. Next, we provide an optimal solution for this problem and design our post-training pruning algorithm for both unstructured and semi-structured sparsity. Our extensive experiments demonstrate the superior performance of the proposed methods in comparison to SOTA baselines across various LLM families including transformer-based LLMs and Mamba-based LLMs. Code link: https://github.com/piuzha/APT
10/22/2024
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen
The popularity of LLaMA (Touvron et al., 2023a;b) and other recently emerged moderate-sized large language models (LLMs) highlights the potential of building smaller yet powerful LLMs. Regardless, the cost of training such models from scratch on trillions of tokens remains high. In this work, we study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models. Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains. We demonstrate the efficacy of our approach by presenting the Sheared-LLaMA series, pruning the LLaMA2-7B model down to 1.3B and 2.7B parameters. Sheared-LLaMA models outperform state-of-the-art open-source models of equivalent sizes, such as Pythia, INCITE, OpenLLaMA and the concurrent TinyLlama models, on a wide range of downstream and instruction tuning evaluations, while requiring only 3% of compute compared to training such models from scratch. This work provides compelling evidence that leveraging existing LLMs with structured pruning is a far more cost-effective approach for building competitive small-scale LLMs.
4/12/2024
Compact Language Models via Pruning and Knowledge Distillation
Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov
Large language models (LLMs) targeting different deployment scales and sizes are currently produced by training each variant from scratch; this is extremely compute-intensive. In this paper, we investigate if pruning an existing LLM and then re-training it with a fraction (<3%) of the original training data can be a suitable alternative to repeated, full retraining. To this end, we develop a set of practical and effective compression best practices for LLMs that combine depth, width, attention and MLP pruning with knowledge distillation-based retraining; we arrive at these best practices through a detailed empirical exploration of pruning strategies for each axis, methods to combine axes, distillation strategies, and search techniques for arriving at optimal compressed architectures. We use this guide to compress the Nemotron-4 family of LLMs by a factor of 2-4x, and compare their performance to similarly-sized models on a variety of language modeling tasks. Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch; this results in compute cost savings of 1.8x for training the full model family (15B, 8B, and 4B). Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch, perform comparably to other community models such as Mistral 7B, Gemma 7B and Llama-3 8B, and outperform state-of-the-art compression techniques from the literature. We have open-sourced Minitron model weights on Huggingface, with corresponding supplementary material including example code available on GitHub.
11/5/2024