A Simple and Effective Pruning Approach for Large Language Models

2306.11695

Published 5/7/2024 by Mingjie Sun, Zhuang Liu, Anna Bair, J. Zico Kolter

Abstract

As their size increases, Large Language Models (LLMs) are natural candidates for network pruning methods: approaches that drop a subset of network weights while striving to preserve performance. Existing methods, however, require either retraining, which is rarely affordable for billion-scale LLMs, or solving a weight reconstruction problem reliant on second-order information, which may also be computationally expensive. In this paper, we introduce a novel, straightforward yet effective pruning method, termed Wanda (Pruning by Weights and activations), designed to induce sparsity in pretrained LLMs. Motivated by the recent observation of emergent large magnitude features in LLMs, our approach prunes weights with the smallest magnitudes multiplied by the corresponding input activations, on a per-output basis. Notably, Wanda requires no retraining or weight update, and the pruned LLM can be used as is. We conduct a thorough evaluation of our method Wanda on LLaMA and LLaMA-2 across various language benchmarks. Wanda significantly outperforms the established baseline of magnitude pruning and performs competitively against recent methods involving intensive weight updates. Code is available at https://github.com/locuslab/wanda.

Overview

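  • Proposes Wanda (Pruning by Weights and activations), a simple and effective pruning approach for pretrained large language models
  • Scores each weight by its magnitude multiplied by the corresponding input activation and prunes on a per-output basis, with no retraining or weight update
  • Evaluates the approach on LLaMA and LLaMA-2, where it outperforms magnitude pruning and is competitive with methods that rely on intensive weight updates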

Plain English Explanation

The paper presents a novel pruning technique for large language models, which are complex AI systems trained on massive amounts of text data to perform tasks like natural language processing and generation. Pruning is the process of removing unnecessary connections or parameters from a trained model to reduce its size and inference time, while maintaining its performance.

The authors' approach, called Wanda, is designed to be simple and effective. It ranks each weight by its magnitude multiplied by the corresponding input activation and, for each output, removes the lowest-ranked weights. The pruned model needs no retraining or weight update and can be used as is.

The researchers evaluate their pruning method on LLaMA and LLaMA-2 across various language benchmarks. The results show that Wanda significantly outperforms the established baseline of magnitude pruning and performs competitively against recent methods that involve intensive weight updates.
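
To make the idea of pruning concrete, here is a minimal sketch of the magnitude-pruning baseline named in the abstract: the weights with the smallest absolute values are set to zero. The function names `magnitude_prune` and `sparsity_of`, the toy matrix, and the choice to rank weights across the whole matrix are assumptions made for illustration, not code from the paper.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of entries with the smallest |w|.

    `weights` is a list of rows; entries are ranked across the whole matrix.
    This is an illustrative baseline, not the authors' implementation.
    """
    # Collect (|w|, row, col) for every entry and sort by magnitude.
    ranked = sorted(
        (abs(w), r, c)
        for r, row in enumerate(weights)
        for c, w in enumerate(row)
    )
    n_prune = int(len(ranked) * sparsity)
    pruned = [list(row) for row in weights]  # copy, leave the input untouched
    for _, r, c in ranked[:n_prune]:
        pruned[r][c] = 0.0
    return pruned


def sparsity_of(weights):
    """Fraction of entries that are exactly zero."""
    total = sum(len(row) for row in weights)
    zeros = sum(1 for row in weights for w in row if w == 0.0)
    return zeros / total


# Toy 2x4 weight matrix (hypothetical values).
W = [[0.9, -0.1, 0.4, 0.05],
     [-0.3, 0.8, -0.02, 0.6]]
W_pruned = magnitude_prune(W, 0.5)
print(W_pruned)               # [[0.9, 0.0, 0.4, 0.0], [0.0, 0.8, 0.0, 0.6]]
print(sparsity_of(W_pruned))  # 0.5
```

Note that this baseline looks only at the weights themselves; it has no notion of how large the inputs multiplying each weight tend to be.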

Technical Explanation

The paper proposes a pruning approach that aims to preserve the most important connections or parameters in the language model. The key steps are:

  1. Weight-and-Activation Importance Estimation: The method scores each weight by its magnitude multiplied by the corresponding input activation. This is motivated by the observation of emergent large-magnitude features in LLMs, and it avoids both retraining and second-order information.

  2. Per-Output Pruning: The weights with the smallest scores are removed on a per-output basis. No retraining or weight update follows, so the pruned LLM can be used as is.

  3. Pruned Model Evaluation: The researchers evaluate the pruned LLaMA and LLaMA-2 models across various language benchmarks, comparing against magnitude pruning and against recent methods that involve intensive weight updates.

The experiments demonstrate that the proposed method significantly outperforms the established baseline of magnitude pruning and performs competitively against recent methods that involve intensive weight updates.
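
The metric described in the abstract (weight magnitude multiplied by the corresponding input activation, compared within each output) can be sketched in a few lines of plain Python. This is a hedged illustration rather than the authors' code: the names `activation_norms` and `wanda_prune`, the toy data, and the use of the L2 norm over a small batch of calibration inputs to summarize each input feature are assumptions of this sketch.

```python
import math


def activation_norms(X):
    """L2 norm of each input feature over the calibration tokens.

    X has one row per token and one column per input feature.
    Using the L2 norm as the activation summary is an assumption here.
    """
    n_features = len(X[0])
    return [math.sqrt(sum(row[j] ** 2 for row in X)) for j in range(n_features)]


def wanda_prune(W, X, sparsity):
    """Prune W (rows = outputs, cols = inputs) with a weight-times-activation score.

    For each output row, the `sparsity` fraction of weights with the
    smallest |w| * activation_norm is set to zero. No weights are updated.
    """
    norms = activation_norms(X)
    pruned = []
    for row in W:
        scores = [abs(w) * norms[j] for j, w in enumerate(row)]
        n_prune = int(len(row) * sparsity)
        # Indices of the lowest-scoring weights within this output only.
        drop = set(sorted(range(len(row)), key=lambda j: scores[j])[:n_prune])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


# Toy weights and calibration inputs (hypothetical values).
# The last input feature has a much larger magnitude than the others.
W = [[0.9, -0.1, 0.4, 0.05],
     [-0.3, 0.8, -0.02, 0.6]]
X = [[1.0, 1.0, 1.0, 20.0],
     [1.0, -1.0, 1.0, -20.0]]

W_wanda = wanda_prune(W, X, 0.5)
print(W_wanda)  # [[0.9, 0.0, 0.0, 0.05], [0.0, 0.8, 0.0, 0.6]]
```

In this toy case the small weight 0.05 in the first row survives because it multiplies the large-magnitude input feature, whereas ranking by weight magnitude alone would remove it in favor of 0.4. Each row also ends up with exactly half of its weights removed, which is what comparing scores per output implies.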

Critical Analysis

The paper presents a practical and effective pruning approach for large language models, which is an important area of research for improving the efficiency and deployability of these complex AI systems. The authors' focus on balancing model performance and size reduction is well-justified, as it addresses a critical challenge in real-world applications.

However, the paper could have provided more discussion on the potential limitations or caveats of the proposed approach. For example, the authors could have explored the impact of the pruning method on model robustness, transferability, or fairness. Additionally, a deeper analysis of the relationship between the weight-and-activation importance score and the final model performance could shed light on the underlying mechanisms of the pruning technique.

Furthermore, the authors could have compared their approach to other recent advancements in pruning for large language models, such as the work on mixed sparsity pruning or the BESA pruning method, to provide a more comprehensive evaluation of their contribution.

Conclusion

The paper presents a simple and effective pruning approach for large language models that can achieve substantial model size reduction without significant performance degradation. The authors' focus on balancing model performance and size reduction is a crucial consideration for real-world applications of these complex AI systems.

While the paper could have delved deeper into the potential limitations and caveats of the proposed method, the overall contribution is valuable for the field of efficient and deployable language models. The results demonstrate the effectiveness of the authors' approach and provide a foundation for further research and optimization in this important area.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models

Rocktim Jyoti Das, Mingjie Sun, Liqun Ma, Zhiqiang Shen

Large Language Models (LLMs) with billions of parameters are prime targets for network pruning, removing some model weights without hurting performance. Prior approaches such as magnitude pruning, SparseGPT, and Wanda, either concentrated solely on weights or integrated weights with activations for sparsity. However, they overlooked the informative gradients derived from pretrained LLMs. In this paper, we present a novel sparsity-centric pruning method for pretrained LLMs, termed Gradient-based Language Model Pruner (GBLM-Pruner). GBLM-Pruner leverages the first-order term of the Taylor expansion, operating in a training-free manner by harnessing properly normalized gradients from a few calibration samples to determine the pruning metric, and substantially outperforms competitive counterparts like SparseGPT and Wanda in multiple benchmarks. Intriguingly, by incorporating gradients, unstructured pruning with our method tends to reveal some structural patterns, which mirrors the geometric interdependence inherent in the LLMs' parameter structure. Additionally, GBLM-Pruner functions without any subsequent retraining or weight updates to maintain its simplicity as other counterparts. Extensive evaluations on LLaMA-1 and LLaMA-2 across various benchmarks show that GBLM-Pruner surpasses magnitude pruning, Wanda and SparseGPT by significant margins. We further extend our approach on Vision Transformer. Our code and models are available at https://github.com/VILA-Lab/GBLM-Pruner.

4/10/2024

Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen

The popularity of LLaMA (Touvron et al., 2023a;b) and other recently emerged moderate-sized large language models (LLMs) highlights the potential of building smaller yet powerful LLMs. Regardless, the cost of training such models from scratch on trillions of tokens remains high. In this work, we study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models. Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains. We demonstrate the efficacy of our approach by presenting the Sheared-LLaMA series, pruning the LLaMA2-7B model down to 1.3B and 2.7B parameters. Sheared-LLaMA models outperform state-of-the-art open-source models of equivalent sizes, such as Pythia, INCITE, OpenLLaMA and the concurrent TinyLlama models, on a wide range of downstream and instruction tuning evaluations, while requiring only 3% of compute compared to training such models from scratch. This work provides compelling evidence that leveraging existing LLMs with structured pruning is a far more cost-effective approach for building competitive small-scale LLMs.

4/12/2024

Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning

Adib Hasan, Ileana Rugina, Alex Wang

Large Language Models (LLMs) are susceptible to 'jailbreaking' prompts, which can induce the generation of harmful content. This paper demonstrates that moderate WANDA pruning (Sun et al., 2023) can increase their resistance to such attacks without the need for fine-tuning, while maintaining performance on standard benchmarks. Our findings suggest that the benefits of pruning correlate with the initial safety levels of the model, indicating a regularizing effect of WANDA pruning. We introduce a dataset of 225 harmful tasks across five categories to systematically evaluate this safety enhancement. We argue that safety improvements can be understood through a regularization perspective. First, we show that pruning helps LLMs focus more effectively on task-relevant tokens within jailbreaking prompts. Then, we analyze the effects of pruning on the perplexity of malicious prompts before and after their integration into jailbreak templates. Finally, we demonstrate statistically significant performance improvements under domain shifts when applying WANDA to linear models.

4/30/2024

Efficient Pruning of Large Language Model with Adaptive Estimation Fusion

Jun Liu, Chao Wu, Changdi Yang, Hao Tang, Zhenglun Kong, Geng Yuan, Wei Niu, Dong Huang, Yanzhi Wang

Large language models (LLMs) have become crucial for many generative downstream tasks, leading to an inevitable trend and significant challenge to deploy them efficiently on resource-constrained devices. Structured pruning is a widely used method to address this challenge. However, when dealing with the complex structure of the multiple decoder layers, general methods often employ common estimation approaches for pruning. These approaches lead to a decline in accuracy for specific downstream tasks. In this paper, we introduce a simple yet efficient method that adaptively models the importance of each substructure. Meanwhile, it can adaptively fuse coarse-grained and fine-grained estimations based on the results from complex and multilayer structures. All aspects of our design seamlessly integrate into the end-to-end pruning framework. Our experimental results, compared with state-of-the-art methods on mainstream datasets, demonstrate average accuracy improvements of 1.1%, 1.02%, 2.0%, and 1.2% for LLaMa-7B, Vicuna-7B, Baichuan-7B, and Bloom-7b1, respectively.

5/16/2024