Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity

2310.05175

Published 5/7/2024 by Lu Yin, You Wu, Zhenyu Zhang, Cheng-Yu Hsieh, Yaqing Wang, Yiling Jia, Gen Li, Ajay Jaiswal, Mykola Pechenizkiy, Yi Liang and 3 others

Abstract

Large Language Models (LLMs), renowned for their remarkable performance across diverse domains, present a challenge when it comes to practical deployment due to their colossal model size. In response to this challenge, efforts have been directed toward the application of traditional network pruning techniques to LLMs, uncovering a massive number of parameters that can be pruned in one-shot without hurting performance. Prevailing LLM pruning strategies have consistently adhered to the practice of uniformly pruning all layers at equivalent sparsity, resulting in robust performance. However, this observation stands in contrast to the prevailing trends observed in the field of vision models, where non-uniform layerwise sparsity typically yields stronger results. To understand the underlying reasons for this disparity, we conduct a comprehensive study and discover a strong correlation with the emergence of activation outliers in LLMs. Inspired by this finding, we introduce a novel LLM pruning methodology that incorporates a tailored set of non-uniform layerwise sparsity ratios, termed as Outlier Weighed Layerwise sparsity (OWL). The sparsity ratio of OWL is proportional to the outlier ratio observed within each layer, facilitating a more effective alignment between layerwise weight sparsity and outlier ratios. Our empirical evaluation, conducted across the LLaMA-V1 family and OPT, spanning various benchmarks, demonstrates the distinct advantages offered by OWL over previous methods. For instance, OWL exhibits a remarkable performance gain, surpassing the state-of-the-art Wanda and SparseGPT by 61.22 and 6.80 perplexity at a high sparsity level of 70%, respectively, while delivering 2.6x end-to-end inference speed-up in the DeepSparse inference engine. Codes are available at https://github.com/luuyin/OWL.

Overview

  • Presents Outlier Weighed Layerwise Sparsity (OWL), a set of non-uniform layerwise sparsity ratios for pruning large language models (LLMs) to high sparsity while limiting the loss in performance.
  • OWL builds on the observation that the behavior of layerwise sparsity in LLMs is strongly correlated with activation outliers, and it sets each layer's sparsity ratio from the outlier ratio observed in that layer instead of pruning every layer equally.
  • Experiments on the LLaMA-V1 family and OPT show that, at 70% sparsity, OWL surpasses Wanda and SparseGPT by 61.22 and 6.80 perplexity, respectively, and delivers a 2.6x end-to-end inference speed-up in the DeepSparse inference engine.

Plain English Explanation

The paper introduces Outlier Weighed Layerwise Sparsity (OWL), a technique for shrinking large language models (LLMs) by pruning, that is, by removing a large share of their weights. LLMs such as LLaMA and OPT are powerful but very large, which makes them difficult to deploy in practice.

Existing LLM pruning methods remove the same fraction of weights from every layer. The key insight behind OWL is that layers should not all be treated alike: the authors find that how well layerwise sparsity works in LLMs is strongly correlated with the emergence of activation "outliers", and that layers differ in how many outliers they contain. OWL therefore gives each layer its own sparsity ratio, derived from that layer's outlier ratio, instead of using one ratio everywhere.

The benefit shows up at high sparsity. At 70% sparsity, the abstract reports that OWL improves on Wanda and SparseGPT by 61.22 and 6.80 perplexity, respectively, and that the pruned model runs 2.6x faster end to end in the DeepSparse inference engine. Cheaper inference of this kind could make LLMs practical on more modest hardware.

Technical Explanation

The paper presents Outlier Weighed Layerwise Sparsity (OWL), a pruning methodology that assigns a tailored, non-uniform sparsity ratio to each layer of a large language model (LLM).

The core idea behind OWL is to replace the uniform layerwise sparsity used by prevailing LLM pruning strategies with per-layer ratios. The authors first study why uniform sparsity has worked well for LLMs when non-uniform sparsity typically wins for vision models, and they find a strong correlation with the emergence of activation outliers. OWL then measures an outlier ratio for each layer and derives that layer's sparsity ratio from it, so that layerwise weight sparsity is aligned with where outliers occur.
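To make the idea of outlier-based layerwise sparsity concrete, here is a minimal NumPy sketch. It is not the authors' implementation. The function names `outlier_ratio` and `layerwise_sparsity`, the threshold `m`, the band parameter `lam`, and the Wanda-style score (weight magnitude times an input activation norm) are all assumptions made for illustration. The sketch also assumes that layers with more outliers are kept denser; consult the paper and the released code for the exact rule.

```python
import numpy as np


def outlier_ratio(weight, act_norm, m=5.0):
    """Fraction of entries whose score exceeds m times the layer's mean score.

    Assumed score: score[i, j] = |weight[i, j]| * act_norm[j].
    weight:   (out_features, in_features) weight matrix of one layer
    act_norm: (in_features,) norm of each input feature on calibration data
    """
    score = np.abs(weight) * act_norm[None, :]
    return float((score > m * score.mean()).mean())


def layerwise_sparsity(outlier_ratios, target=0.7, lam=0.08):
    """Turn per-layer outlier ratios into per-layer sparsity ratios.

    Assumption: a higher outlier ratio means a lower sparsity for that layer.
    The returned ratios average to `target`, and the gap between the most
    and least sparse layer is 2 * lam.
    """
    d = np.asarray(outlier_ratios, dtype=float)
    spread = d.max() - d.min()
    if spread == 0:
        # All layers look alike: fall back to uniform sparsity.
        return np.full_like(d, target)
    # Rescale outlier ratios to the range [0, 2 * lam].
    scaled = (d - d.min()) / spread * (2 * lam)
    # Flip so that more outliers means less pruning, then recentre on target.
    s = 1.0 - scaled
    return s - s.mean() + target


if __name__ == "__main__":
    ratios = [0.01, 0.02, 0.05]  # hypothetical outlier ratios for 3 layers
    print(layerwise_sparsity(ratios, target=0.7, lam=0.08))
```

In this sketch the per-layer ratios only decide how much of each layer to prune; which individual weights are removed within a layer is left to a separate pruning criterion.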

The authors evaluate OWL on the LLaMA-V1 family and OPT across various benchmarks. At a high sparsity level of 70%, OWL surpasses the state-of-the-art Wanda and SparseGPT by 61.22 and 6.80 perplexity, respectively.

The abstract also reports a 2.6x end-to-end inference speed-up in the DeepSparse inference engine. Wanda, one of the two baselines, is the method introduced in "A Simple and Effective Pruning Approach for Large Language Models".

Critical Analysis

The paper presents a well-motivated study: it starts from an empirical puzzle (uniform layerwise sparsity works well for LLMs, while non-uniform sparsity usually wins for vision models) and connects it to activation outliers. The key strength of OWL is that it improves on strong baselines at a high sparsity level of 70%, where the abstract reports gains of 61.22 and 6.80 perplexity over Wanda and SparseGPT.

One potential limitation is that the evaluation covers the LLaMA-V1 family and OPT. It would be interesting to see how OWL performs on a wider range of LLMs, including more diverse architectures and model sizes.

Additionally, the link between activation outliers and layerwise sparsity is reported as a strong correlation. Exploring the theoretical foundations of this relationship could lead to further insights and refinements of the OWL approach.

Finally, the practical evidence reported in the abstract is a 2.6x end-to-end inference speed-up in the DeepSparse inference engine. Investigating the trade-offs between model size, inference speed, and energy consumption on other runtimes and hardware would be a valuable addition to the analysis.

Conclusion

The paper introduces Outlier Weighed Layerwise Sparsity (OWL), a pruning methodology that replaces uniform layerwise sparsity with non-uniform, per-layer sparsity ratios for large language models. This is a useful advance in model compression because it improves results at high sparsity, where pruned models are most attractive to deploy.

The core idea behind OWL is to set each layer's sparsity ratio from the outlier ratio observed in that layer, following the authors' finding that the behavior of layerwise sparsity in LLMs is strongly correlated with activation outliers. On the LLaMA-V1 family and OPT, OWL surpasses Wanda and SparseGPT by 61.22 and 6.80 perplexity at 70% sparsity, with a 2.6x end-to-end inference speed-up in the DeepSparse inference engine.

This work matters for the broader AI community because it could make large language models cheaper to run across a variety of applications. By making these models more resource-efficient, OWL has the potential to support further work in natural language processing, content generation, and other AI-powered technologies.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Simple and Effective Pruning Approach for Large Language Models

Mingjie Sun, Zhuang Liu, Anna Bair, J. Zico Kolter

As their size increases, Large Language Models (LLMs) are natural candidates for network pruning methods: approaches that drop a subset of network weights while striving to preserve performance. Existing methods, however, require either retraining, which is rarely affordable for billion-scale LLMs, or solving a weight reconstruction problem reliant on second-order information, which may also be computationally expensive. In this paper, we introduce a novel, straightforward yet effective pruning method, termed Wanda (Pruning by Weights and activations), designed to induce sparsity in pretrained LLMs. Motivated by the recent observation of emergent large magnitude features in LLMs, our approach prunes weights with the smallest magnitudes multiplied by the corresponding input activations, on a per-output basis. Notably, Wanda requires no retraining or weight update, and the pruned LLM can be used as is. We conduct a thorough evaluation of our method Wanda on LLaMA and LLaMA-2 across various language benchmarks. Wanda significantly outperforms the established baseline of magnitude pruning and performs competitively against recent methods involving intensive weight updates. Code is available at https://github.com/locuslab/wanda.
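The pruning rule described in this abstract (score each weight by its magnitude times the corresponding input activation, then drop the lowest-scoring weights for each output) can be sketched in NumPy as follows. This is an illustrative sketch, not the released Wanda code. The function name `wanda_mask` is hypothetical, and summarizing each input feature by the L2 norm of its calibration activations is an assumption.

```python
import numpy as np


def wanda_mask(weight, calib_inputs, sparsity=0.5):
    """Return a boolean keep-mask with the same shape as `weight`.

    weight:       (out_features, in_features) weight matrix
    calib_inputs: (n_tokens, in_features) calibration activations
    sparsity:     fraction of weights to prune in every output row
    """
    # Assumed activation summary: L2 norm of each input feature.
    act_norm = np.linalg.norm(calib_inputs, axis=0)
    # Score = weight magnitude times the matching input activation norm.
    score = np.abs(weight) * act_norm[None, :]
    n_prune = int(weight.shape[1] * sparsity)
    # Per output row, find the n_prune lowest-scoring input positions.
    prune_idx = np.argsort(score, axis=1)[:, :n_prune]
    mask = np.ones(weight.shape, dtype=bool)
    np.put_along_axis(mask, prune_idx, False, axis=1)
    return mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 8))
    X = rng.normal(size=(16, 8))
    keep = wanda_mask(W, X, sparsity=0.5)
    W_pruned = W * keep  # no retraining or weight update, as in the abstract
    print(keep.mean())   # fraction of weights kept
```

Ranking within each output row, instead of across the whole matrix, is what "on a per-output basis" refers to: every output keeps the same number of incoming weights.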

5/7/2024

One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models

Hang Shao, Bei Liu, Bo Xiao, Ke Zeng, Guanglu Wan, Yanmin Qian

Various Large Language Models (LLMs) from the Generative Pretrained Transformer (GPT) family have achieved outstanding performance in a wide range of text generation tasks. However, their enormous size has hindered their practical use in real-world applications due to high inference latency. Therefore, improving the efficiency of LLMs through quantization, pruning, and other means has been a key issue in LLM studies. In this work, we propose a method based on Hessian sensitivity-aware mixed sparsity pruning to prune LLMs to at least 50% sparsity without the need for any retraining. It allocates sparsity adaptively based on sensitivity, allowing us to reduce pruning-induced error while maintaining the overall sparsity level. The advantages of the proposed method become even more pronounced when the sparsity is extremely high. Furthermore, our method is compatible with quantization, enabling further compression of LLMs. We have released the available code.

4/24/2024

Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models

Rocktim Jyoti Das, Mingjie Sun, Liqun Ma, Zhiqiang Shen

Large Language Models (LLMs) with billions of parameters are prime targets for network pruning, removing some model weights without hurting performance. Prior approaches such as magnitude pruning, SparseGPT, and Wanda, either concentrated solely on weights or integrated weights with activations for sparsity. However, they overlooked the informative gradients derived from pretrained LLMs. In this paper, we present a novel sparsity-centric pruning method for pretrained LLMs, termed Gradient-based Language Model Pruner (GBLM-Pruner). GBLM-Pruner leverages the first-order term of the Taylor expansion, operating in a training-free manner by harnessing properly normalized gradients from a few calibration samples to determine the pruning metric, and substantially outperforms competitive counterparts like SparseGPT and Wanda in multiple benchmarks. Intriguingly, by incorporating gradients, unstructured pruning with our method tends to reveal some structural patterns, which mirrors the geometric interdependence inherent in the LLMs' parameter structure. Additionally, GBLM-Pruner functions without any subsequent retraining or weight updates to maintain its simplicity as other counterparts. Extensive evaluations on LLaMA-1 and LLaMA-2 across various benchmarks show that GBLM-Pruner surpasses magnitude pruning, Wanda and SparseGPT by significant margins. We further extend our approach on Vision Transformer. Our code and models are available at https://github.com/VILA-Lab/GBLM-Pruner.

4/10/2024

Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment

Abhinav Agarwalla, Abhay Gupta, Alexandre Marques, Shubhra Pandit, Michael Goin, Eldar Kurtic, Kevin Leong, Tuan Nguyen, Mahmoud Salem, Dan Alistarh, Sean Lie, Mark Kurtz

Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks. We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs that achieve full accuracy recovery for fine-tuning tasks at up to 70% sparsity. We achieve this for the LLaMA-2 7B model by combining the SparseGPT one-shot pruning method and sparse pretraining of those models on a subset of the SlimPajama dataset mixed with a Python subset of The Stack dataset. We exhibit training acceleration due to sparsity on Cerebras CS-3 chips that closely matches theoretical scaling. In addition, we establish inference acceleration of up to 3x on CPUs by utilizing Neural Magic's DeepSparse engine and 1.7x on GPUs through Neural Magic's nm-vllm engine. The above gains are realized via sparsity alone, thus enabling further gains through additional use of quantization. Specifically, we show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x. We demonstrate these results across diverse, challenging tasks, including chat, instruction following, code generation, arithmetic reasoning, and summarization to prove their generality. This work paves the way for rapidly creating smaller and faster LLMs without sacrificing accuracy.

5/7/2024