One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models

2310.09499

Published 4/24/2024 by Hang Shao, Bei Liu, Bo Xiao, Ke Zeng, Guanglu Wan, Yanmin Qian

Abstract

Various Large Language Models (LLMs) from the Generative Pretrained Transformer (GPT) family have achieved outstanding performance in a wide range of text generation tasks. However, their enormous size has hindered practical use in real-world applications due to high inference latency. Therefore, improving the efficiency of LLMs through quantization, pruning, and other means has been a key issue in LLM studies. In this work, we propose a method based on Hessian sensitivity-aware mixed sparsity pruning to prune LLMs to at least 50% sparsity without any retraining. It allocates sparsity adaptively based on sensitivity, allowing us to reduce pruning-induced error while maintaining the overall sparsity level. The advantages of the proposed method become even more apparent when the sparsity is extremely high. Furthermore, our method is compatible with quantization, enabling further compression of LLMs. We have released the code.

Overview

  • Large language models (LLMs) from the Generative Pretrained Transformer (GPT) family have achieved impressive performance on various text generation tasks.
  • However, their enormous model sizes have hindered practical use due to high inference latency.
  • Improving the efficiency of LLMs through techniques like quantization, pruning, and other methods has been a key focus.

Plain English Explanation

Large language models, such as those from the GPT family, have shown remarkable capabilities in generating human-like text. These models have become increasingly powerful, but their immense size has created challenges in real-world applications. The large amount of data and computation required to run these models leads to slow response times, making them impractical for many uses.

To address this issue, researchers have been exploring ways to streamline and optimize these language models without significantly compromising their performance. One approach is pruning, which selectively removes parts of the model, typically individual weights that contribute little to its output. One-shot pruning does this in a single pass, without retraining the model afterwards. This reduces the model's size and complexity, which can lead to faster inference.

In this research, the authors propose a new pruning method based on Hessian sensitivity-aware mixed sparsity. This approach adaptively allocates sparsity, or the degree of pruning, based on the sensitivity of different parts of the model. By doing so, they can achieve at least 50% sparsity without the need for retraining, while minimizing the loss in performance.

Furthermore, the proposed method is compatible with quantization, another technique for compressing models by reducing the precision of numeric representations. This allows for even greater compression of the language models, further improving their efficiency and practicality for real-world applications.

Technical Explanation

The researchers present a method for pruning large language models (LLMs) from the GPT family to achieve at least 50% sparsity without the need for retraining. Their approach is based on Hessian sensitivity-aware mixed sparsity pruning, which allocates sparsity adaptively based on the sensitivity of different parts of the model.
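The idea of allocating sparsity by sensitivity can be illustrated with a small sketch. This summary does not state the paper's actual allocation rule, so the rule below is an assumption made for illustration: layers are ranked by a sensitivity score and given sparsity levels spaced linearly around the target, so that less sensitive layers are pruned more while the average stays fixed. The function name `allocate_sparsity`, the `spread` parameter, and the sensitivity values are hypothetical, and the layers are assumed to be equally sized.

```python
def allocate_sparsity(sensitivities, target=0.5, spread=0.1):
    """Assign per-layer sparsity so that the mean equals `target`.

    Illustrative rule (an assumption, not the paper's formula): the least
    sensitive layer gets target + spread, the most sensitive layer gets
    target - spread, and the rest are spaced linearly in between.
    Assumes all layers have the same number of weights.
    """
    n = len(sensitivities)
    if n == 0:
        return []
    if n == 1:
        return [target]
    if target - spread < 0.0 or target + spread > 1.0:
        raise ValueError("target +/- spread must stay within [0, 1]")
    # Layer indices sorted from least to most sensitive.
    order = sorted(range(n), key=lambda i: sensitivities[i])
    sparsity = [0.0] * n
    for rank, idx in enumerate(order):
        # rank 0 -> target + spread, rank n - 1 -> target - spread
        sparsity[idx] = target + spread - 2.0 * spread * rank / (n - 1)
    return sparsity


# Made-up sensitivity scores for four layers (higher = more sensitive).
layer_sensitivity = [4.0, 0.5, 2.0, 1.0]
layer_sparsity = allocate_sparsity(layer_sensitivity, target=0.5, spread=0.1)
```

With these inputs the most sensitive layer (index 0) receives the lowest sparsity and the least sensitive layer (index 1) the highest, while the mean over layers remains at the 50% target.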

The Hessian sensitivity-aware approach allows the researchers to reduce pruning-induced error while maintaining the overall sparsity level. This is particularly beneficial when the sparsity is extremely high, as the proposed method can better preserve the model's performance.
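This summary does not spell out how the Hessian enters the method. For orientation only, the classic Optimal Brain Surgeon (OBS) result shows how second-order information measures the cost of removing a single weight. Whether the paper uses exactly this form is an assumption here. In the block below, $w_q$ is the weight being removed, $H$ is the Hessian of the loss with respect to the weights, and $e_q$ is the $q$-th unit vector.

```latex
% OBS: optimal update of the remaining weights after removing w_q,
% and the resulting increase in loss (the "saliency" of w_q).
\delta w = -\frac{w_q}{[H^{-1}]_{qq}} \, H^{-1} e_q ,
\qquad
\varepsilon_q = \frac{w_q^{2}}{2\,[H^{-1}]_{qq}}
```

A weight with small $\varepsilon_q$ can be removed at little cost. Parts of a model where such costs are large are the sensitive ones, and a mixed-sparsity scheme would assign them a lower pruning ratio.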

Additionally, the researchers show that their pruning method is compatible with quantization, enabling further compression of the LLMs. This combination of pruning and quantization allows for significant improvements in the efficiency of these large language models, making them more practical for real-world applications.
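How pruning and quantization compose can be sketched in a few lines. The code below is a simplified illustration, not the paper's pipeline: plain magnitude pruning stands in for the Hessian-based weight selection, and the quantizer is a generic symmetric round-to-nearest scheme. The helper names `magnitude_prune` and `quantize_symmetric` and the example weights are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights.

    Stand-in for the Hessian-based selection described in the text.
    """
    n_prune = int(len(weights) * sparsity)
    # Indices of the n_prune weights with the smallest absolute value.
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_symmetric(weights, bits=4):
    """Round each weight to the nearest point on a symmetric uniform grid.

    Zero is always on the grid, so pruned weights stay exactly zero.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    if max_abs == 0.0:
        return list(weights)
    levels = 2 ** (bits - 1) - 1  # e.g. 7 positive levels for 4 bits
    scale = max_abs / levels
    return [round(w / scale) * scale for w in weights]


# Made-up weight row for illustration.
row = [0.9, -0.05, 0.3, -0.7, 0.02, 0.15, -0.4, 0.6]
pruned = magnitude_prune(row, sparsity=0.5)
compressed = quantize_symmetric(pruned, bits=4)
```

Because the quantization grid contains zero, the sparsity pattern produced by pruning survives quantization, which is one reason the two techniques can be stacked.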

Critical Analysis

The researchers have presented a promising approach for improving the efficiency of large language models without extensive retraining. The use of Hessian sensitivity-aware mixed sparsity pruning appears to be an effective strategy for achieving high sparsity levels while minimizing performance degradation.

One potential limitation of the research is the lack of comprehensive benchmarking across a wider range of language tasks and datasets. The authors have primarily focused on evaluating the pruned models on a few specific tasks, and it would be valuable to see how the approach performs on a more diverse set of applications.

Additionally, the researchers could explore the potential trade-offs between the level of sparsity achieved and the resulting inference latency improvements. This could help users better understand the practical implications of the proposed method and make informed decisions about the appropriate level of model compression for their specific use cases.

Further research could also investigate the interplay between pruning and quantization, examining how the two techniques can be combined and optimized to achieve the best balance of model efficiency and performance.

Conclusion

The proposed Hessian sensitivity-aware mixed sparsity pruning method is a useful step toward more efficient large language models. By adaptively allocating sparsity based on the sensitivity of different model parts, the researchers prune GPT-based LLMs to at least 50% sparsity without retraining while reducing pruning-induced error.

The compatibility of this pruning approach with quantization further enhances the compression capabilities of these models, making them more practical for real-world applications that require low inference latency. As the field of large language models continues to evolve, this research highlights the importance of developing efficient techniques to address the computational challenges posed by the growing model sizes.

The availability of the code for this work is also a valuable contribution, as it allows other researchers and developers to build upon and further explore the potential of these efficiency-focused techniques for large language models.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment

Abhinav Agarwalla, Abhay Gupta, Alexandre Marques, Shubhra Pandit, Michael Goin, Eldar Kurtic, Kevin Leong, Tuan Nguyen, Mahmoud Salem, Dan Alistarh, Sean Lie, Mark Kurtz

Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks. We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs that achieve full accuracy recovery for fine-tuning tasks at up to 70% sparsity. We achieve this for the LLaMA-2 7B model by combining the SparseGPT one-shot pruning method and sparse pretraining of those models on a subset of the SlimPajama dataset mixed with a Python subset of The Stack dataset. We exhibit training acceleration due to sparsity on Cerebras CS-3 chips that closely matches theoretical scaling. In addition, we establish inference acceleration of up to 3x on CPUs by utilizing Neural Magic's DeepSparse engine and 1.7x on GPUs through Neural Magic's nm-vllm engine. The above gains are realized via sparsity alone, thus enabling further gains through additional use of quantization. Specifically, we show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x. We demonstrate these results across diverse, challenging tasks, including chat, instruction following, code generation, arithmetic reasoning, and summarization to prove their generality. This work paves the way for rapidly creating smaller and faster LLMs without sacrificing accuracy.

5/7/2024

SparseLLM: Towards Global Pruning for Pre-trained Language Models

Guangji Bai, Yijiang Li, Chen Ling, Kibaek Kim, Liang Zhao

The transformative impact of large language models (LLMs) like LLaMA and GPT on natural language processing is countered by their prohibitive computational demands. Pruning has emerged as a pivotal compression strategy, introducing sparsity to enhance both memory and computational efficiency. Yet, traditional global pruning is impractical for LLMs due to scalability issues, while local pruning, despite its efficiency, leads to suboptimal solutions. Addressing these challenges, we propose SparseLLM, a novel framework that redefines the global pruning process into manageable, coordinated subproblems, allowing for resource-efficient optimization with global optimality. SparseLLM's approach, which conceptualizes LLMs as a chain of modular functions and leverages auxiliary variables for problem decomposition, not only facilitates a pragmatic application on LLMs but also demonstrates significant performance improvements, particularly in high-sparsity regimes where it surpasses current state-of-the-art methods.

5/27/2024

SqueezeLLM: Dense-and-Sparse Quantization

Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer

Generative Large Language Models (LLMs) have demonstrated remarkable results for a wide range of tasks. However, deploying these models for inference has been a significant challenge due to their unprecedented resource requirements. This has forced existing deployment frameworks to use multi-GPU inference pipelines, which are often complex and costly, or to use smaller and less performant models. In this work, we demonstrate that the main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, specifically for single batch inference. While quantization has emerged as a promising solution by representing weights with reduced precision, previous efforts have often resulted in notable performance degradation. To address this, we introduce SqueezeLLM, a post-training quantization framework that not only enables lossless compression to ultra-low precisions of up to 3-bit, but also achieves higher quantization performance under the same memory constraint. Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format. When applied to the LLaMA models, our 3-bit quantization significantly reduces the perplexity gap from the FP16 baseline by up to 2.1x as compared to the state-of-the-art methods with the same memory requirement. Furthermore, when deployed on an A6000 GPU, our quantized models achieve up to 2.3x speedup compared to the baseline. Our code is available at https://github.com/SqueezeAILab/SqueezeLLM.

6/6/2024

Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models

Rocktim Jyoti Das, Mingjie Sun, Liqun Ma, Zhiqiang Shen

Large Language Models (LLMs) with billions of parameters are prime targets for network pruning, removing some model weights without hurting performance. Prior approaches such as magnitude pruning, SparseGPT, and Wanda, either concentrated solely on weights or integrated weights with activations for sparsity. However, they overlooked the informative gradients derived from pretrained LLMs. In this paper, we present a novel sparsity-centric pruning method for pretrained LLMs, termed Gradient-based Language Model Pruner (GBLM-Pruner). GBLM-Pruner leverages the first-order term of the Taylor expansion, operating in a training-free manner by harnessing properly normalized gradients from a few calibration samples to determine the pruning metric, and substantially outperforms competitive counterparts like SparseGPT and Wanda in multiple benchmarks. Intriguingly, by incorporating gradients, unstructured pruning with our method tends to reveal some structural patterns, which mirrors the geometric interdependence inherent in the LLMs' parameter structure. Additionally, GBLM-Pruner functions without any subsequent retraining or weight updates to maintain its simplicity as other counterparts. Extensive evaluations on LLaMA-1 and LLaMA-2 across various benchmarks show that GBLM-Pruner surpasses magnitude pruning, Wanda and SparseGPT by significant margins. We further extend our approach on Vision Transformer. Our code and models are available at https://github.com/VILA-Lab/GBLM-Pruner.

4/10/2024