Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

    Published 4/12/2024 by Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen

    Overview

    • The popularity of LLaMA and other moderate-sized language models highlights the potential of building smaller yet powerful models.
    • Training such models from scratch on trillions of tokens remains costly.
    • This work explores structured pruning as a way to develop smaller language models from pre-trained, larger models.
    • The approach uses two key techniques: targeted structured pruning and dynamic batch loading.
    • The result is the Sheared-LLaMA series, which outperforms state-of-the-art open-source models of equivalent size while requiring only 3% of the compute needed to train such models from scratch.

    Plain English Explanation

    The paper focuses on developing smaller, yet powerful, language models. Large language models (LLMs) like LLaMA have shown great potential, but training them from scratch requires enormous computational resources.

    The researchers explore a more efficient approach: taking a larger, pre-trained model and pruning it down to a smaller size. Pruning means selectively removing parts of the model, like layers, attention heads, and hidden dimensions, to reduce its overall size and complexity.
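
    To make "removing parts of the model" concrete, here is a toy sketch of structured removal on a two-matrix feed-forward block. The matrices, sizes, and helper names are hypothetical and chosen only for illustration; this is not the paper's pruning procedure, which learns what to remove rather than taking the indices as input.

```python
# Toy illustration of structured pruning: whole units are removed, so the
# result is a smaller dense model rather than a sparse one.
# All names and values here are hypothetical.

def prune_rows(matrix, keep):
    """Keep only the rows whose index is in `keep`."""
    return [matrix[i] for i in keep]


def prune_cols(matrix, keep):
    """Keep only the columns whose index is in `keep`."""
    return [[row[j] for j in keep] for row in matrix]


def prune_mlp(w_in, w_out, keep_neurons):
    """Remove intermediate neurons from a two-matrix feed-forward block.

    w_in has shape [intermediate][hidden]; w_out has shape [hidden][intermediate].
    Dropping neuron k removes row k of w_in and column k of w_out together,
    so the two matrices stay compatible.
    """
    return prune_rows(w_in, keep_neurons), prune_cols(w_out, keep_neurons)


# A 4-neuron block over a hidden size of 2, pruned down to 2 neurons.
w_in = [[1, 2], [3, 4], [5, 6], [7, 8]]
w_out = [[1, 0, 2, 0], [0, 3, 0, 4]]
small_in, small_out = prune_mlp(w_in, w_out, keep_neurons=[0, 2])
print(small_in)   # [[1, 2], [5, 6]]
print(small_out)  # [[1, 2], [0, 0]]
```

    The same idea extends to attention heads and entire layers: each is dropped as a unit, and the remaining weights form an ordinary, smaller model.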

    The key innovations are:

    1. Targeted Structured Pruning: This prunes the model in an end-to-end manner, removing specific components to reach a target model size.
    2. Dynamic Batch Loading: This dynamically updates the mix of training data in each batch, giving more weight to the domains where the model is struggling the most.

    By applying these techniques, the researchers created the "Sheared-LLaMA" series - smaller versions of the LLaMA model that outperform other compact, open-source models on a range of tasks. Crucially, these Sheared-LLaMA models only require 3% of the compute needed to train a similar-sized model from scratch.

    Technical Explanation

    The paper presents a structured pruning approach to develop smaller yet powerful language models from larger, pre-trained models.

    The key techniques are:

    1. Targeted Structured Pruning: This method prunes a larger model to a specified target shape by removing layers, attention heads, and intermediate/hidden dimensions in an end-to-end manner. Because whole components are removed rather than individual weights, as in unstructured pruning, the result is a compact dense model with a chosen architecture.

    2. Dynamic Batch Loading: This dynamically updates the composition of training data in each batch based on the varying losses across different domains. This helps the model focus on the domains where it is struggling the most as training continues after pruning.
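
    The second technique, re-weighting domains by loss, can be sketched as follows. This is a minimal sketch, assuming an exponential update driven by how far each domain's current loss sits above a per-domain reference loss. The domain names, loss values, and function names are hypothetical, and the exact update rule should be checked against the paper.

```python
import math
import random


def update_domain_weights(weights, current_loss, reference_loss):
    """Shift sampling weight toward domains whose loss exceeds its reference.

    Assumed rule: scale each weight by exp(excess loss), then renormalize.
    """
    scaled = {}
    for domain, weight in weights.items():
        excess = max(current_loss[domain] - reference_loss[domain], 0.0)
        scaled[domain] = weight * math.exp(excess)
    total = sum(scaled.values())
    return {domain: value / total for domain, value in scaled.items()}


def sample_batch_domains(weights, batch_size, rng):
    """Pick which domain each example in the next batch is drawn from."""
    domains = list(weights)
    probs = [weights[d] for d in domains]
    return rng.choices(domains, weights=probs, k=batch_size)


# Hypothetical domains and losses, for illustration only.
weights = {"web": 0.5, "code": 0.25, "books": 0.25}
current = {"web": 2.0, "code": 1.5, "books": 2.2}
reference = {"web": 2.0, "code": 1.0, "books": 2.1}

new_weights = update_domain_weights(weights, current, reference)
batch = sample_batch_domains(new_weights, batch_size=8, rng=random.Random(0))
print(new_weights)
print(batch)
```

    In this example "code" has the largest gap to its reference loss, so its share of the next batch grows, while "web", which already matches its reference, shrinks after renormalization.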

    The researchers apply these techniques to prune the LLaMA2-7B model down to 1.3B and 2.7B parameter versions, creating the "Sheared-LLaMA" series. These pruned models outperform state-of-the-art open-source models of equivalent size, such as Pythia, INCITE, OpenLLaMA, and the concurrent TinyLlama models, on a wide range of downstream and instruction-tuning evaluations.
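
    To see how a "target shape" maps to the 1.3B and 2.7B sizes, the sketch below counts parameters for a LLaMA-style decoder. The layer counts, hidden sizes, intermediate sizes, and vocabulary size are assumptions taken from publicly available model configurations, not from this summary, and the formula assumes standard multi-head attention with untied input and output embeddings.

```python
def llama_param_count(layers, hidden, intermediate, vocab=32000):
    """Approximate parameter count of a LLaMA-style decoder-only model."""
    attention = 4 * hidden * hidden      # Q, K, V, and output projections
    mlp = 3 * hidden * intermediate      # gate, up, and down projections
    norms = 2 * hidden                   # two RMSNorm weight vectors per layer
    embeddings = 2 * vocab * hidden      # untied input and output embeddings
    final_norm = hidden
    return layers * (attention + mlp + norms) + embeddings + final_norm


# Assumed shapes (not stated in this summary).
shapes = {
    "source 7B": dict(layers=32, hidden=4096, intermediate=11008),
    "target 2.7B": dict(layers=32, hidden=2560, intermediate=6912),
    "target 1.3B": dict(layers=24, hidden=2048, intermediate=5504),
}

counts = {name: llama_param_count(**shape) for name, shape in shapes.items()}
for name, count in counts.items():
    print(f"{name}: {count / 1e9:.2f}B parameters")
```

    Under these assumptions, shrinking the hidden and intermediate dimensions accounts for most of the reduction, and the 1.3B shape additionally drops a quarter of the layers.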

    Crucially, the Sheared-LLaMA models only require 3% of the compute needed to train such models from scratch, making this a far more cost-effective approach for building competitive small-scale language models.

    Critical Analysis

    The paper presents a compelling approach to developing smaller yet powerful language models by leveraging existing large models through structured pruning. Targeted pruning and dynamic batch loading are innovative techniques that allow for fine-grained control over the model architecture and training process.

    One potential limitation is that the paper only evaluates the Sheared-LLaMA models on a relatively narrow set of downstream tasks. It would be valuable to see how these models perform on a wider range of real-world applications, especially those that require more nuanced language understanding.

    Additionally, the paper does not delve into the interpretability or explainability of the pruned models. As these smaller models are intended for deployment in real-world scenarios, understanding their inner workings and decision-making processes could be an important area for future research.

    Finally, while the compute savings are substantial, the paper does not provide a detailed cost analysis or discussion of the environmental impact of this approach compared to training smaller models from scratch. As the field of AI continues to grapple with issues of sustainability, these considerations will become increasingly important.

    Overall, this work represents a promising step towards more efficient and accessible language models, but further research is needed to fully understand the broader implications and potential limitations of this approach.

    Conclusion

    This paper presents a structured pruning approach to developing smaller yet powerful language models from larger, pre-trained models. By employing targeted pruning and dynamic batch loading techniques, the researchers were able to create the Sheared-LLaMA series, which outperforms state-of-the-art open-source models of equivalent size while requiring only a fraction of the compute needed to train such models from scratch.

    This work highlights the potential of leveraging existing large language models to build more cost-effective and accessible smaller-scale models, which could have significant implications for the democratization of AI technology and its broader societal impact. As the field continues to grapple with issues of model size, efficiency, and sustainability, this research provides a compelling example of how innovation in model architecture and training can lead to more practical and impactful language models.

    Full paper

    Read original: arXiv:2310.06694



    This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

    Related Papers

    Optimization-based Structural Pruning for Large Language Models without Back-Propagation

    Yuan Gao, Zujing Liu, Weizhong Zhang, Bo Du, Gui-Song Xia

    In contrast to moderate-size neural network pruning, structural weight pruning on the Large-Language Models (LLMs) imposes a novel challenge on the efficiency of the pruning algorithms, due to the heavy computation/memory demands of the LLMs. Recent efficient LLM pruning methods typically operate at the post-training phase without the expensive weight finetuning, however, their pruning criteria often rely on heuristically hand-crafted metrics, potentially leading to suboptimal performance. We instead propose a novel optimization-based structural pruning that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. To preserve the efficiency, our method eliminates the back-propagation through the LLM per se during the optimization, requiring only the forward pass of the LLM. We achieve this by learning an underlying Bernoulli distribution to sample binary pruning masks, where we decouple the Bernoulli parameters from the LLM loss, thus facilitating an efficient optimization via a policy gradient estimator without back-propagation. As a result, our method is able to 1) operate at structural granularities of channels, heads, and layers, 2) support global and heterogeneous pruning (i.e., our method automatically determines different redundancy for different layers), and 3) optionally initialize with a metric-based method (for our Bernoulli distributions). Extensive experiments on LLaMA, LLaMA-2, LLaMA-3, Vicuna, and Mistral using the C4 and WikiText2 datasets demonstrate that our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU, and our pruned models outperform the state-of-the-arts w.r.t. both perplexity and the majority of various zero-shot tasks. Codes will be released.

    10/22/2024

    Efficient Pruning of Large Language Model with Adaptive Estimation Fusion

    Jun Liu, Chao Wu, Changdi Yang, Hao Tang, Zhenglun Kong, Geng Yuan, Wei Niu, Dong Huang, Yanzhi Wang

    Large language models (LLMs) have become crucial for many generative downstream tasks, leading to an inevitable trend and significant challenge to deploy them efficiently on resource-constrained devices. Structured pruning is a widely used method to address this challenge. However, when dealing with the complex structure of the multiple decoder layers, general methods often employ common estimation approaches for pruning. These approaches lead to a decline in accuracy for specific downstream tasks. In this paper, we introduce a simple yet efficient method that adaptively models the importance of each substructure. Meanwhile, it can adaptively fuse coarse-grained and fine-grained estimations based on the results from complex and multilayer structures. All aspects of our design seamlessly integrate into the end-to-end pruning framework. Our experimental results, compared with state-of-the-art methods on mainstream datasets, demonstrate average accuracy improvements of 1.1%, 1.02%, 2.0%, and 1.2% for LLaMa-7B, Vicuna-7B, Baichuan-7B, and Bloom-7b1, respectively.

    5/16/2024

    SparseLLM: Towards Global Pruning for Pre-trained Language Models

    Guangji Bai, Yijiang Li, Chen Ling, Kibaek Kim, Liang Zhao

    The transformative impact of large language models (LLMs) like LLaMA and GPT on natural language processing is countered by their prohibitive computational demands. Pruning has emerged as a pivotal compression strategy, introducing sparsity to enhance both memory and computational efficiency. Yet, traditional global pruning is impractical for LLMs due to scalability issues, while local pruning, despite its efficiency, leads to suboptimal solutions. Addressing these challenges, we propose SparseLLM, a novel framework that redefines the global pruning process into manageable, coordinated subproblems, allowing for resource-efficient optimization with global optimality. SparseLLM's approach, which conceptualizes LLMs as a chain of modular functions and leverages auxiliary variables for problem decomposition, not only facilitates a pragmatic application on LLMs but also demonstrates significant performance improvements, particularly in high-sparsity regimes where it surpasses current state-of-the-art methods.

    11/4/2024

    DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models

    Shangqian Gao, Chi-Heng Lin, Ting Hua, Tang Zheng, Yilin Shen, Hongxia Jin, Yen-Chang Hsu

    Large Language Models (LLMs) have achieved remarkable success in various natural language processing tasks, including language modeling, understanding, and generation. However, the increased memory and computational costs associated with these models pose significant challenges for deployment on resource-limited devices. Structural pruning has emerged as a promising solution to reduce the costs of LLMs without requiring post-processing steps. Prior structural pruning methods either follow the dependence of structures at the cost of limiting flexibility, or introduce non-trivial additional parameters by incorporating different projection matrices. In this work, we propose a novel approach that relaxes the constraint imposed by regular structural pruning methods and eliminates the structural dependence along the embedding dimension. Our dimension-independent structural pruning method offers several benefits. Firstly, our method enables different blocks to utilize different subsets of the feature maps. Secondly, by removing structural dependence, we facilitate each block to possess varying widths along its input and output dimensions, thereby significantly enhancing the flexibility of structural pruning. We evaluate our method on various LLMs, including OPT, LLaMA, LLaMA-2, Phi-1.5, and Phi-2. Experimental results demonstrate that our approach outperforms other state-of-the-art methods, showing for the first time that structural pruning can achieve an accuracy similar to semi-structural pruning.

    11/5/2024