Separate, Dynamic and Differentiable (SMART) Pruner for Block/Output Channel Pruning on Computer Vision Tasks
2403.19969
0
0
Abstract
Deep Neural Network (DNN) pruning has emerged as a key strategy to reduce model size, improve inference latency, and lower power consumption on DNN accelerators. Among various pruning techniques, block and output channel pruning have shown significant potential in accelerating hardware performance. However, their accuracy often requires further improvement. In response to this challenge, we introduce a separate, dynamic and differentiable (SMART) pruner. This pruner stands out by utilizing a separate, learnable probability mask for weight importance ranking, employing a differentiable Top k operator to achieve target sparsity, and leveraging a dynamic temperature parameter trick to escape from nonsparse local minima. In our experiments, the SMART pruner consistently demonstrated its superiority over existing pruning methods across a wide range of tasks and models on block and output channel pruning. Additionally, we extend our testing to Transformerbased models in N:M pruning scenarios, where SMART pruner also yields stateoftheart results, demonstrating its adaptability and robustness across various neural network architectures, and pruning types.
Get summaries of the top AI research delivered straight to your inbox:
Introduction
This paper focuses on improving the deployment efficiency of deep neural networks (DNNs) on resourceconstrained edge devices. One popular approach for this is neural network pruning, which removes less significant parameters from the network to conserve computational resources. The paper discusses various hardwaresupported pruning techniques, including N:M pruning, block pruning, and output channel pruning.
The paper then reviews the limitations of existing pruning methods, such as the risk of pruning important weights in criteriabased approaches, the inability of weightfrozen learnable mask methods to adapt to weight changes, and the "STE learning delay problem" in straightthrough estimator (STE)based approaches.
To address these challenges, the authors propose a novel pruning method called SMART pruning. SMART uses a separate learnable soft probability mask to rank the importance of weights across different layers, which helps capture crosslayer dependencies more effectively. It also employs a differentiable Topk operator and a dynamic temperature parameter trick to facilitate the mask's convergence to a binary mask without the risk of getting stuck in a nonsparse local minimum.
Compared to prior techniques, SMART offers several advantages: it integrates weight importance ranking into the training process, dynamically adapts the mask to weight changes, eliminates the need for complicated auxiliary model training, and provides convergence guarantees. The paper also includes theoretical analysis and experimental results demonstrating the stateoftheart performance of SMART pruning across various models and tasks.
Methodology
The text introduces the methodology of the SMART pruner. Section 2.1 presents the differentiable Top k operator, which is a key component of the SMART pruner. Section 2.2 explains the SMART pruning algorithm, including the problem formulation and training flow. Finally, Section 2.3 discusses the dynamic temperature parameter trick and its role in avoiding nonsparse local minima.
The paper discusses a differentiable Top k operator and a SMART pruning algorithm that leverages this operator. The key points are:

The standard Top k operator is nondifferentiable, which hinders its use in gradientbased learning frameworks. To address this, the paper introduces a differentiable Top k operator based on the sigmoid function. This operator can approximate the standard Top k operator with arbitrary precision by adjusting a temperature parameter.

The paper formulates the pruning problem as an optimization task and proposes the SMART pruning algorithm to solve it. SMART uses the differentiable Top k operator to identify the optimal pruning structure. It consists of three stages: pretraining to develop an initial model, structural searching to determine the pruning structure, and finetuning to enhance performance.

To prevent the algorithm from converging to nonsparse local minima, the paper introduces a dynamic temperature parameter trick. This involves continuously reducing the temperature parameter during the structural searching phase, which induces fluctuations that help the algorithm escape nonsparse solutions.

Theorems and propositions are provided to analyze the properties of the differentiable Top k operator and the effectiveness of the dynamic temperature parameter trick in facilitating sparse solutions.
The SMART pruning algorithm aims to directly solve the fundamental pruning problem, mitigating the impact of regularization bias and outperforming existing pruning methods.
Experimental Results
The paper compares a SMART pruning algorithm with stateoftheart block, output channel, and N:M pruning schemes across various computer vision tasks and models. The authors made minor modifications to some models to optimize hardware compatibility on a toptier Chinese autonomous driving chip. All models were trained from scratch, and detailed hyperparameters are available in the appendix.
The study benchmarked the SMART pruning algorithm against recent advancements, including PDP, ACDC, PaS, and AWG, on classification, object detection, and image segmentation tasks. The authors used Resnet50, Yolov5m, and a variant of BiSeNetv2 on ImageNet, COCO, and Cityscapes datasets, respectively. The block shape was defined as 16x8x1x1 to ensure compatibility with the target hardware.
The primary goal was to balance model accuracy and latency, using sparsity as a proxy for latency. The experimental results, summarized in Table 2, show that the SMART pruning algorithm outperforms the benchmark methods across all the evaluated models and sparsity levels (30%, 50%, and 70%). The distinct layerwise sparsity patterns exhibited by the SMART pruner, as depicted in Figure 4, suggest that its training flow influences the crosslayer sparsity distribution, contributing to its superior performance.
The provided text discusses the benefits and drawbacks of output channel pruning, which involves eliminating entire output channels in neural networks. Output channel pruning is advantageous as it reduces data I/O and can provide hardwareagnostic acceleration, meaning it can improve performance on any hardware without requiring specialized designs. However, it typically results in worse accuracy compared to other structural pruning methods at equivalent sparsity levels due to its larger granularity.
The text then compares five different pruning methods  SMART, PDP, PaS, AWG, and ACDC  across three models: Yolov5m, Resnet50, and BiSeNetv2. This comparison was conducted at output channel sparsity levels of 20% and 50%. The results demonstrate that the SMART pruning algorithm is significantly more effective than the other methods in the context of output channel pruning. The superior performance of the SMART pruner is attributed to its ability to achieve better crosslayer sparsity, which becomes more critical with the larger granularity of output channel pruning.
The provided text summarizes the application of various pruning algorithms, including the SMART pruning method, to Transformerbased models such as the Swin Transformer and Vision Transformer (ViT). The experiments used a 2:4 pruning ratio, and the performance metric was Top 1 accuracy. The results show that the SMART pruning algorithm performs well on Transformerbased models in the context of N:M pruning, demonstrating the algorithm's versatility and robustness beyond convolutional neural network (CNN) architectures.
Conclusion
The paper proposes a new SMART pruner that improves block and output channel pruning. The SMART pruner ranks the importance of weights based on a learnable probability mask to better capture crosslayer correlations. It also uses a dynamic temperature parameter trick to overcome nonsparse local minima, further enhancing the pruner's effectiveness. Empirical studies show the SMART pruner outperforms existing methods across computer vision tasks and models, particularly for block and output channel pruning. Additionally, the SMART pruner demonstrates superiority in N:M pruning for Transformerbased models, showcasing its adaptability and robustness.
Appendix A Proof of Theorem 2.1
The provided text proves Theorem 2.1 by contradiction. It defines a sequence of values z_i and shows that in the limit as τ goes to 0, the sum of the sigmoid functions of the z_i values must be equal to k, a constant. However, this leads to a contradiction, so the theorem must be true.
The key steps are:
 Show that the difference between adjacent z_i values goes to infinity as τ goes to 0.
 Show that the sum of the sigmoid functions of the z_i values equals k.
 Show that the sigmoid function of z_(Nk) goes to 0 and the sigmoid function of z_(Nk+1) goes to 1 as τ goes to 0.
 This leads to a contradiction, since the sum of the sigmoid functions cannot equal k, so the theorem must be true.
There is no text provided for a summary after this section.
Appendix B Proof of Proposition 2.2
The provided text outlines the calculation of the gradient of the function f_i(x). The key steps are:

The gradient of f_i(x) with respect to x_j is derived as: d/dx_j f_i(x) = (1/τ)σ'((x_i/τ) + t(x_1, ..., x_N))*(I_[j=i]  σ'((x_j/τ) + t(x_1, ..., x_N))/Σ_i σ'((x_i/τ) + t(x_1, ..., x_N)))

The term d/dx_j t(x_1, ..., x_N) is obtained as: d/dx_j t(x_1, ..., x_N) = σ'((x_j/τ) + t(x_1, ..., x_N))/Σ_i σ'((x_i/τ) + t(x_1, ..., x_N))

By defining the vector v = [σ'((x_1/τ) + t(x_1, ..., x_N)), ..., σ'((x_N/τ) + t(x_1, ..., x_N))]^T, the Jacobian matrix J_Top_k(x) is given as: J_Top_k(x) = (1/τ)
(diag(v)  v v^T)
The provided equations and derivations allow for the calculation of the gradient of the function f_i(x) with respect to the input variables.
Appendix C Proof of Theorem 2.3
The provided text presents a proof by contradiction. The main idea is to assume the existence of a vector w** that has a lower loss function value than the global optimum solution (w*, m*), which contradicts the premise that (w*, m*) is the global optimum. The proof involves defining w** and m** in terms of the assumed vector w^, and then showing that L(w ⊙ Top_k(m**)) < L(w* ⊙ Top_k(m*)), which is a contradiction.
Appendix D Proof of Theorem 2.4
The proof shows that the differentiable Topk function fᵧ(m) can approximate the standard Topk function Top₆(m) with arbitrary precision. Specifically, there exists a τ such that for any τ in the range (0, τ̃], the difference between the two functions is less than Δ₁/Κ₁, where Δ₁ is the difference in the loss function L between the two Topk approximations, and Κ₁ is the Lipschitz constant of L.
This allows the inequality L(w̃ ⊙ fᵧ(m̃))  L(w* ⊙ fᵧ(m*)) ≥ Δ₁  Δ₁ = 0 to hold. The proof concludes with this result.
Appendix E Illustration of STE Learning Delay Issue
The provided text summarizes the loss function and gradients used in typical straightthrough estimator (STE)based methods. The loss function is mathematically formulated as the minimization of a primary loss term L(w⊙I(m)) plus a regularization term λR(m), where w and m represent the weight and mask parameters, respectively, I(m) is the rounding operator that rounds the mask values to 0 or 1, and λ is a tuning parameter.
The gradients of this loss function with respect to the weights (w) and masks (m) are also derived. The gradient with respect to the weights is the gradient of the loss with respect to the effective weights (w⊙I(m)), while the gradient with respect to the masks involves the gradient of the loss with respect to the effective weights, as well as the gradient of the regularization term.
The text also notes that when the mask value is 0, the weight gradient is consistently zero, while when the mask value is 1, the weight gradient becomes equivalent to the gradient of the loss with respect to the effective weights. This behavior can lead to accuracy drops and stability issues in certain applications.
Appendix F Extension to N:M pruning
The paper discusses how the original SMART pruning algorithm, designed for block and output channel pruning, can be modified to work with N:M pruning. The key differences between N:M pruning and block/output channel pruning are:
 N:M pruning does not involve issues related to crosslayer sparsity distribution.
 Since N:M pruning is a finegrained technique, using an importance mask would lead to a mask size comparable to the weight size.
To adapt SMART pruning for N:M pruning, the original problem formulation in Equation (5) was revised, as shown in Equation (14). The main change is the removal of the importance mask parameters.
The modified algorithm, as illustrated in Algorithm 2, closely aligns with the PDP algorithm, differing in two aspects:
 SMART pruning uses a soft topk operator to generate the probability mask, while PDP uses SoftMax.
 SMART pruning adopts a dynamic temperature parameter trick, while PDP uses fixed temperature parameters.
Experimental studies show that SMART pruning still outperforms PDP in N:M pruning tasks, although the margin is relatively small.
Appendix G The Accumulated Weight and Gradient (AWG) Pruning Algorithm
The Accumulated Weight and Gradient pruner is an advanced variant of the magnitudebased pruning method. It uses a combination of pretrained weights and accumulated gradients to determine the importance of each layer in the model. The pruning process consists of three stages:

The training data is fed to the pretrained model, and the importance scores are calculated and tracked using an exponential moving average. The least important blocks of weights are then pruned.

The model is finetuned on the training set for a number of epochs, with the pruning masks kept unchanged.

Finally, the model is finetuned for a few more epochs.
The algorithm uses several parameters, such as the number of iterative pruning steps, the finetuning epochs, and the decay factor for the exponential moving average. The importance score for each layer is a product of the pretrained weight, the accumulated gradient, and a scaling factor that favors layers with high sparsity.
Appendix H Hyperparameter Settings
This section summarizes the hyperparameter settings used for different network architectures and pruning methods. The key details include:
Network architectures: ResNet50, ACDC, PAS, PDP, and Ours (SMART) for object detection, and BiSeNet v2, ViT, and SwinTransformer for other tasks.
Batch size: Ranges from 16 to 256, depending on the network and pruning method.
Epochs: 100200 epochs, except for BiSeNet v2 which uses 134.6 epochs.
Warmup epochs: Varies from 0 to 20, depending on the network and pruning method.
Main optimizer and scheduler: AdamW or SGD with cosine or lambdaLR scheduling.
Mask optimizer and scheduler: Same as the main optimizer/scheduler in most cases, except for PAS which uses a separate SGD mask optimizer.
Pruningspecific parameters: Differ for each pruning method, such as maximum sparsity per layer, number of compression/decompression steps, number of search epochs, and temperature scheduling.
The provided table summarizes these hyperparameter settings in a concise manner.
This summary was produced with help from an AI and may contain inaccuracies  check out the links to read the original source documents!
Related Papers
Automatic Channel Pruning for MultiHead Attention
Eunho Lee, Youngbae Hwang
0
0
Despite the strong performance of Transformers, their quadratic computation complexity presents challenges in applying them to vision tasks. Automatic pruning is one of effective methods for reducing computation complexity without heuristic approaches. However, directly applying it to multihead attention is not straightforward due to channel misalignment. In this paper, we propose an automatic channel pruning method to take into account the multihead attention mechanism. First, we incorporate channel similaritybased weights into the pruning indicator to preserve more informative channels in each head. Then, we adjust pruning indicator to enforce removal of channels in equal proportions across all heads, preventing the channel misalignment. We also add a reweight module to compensate for information loss resulting from channel removal, and an effective initialization step for pruning indicator based on difference of attention between original structure and each channel. Our proposed method can be used to not only original attention, but also linear attention, which is more efficient as linear complexity with respect to the number of tokens. On ImageNet1K, applying our pruning method to the FLattenTransformer, which includes both attention mechanisms, shows outperformed accuracy for several MACs compared with previous stateoftheart efficient models and pruned methods. Code will be available soon.
6/3/2024
Dataindependent Moduleaware Pruning for Hierarchical Vision Transformers
Yang He, Joey Tianyi Zhou
0
0
Hierarchical vision transformers (ViTs) have two advantages over conventional ViTs. First, hierarchical ViTs achieve linear computational complexity with respect to image size by local selfattention. Second, hierarchical ViTs create hierarchical feature maps by merging image patches in deeper layers for dense prediction. However, existing pruning methods ignore the unique properties of hierarchical ViTs and use the magnitude value as the weight importance. This approach leads to two main drawbacks. First, the local attention weights are compared at a global level, which may cause some locally important weights to be pruned due to their relatively small magnitude globally. The second issue with magnitude pruning is that it fails to consider the distinct weight distributions of the network, which are essential for extracting coarse to finegrained features at various hierarchical levels. To solve the aforementioned issues, we have developed a Dataindependent ModuleAware Pruning method (DIMAP) to compress hierarchical ViTs. To ensure that local attention weights at different hierarchical levels are compared fairly in terms of their contribution, we treat them as a module and examine their contribution by analyzing their information distortion. Furthermore, we introduce a novel weight metric that is solely based on weights and does not require input images, thereby eliminating the dependence on the patch merging process. Our method validates its usefulness and strengths on Swin Transformers of different sizes on ImageNet1k classification. Notably, the top5 accuracy drop is only 0.07% when we remove 52.5% FLOPs and 52.7% parameters of SwinB. When we reduce 33.2% FLOPs and 33.2% parameters of SwinS, we can even achieve a 0.8% higher relative top5 accuracy than the original model. Code is available at: https://github.com/hey/DataindependentModuleAwarePruning
4/23/2024
Decay Pruning Method: Smooth Pruning With a SelfRectifying Procedure
Minghao Yang, Linlin Gao, Pengyuan Li, Wenbo Li, Yihong Dong, Zhiying Cui
0
0
Current structured pruning methods often result in considerable accuracy drops due to abrupt network changes and loss of information from pruned structures. To address these issues, we introduce the Decay Pruning Method (DPM), a novel smooth pruning approach with a selfrectifying mechanism. DPM consists of two key components: (i) Smooth Pruning: It converts conventional singlestep pruning into multistep smooth pruning, gradually reducing redundant structures to zero over N steps with ongoing optimization. (ii) SelfRectifying: This procedure further enhances the aforementioned process by rectifying suboptimal pruning based on gradient information. Our approach demonstrates strong generalizability and can be easily integrated with various existing pruning methods. We validate the effectiveness of DPM by integrating it with three popular pruning methods: OTOv2, Depgraph, and Gate Decorator. Experimental results show consistent improvements in performance compared to the original pruning methods, along with further reductions of FLOPs in most scenarios.
6/7/2024
🏷️
ZeroTPrune: ZeroShot Token Pruning through Leveraging of the Attention Graph in PreTrained Transformers
Hongjie Wang, Bhishma Dedhia, Niraj K. Jha
0
0
Deployment of Transformer models on edge devices is becoming increasingly challenging due to the exponentially growing inference cost that scales quadratically with the number of tokens in the input sequence. Token pruning is an emerging solution to address this challenge due to its ease of deployment on various Transformer backbones. However, most token pruning methods require computationally expensive finetuning, which is undesirable in many edge deployment cases. In this work, we propose ZeroTPrune, the first zeroshot method that considers both the importance and similarity of tokens in performing token pruning. It leverages the attention graph of pretrained Transformer models to produce an importance distribution for tokens via our proposed Weighted Page Rank (WPR) algorithm. This distribution further guides token partitioning for efficient similaritybased pruning. Due to the elimination of the finetuning overhead, ZeroTPrune can prune large models at negligible computational cost, switch between different pruning configurations at no computational cost, and perform hyperparameter tuning efficiently. We evaluate the performance of ZeroTPrune on vision tasks by applying it to various vision Transformer backbones and testing them on ImageNet. Without any finetuning, ZeroTPrune reduces the FLOPs cost of DeiTS by 34.7% and improves its throughput by 45.3% with only 0.4% accuracy loss. Compared with stateoftheart pruning methods that require finetuning, ZeroTPrune not only eliminates the need for finetuning after pruning but also does so with only 0.1% accuracy loss. Compared with stateoftheart finetuningfree pruning methods, ZeroTPrune reduces accuracy loss by up to 49% with similar FLOPs budgets. Project webpage: https://jhalab.github.io/zerotprune.
4/9/2024