Deep Neural Network (DNN) pruning has emerged as a key strategy to reduce model size, improve inference latency, and lower power consumption on DNN accelerators. Among various pruning techniques, block and output channel pruning have shown significant potential in accelerating hardware performance. However, their accuracy often requires further improvement. In response to this challenge, we introduce a separate, dynamic and differentiable (SMART) pruner. This pruner stands out by utilizing a separate, learnable probability mask for weight importance ranking, employing a differentiable Top k operator to achieve target sparsity, and leveraging a dynamic temperature parameter trick to escape from non-sparse local minima. In our experiments, the SMART pruner consistently demonstrated its superiority over existing pruning methods across a wide range of tasks and models on block and output channel pruning. Additionally, we extend our testing to Transformer-based models in N:M pruning scenarios, where SMART pruner also yields state-of-the-art results, demonstrating its adaptability and robustness across various neural network architectures, and pruning types.

## Introduction

This paper focuses on improving the deployment efficiency of deep neural networks (DNNs) on resource-constrained edge devices. One popular approach for this is neural network pruning, which removes less significant parameters from the network to conserve computational resources. The paper discusses various hardware-supported pruning techniques, including N:M pruning, block pruning, and output channel pruning.

The paper then reviews the limitations of existing pruning methods, such as the risk of pruning important weights in criteria-based approaches, the inability of weight-frozen learnable mask methods to adapt to weight changes, and the "STE learning delay problem" in straight-through estimator (STE)-based approaches.

To address these challenges, the authors propose a novel pruning method called SMART pruning. SMART uses a separate learnable soft probability mask to rank the importance of weights across different layers, which helps capture cross-layer dependencies more effectively. It also employs a differentiable Top-k operator and a dynamic temperature parameter trick to facilitate the mask's convergence to a binary mask without the risk of getting stuck in a non-sparse local minimum.

Compared to prior techniques, SMART offers several advantages: it integrates weight importance ranking into the training process, dynamically adapts the mask to weight changes, eliminates the need for complicated auxiliary model training, and provides convergence guarantees. The paper also includes theoretical analysis and experimental results demonstrating the state-of-the-art performance of SMART pruning across various models and tasks.

## Methodology

The text introduces the methodology of the SMART pruner. Section 2.1 presents the differentiable Top k operator, which is a key component of the SMART pruner. Section 2.2 explains the SMART pruning algorithm, including the problem formulation and training flow. Finally, Section 2.3 discusses the dynamic temperature parameter trick and its role in avoiding non-sparse local minima.

![Figure 1: Illustration of the SMART pruner. In the forward pass, the original weights (top left) are element-wise multiplied by a rescaled importance mask, fτ⁢(m)subscript𝑓𝜏𝑚f_{\tau}(m)italic_f start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT ( italic_m ), generated from a differentiable top-k𝑘kitalic_k operator, to obtain the masked weights matrix (top right). In the backward pass, the original weights parameter w𝑤witalic_w and mask parameter m𝑚mitalic_m are updated via backpropagation.](https://arxiv.org/html/2403.19969v1/extracted/5503058/Figure1.jpg)

*Figure 1: Illustration of the SMART pruner. In the forward pass, the original weights (top left) are element-wise multiplied by a rescaled importance mask, fτ⁢(m)subscript𝑓𝜏𝑚f_{\tau}(m)italic_f start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT ( italic_m ), generated from a differentiable top-k𝑘kitalic_k operator, to obtain the masked weights matrix (top right). In the backward pass, the original weights parameter w𝑤witalic_w and mask parameter m𝑚mitalic_m are updated via backpropagation.*

The paper discusses a differentiable Top k operator and a SMART pruning algorithm that leverages this operator. The key points are:

1. The standard Top k operator is non-differentiable, which hinders its use in gradient-based learning frameworks. To address this, the paper introduces a differentiable Top k operator based on the sigmoid function. This operator can approximate the standard Top k operator with arbitrary precision by adjusting a temperature parameter.

2. The paper formulates the pruning problem as an optimization task and proposes the SMART pruning algorithm to solve it. SMART uses the differentiable Top k operator to identify the optimal pruning structure. It consists of three stages: pretraining to develop an initial model, structural searching to determine the pruning structure, and fine-tuning to enhance performance.

3. To prevent the algorithm from converging to non-sparse local minima, the paper introduces a dynamic temperature parameter trick. This involves continuously reducing the temperature parameter during the structural searching phase, which induces fluctuations that help the algorithm escape non-sparse solutions.

4. Theorems and propositions are provided to analyze the properties of the differentiable Top k operator and the effectiveness of the dynamic temperature parameter trick in facilitating sparse solutions.

The SMART pruning algorithm aims to directly solve the fundamental pruning problem, mitigating the impact of regularization bias and outperforming existing pruning methods.

## Experimental Results

The paper compares a SMART pruning algorithm with state-of-the-art block, output channel, and N:M pruning schemes across various computer vision tasks and models. The authors made minor modifications to some models to optimize hardware compatibility on a top-tier Chinese autonomous driving chip. All models were trained from scratch, and detailed hyperparameters are available in the appendix.

The study benchmarked the SMART pruning algorithm against recent advancements, including PDP, ACDC, PaS, and AWG, on classification, object detection, and image segmentation tasks. The authors used Resnet50, Yolov5m, and a variant of BiSeNetv2 on ImageNet, COCO, and Cityscapes datasets, respectively. The block shape was defined as 16x8x1x1 to ensure compatibility with the target hardware.

The primary goal was to balance model accuracy and latency, using sparsity as a proxy for latency. The experimental results, summarized in Table 2, show that the SMART pruning algorithm outperforms the benchmark methods across all the evaluated models and sparsity levels (30%, 50%, and 70%). The distinct layer-wise sparsity patterns exhibited by the SMART pruner, as depicted in Figure 4, suggest that its training flow influences the cross-layer sparsity distribution, contributing to its superior performance.

![Figure 4: Cross-layer sparsity distribution in block pruning. The x-axis represents the layer, the y-axis indicates sparsity, and different colors correspond to different models. The method names are shown at the top.](https://arxiv.org/html/2403.19969v1/extracted/5503058/Figure4.jpg)

*Figure 4: Cross-layer sparsity distribution in block pruning. The x-axis represents the layer, the y-axis indicates sparsity, and different colors correspond to different models. The method names are shown at the top.*

The provided text discusses the benefits and drawbacks of output channel pruning, which involves eliminating entire output channels in neural networks. Output channel pruning is advantageous as it reduces data I/O and can provide hardware-agnostic acceleration, meaning it can improve performance on any hardware without requiring specialized designs. However, it typically results in worse accuracy compared to other structural pruning methods at equivalent sparsity levels due to its larger granularity.

The text then compares five different pruning methods - SMART, PDP, PaS, AWG, and ACDC - across three models: Yolov5m, Resnet50, and BiSeNetv2. This comparison was conducted at output channel sparsity levels of 20% and 50%. The results demonstrate that the SMART pruning algorithm is significantly more effective than the other methods in the context of output channel pruning. The superior performance of the SMART pruner is attributed to its ability to achieve better cross-layer sparsity, which becomes more critical with the larger granularity of output channel pruning.

![Figure 5: Cross-layer sparsity distribution in output channel pruning. The graph definition is the same as Figure 4.](https://arxiv.org/html/2403.19969v1/extracted/5503058/Figure5.jpg)

*Figure 5: Cross-layer sparsity distribution in output channel pruning. The graph definition is the same as Figure 4.*

The provided text summarizes the application of various pruning algorithms, including the SMART pruning method, to Transformer-based models such as the Swin Transformer and Vision Transformer (ViT). The experiments used a 2:4 pruning ratio, and the performance metric was Top 1 accuracy. The results show that the SMART pruning algorithm performs well on Transformer-based models in the context of N:M pruning, demonstrating the algorithm's versatility and robustness beyond convolutional neural network (CNN) architectures.

## Conclusion

The paper proposes a new SMART pruner that improves block and output channel pruning. The SMART pruner ranks the importance of weights based on a learnable probability mask to better capture cross-layer correlations. It also uses a dynamic temperature parameter trick to overcome non-sparse local minima, further enhancing the pruner's effectiveness. Empirical studies show the SMART pruner outperforms existing methods across computer vision tasks and models, particularly for block and output channel pruning. Additionally, the SMART pruner demonstrates superiority in N:M pruning for Transformer-based models, showcasing its adaptability and robustness.

## Appendix A Proof of Theorem 2.1

The provided text proves Theorem 2.1 by contradiction. It defines a sequence of values z_i and shows that in the limit as τ goes to 0, the sum of the sigmoid functions of the z_i values must be equal to k, a constant. However, this leads to a contradiction, so the theorem must be true.

The key steps are:
1) Show that the difference between adjacent z_i values goes to infinity as τ goes to 0.
2) Show that the sum of the sigmoid functions of the z_i values equals k.
3) Show that the sigmoid function of z_(N-k) goes to 0 and the sigmoid function of z_(N-k+1) goes to 1 as τ goes to 0.
4) This leads to a contradiction, since the sum of the sigmoid functions cannot equal k, so the theorem must be true.

There is no text provided for a summary after this section.

## Appendix B Proof of Proposition 2.2

The provided text outlines the calculation of the gradient of the function f_i(x). The key steps are:

1. The gradient of f_i(x) with respect to x_j is derived as:
d/dx_j f_i(x) = (1/τ)σ'((x_i/τ) + t(x_1, ..., x_N))*(I_[j=i] - σ'((x_j/τ) + t(x_1, ..., x_N))/Σ_i σ'((x_i/τ) + t(x_1, ..., x_N)))

2. The term d/dx_j t(x_1, ..., x_N) is obtained as:
d/dx_j t(x_1, ..., x_N) = -σ'((x_j/τ) + t(x_1, ..., x_N))/Σ_i σ'((x_i/τ) + t(x_1, ..., x_N))

3. By defining the vector v = [σ'((x_1/τ) + t(x_1, ..., x_N)), ..., σ'((x_N/τ) + t(x_1, ..., x_N))]^T, the Jacobian matrix J_Top_k(x) is given as:
J_Top_k(x) = (1/τ)*(diag(v) - v*v^T)

The provided equations and derivations allow for the calculation of the gradient of the function f_i(x) with respect to the input variables.

## Appendix C Proof of Theorem 2.3

The provided text presents a proof by contradiction. The main idea is to assume the existence of a vector w** that has a lower loss function value than the global optimum solution (w*, m*), which contradicts the premise that (w*, m*) is the global optimum. The proof involves defining w** and m** in terms of the assumed vector w^**, and then showing that L(w** ⊙ Top_k(m**)) < L(w* ⊙ Top_k(m*)), which is a contradiction.

## Appendix D Proof of Theorem 2.4

The proof shows that the differentiable Top-k function fᵧ(m) can approximate the standard Top-k function Top₆(m) with arbitrary precision. Specifically, there exists a τ such that for any τ in the range (0, τ̃], the difference between the two functions is less than Δ₁/Κ₁, where Δ₁ is the difference in the loss function L between the two Top-k approximations, and Κ₁ is the Lipschitz constant of L.

This allows the inequality L(w̃ ⊙ fᵧ(m̃)) - L(w* ⊙ fᵧ(m*)) ≥ Δ₁ - Δ₁ = 0 to hold. The proof concludes with this result.

## Appendix E Illustration of STE Learning Delay Issue

The provided text summarizes the loss function and gradients used in typical straight-through estimator (STE)-based methods. The loss function is mathematically formulated as the minimization of a primary loss term L(w⊙I(m)) plus a regularization term λR(m), where w and m represent the weight and mask parameters, respectively, I(m) is the rounding operator that rounds the mask values to 0 or 1, and λ is a tuning parameter.

The gradients of this loss function with respect to the weights (w) and masks (m) are also derived. The gradient with respect to the weights is the gradient of the loss with respect to the effective weights (w⊙I(m)), while the gradient with respect to the masks involves the gradient of the loss with respect to the effective weights, as well as the gradient of the regularization term.

The text also notes that when the mask value is 0, the weight gradient is consistently zero, while when the mask value is 1, the weight gradient becomes equivalent to the gradient of the loss with respect to the effective weights. This behavior can lead to accuracy drops and stability issues in certain applications.

![Figure 6: Illustration of STE learning delay issue.](https://arxiv.org/html/2403.19969v1/extracted/5503058/Figure6.jpg)

*Figure 6: Illustration of STE learning delay issue.*

## Appendix F Extension to N:M pruning

The paper discusses how the original SMART pruning algorithm, designed for block and output channel pruning, can be modified to work with N:M pruning. The key differences between N:M pruning and block/output channel pruning are:

1. N:M pruning does not involve issues related to cross-layer sparsity distribution.
2. Since N:M pruning is a fine-grained technique, using an importance mask would lead to a mask size comparable to the weight size. 

To adapt SMART pruning for N:M pruning, the original problem formulation in Equation (5) was revised, as shown in Equation (14). The main change is the removal of the importance mask parameters.

The modified algorithm, as illustrated in Algorithm 2, closely aligns with the PDP algorithm, differing in two aspects:
1. SMART pruning uses a soft top-k operator to generate the probability mask, while PDP uses SoftMax.
2. SMART pruning adopts a dynamic temperature parameter trick, while PDP uses fixed temperature parameters.

Experimental studies show that SMART pruning still outperforms PDP in N:M pruning tasks, although the margin is relatively small.

## Appendix G The Accumulated Weight and Gradient (AWG) Pruning Algorithm

The Accumulated Weight and Gradient pruner is an advanced variant of the magnitude-based pruning method. It uses a combination of pre-trained weights and accumulated gradients to determine the importance of each layer in the model. The pruning process consists of three stages:

1. The training data is fed to the pre-trained model, and the importance scores are calculated and tracked using an exponential moving average. The least important blocks of weights are then pruned.

2. The model is fine-tuned on the training set for a number of epochs, with the pruning masks kept unchanged.

3. Finally, the model is fine-tuned for a few more epochs.

The algorithm uses several parameters, such as the number of iterative pruning steps, the fine-tuning epochs, and the decay factor for the exponential moving average. The importance score for each layer is a product of the pre-trained weight, the accumulated gradient, and a scaling factor that favors layers with high sparsity.

## Appendix H Hyperparameter Settings

This section summarizes the hyperparameter settings used for different network architectures and pruning methods. The key details include:

Network architectures: ResNet50, ACDC, PAS, PDP, and Ours (SMART) for object detection, and BiSeNet v2, ViT, and Swin-Transformer for other tasks.

Batch size: Ranges from 16 to 256, depending on the network and pruning method.

Epochs: 100-200 epochs, except for BiSeNet v2 which uses 134.6 epochs.

Warm-up epochs: Varies from 0 to 20, depending on the network and pruning method.

Main optimizer and scheduler: AdamW or SGD with cosine or lambdaLR scheduling.

Mask optimizer and scheduler: Same as the main optimizer/scheduler in most cases, except for PAS which uses a separate SGD mask optimizer.

Pruning-specific parameters: Differ for each pruning method, such as maximum sparsity per layer, number of compression/decompression steps, number of search epochs, and temperature scheduling.

The provided table summarizes these hyperparameter settings in a concise manner.