Paper Reading Help

ResRep: Lossless CNN Pruning via Decoupling Remembering and Forgetting

ResRep: Novel Method for Lossless Channel Pruning
Inspiration from Neurobiology
Re-parameterization into Remembering and Forgetting Parts
Independence of Remembering and Forgetting
Remembering Parts Maintain Performance
Forgetting Parts Learn to Prune
Training with Regular SGD
Novel Update Rule with Penalty Gradients
Realize Structured Sparsity
Merge Remembering and Forgetting Parts
Original Architecture with Narrower Layers
Structural Re-parameterization Application
Distinction from Traditional Learning-Based Pruning
Penalty on Parameters May Suppress Essential Remembering
Results: Slimmed Down ResNet-50
76.15% Accuracy on ImageNet
Only 45% of the Original FLOPs
No Accuracy Drop
First to Achieve Lossless Pruning with High Compression Ratio

Abstract

  • We propose to re-parameterize CNN into remembering parts and forgetting parts,

    • where the former learn to maintain the performance

    • and the latter learn to prune

  • Via training with regular SGD on the former but a novel update rule with penalty gradients on the latter, we realize structured sparsity.

  • Then we equivalently merge the remembering and forgetting parts into the original architecture with narrower layers.

  • In this sense, ResRep can be viewed as a successful application of Structural Re-parameterization

Introduction

sparsity

Global sparse momentum SGD for pruning very deep neural networks

https://arxiv.org/abs/1909.12778

image_20240125_104700.png

To overcome the drawbacks of the two paradigms discussed above

  • we intend to explicitly control the eventual compression ratio via end-to-end training

  • by directly altering the gradient flow of momentum SGD to deviate the training direction, in order to achieve a high compression ratio as well as maintain the accuracy

Approximation metrics
image_20240125_105600.png
Rewritten SGD Rule
image_20240125_110100.png
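A minimal sketch of this kind of rewritten momentum update, assuming a PyTorch-style setting; the saliency metric |g · w|, the active ratio, and the function name are illustrative choices, not the authors' exact implementation:

```python
import torch

def gsm_momentum_step(params, grads, momentum_buf, lr=0.01, beta=0.9,
                      weight_decay=1e-4, active_ratio=0.5):
    """One Global-Sparse-Momentum-style update (a sketch, not the authors' code).

    Saliency of each parameter is approximated by |g * w| (first-order Taylor term).
    Only the globally top-`active_ratio` fraction of parameters receive the objective
    gradient; the rest receive only weight decay, so they shrink toward zero and
    become prunable.
    """
    flat_saliency = torch.cat([(g * w).abs().flatten()
                               for g, w in zip(grads, params)])
    k = int(active_ratio * flat_saliency.numel())
    threshold = torch.topk(flat_saliency, k).values.min()

    for w, g, z in zip(params, grads, momentum_buf):
        mask = ((g * w).abs() >= threshold).float()    # 1 = active, 0 = passive
        z.mul_(beta).add_(g * mask + weight_decay * w) # passive params: decay only
        w.sub_(lr * z)
```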

Channel Pruning

https://arxiv.org/abs/2002.10179

  • Our HRank is inspired by the discovery that the average rank of multiple feature maps generated by a single filter is always the same, regardless of the number of image batches CNNs receive.

  • Based on HRank, we develop a method that is mathematically formulated to prune filters with low-rank feature maps.

  • The principle behind our pruning is that low-rank feature maps contain less information, and thus pruned results can be easily reproduced.
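As a rough illustration of the HRank criterion, here is a sketch that estimates the average rank of the feature maps each filter produces over a few batches; `model_layer`, the batching, and the use of `torch.linalg.matrix_rank` are my assumptions, not the authors' code:

```python
import torch

@torch.no_grad()
def average_feature_map_rank(model_layer, data_batches):
    """Estimate, for each filter, the average rank of its feature maps over a few
    input batches. Filters with low average rank are pruned first.
    `model_layer` is assumed to map a batch to features of shape (N, C, H, W)."""
    ranks, count = None, 0
    for x in data_batches:
        fm = model_layer(x)                              # (N, C, H, W)
        n, c, h, w = fm.shape
        r = torch.linalg.matrix_rank(fm.reshape(n * c, h, w).float())
        r = r.reshape(n, c).float().sum(dim=0)           # per-channel rank sum
        ranks = r if ranks is None else ranks + r
        count += n
    return ranks / count                                 # low values -> prune candidates
```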

Quantization

Post training 4-bit quantization of convolutional networks for rapid-deployment

https://arxiv.org/abs/1810.05723

We use ACIQ for activation quantization and bias-correction for quantizing weights.

Analytical Clipping for Integer Quantization (ACIQ)

Assuming bit-width M, we would like to quantize the values in the tensor uniformly to 2^M discrete values.

image_20240125_165800.png
image_20240125_171000.png
image_20240125_171200.png
image_20240125_171400.png
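A minimal sketch of ACIQ-style uniform quantization with an analytic clipping value, assuming a Laplace prior on the activations; the clipping multipliers are indicative of the values reported in the paper, and the interface is my own:

```python
import torch

def aciq_uniform_quantize(x, num_bits=4):
    """Clip the tensor to an analytically chosen range [-alpha, alpha] (derived from
    an assumed Laplace prior), then quantize the clipped range uniformly into 2^M
    discrete levels. A sketch, not the authors' implementation."""
    # Laplace scale estimated from the data: b = E|x - mu|
    b = (x - x.mean()).abs().mean()
    # Approximate optimal clipping multipliers for a Laplace prior (treat as indicative).
    alpha_mult = {2: 2.83, 3: 3.89, 4: 5.03}[num_bits]
    alpha = alpha_mult * b

    x_clipped = x.clamp(-alpha, alpha)
    n_levels = 2 ** num_bits
    scale = 2 * alpha / n_levels                          # width of one quantization bin
    q = torch.round(x_clipped / scale).clamp(-n_levels // 2, n_levels // 2 - 1)
    return q * scale                                      # fake-quantized tensor
```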
Per-channel bit allocation
image_20240125_174500.png
Bias-correction
image_20240125_181000.png
image_20240125_181100.png
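The bias-correction step can be sketched as matching, per output channel, the mean and standard deviation of the quantized weights back to those of the original weights; the exact form below is an assumption based on the description above, with `quantize_fn` left abstract:

```python
import torch

def bias_corrected_quantize(weight, quantize_fn):
    """Per-channel bias correction for weight quantization (assumed form). After
    quantizing, each output channel's mean and std are shifted back to match the
    original weights, compensating for the systematic error quantization introduces.
    `weight` has shape (C_out, ...)."""
    w_q = quantize_fn(weight)
    c_out = weight.shape[0]
    w = weight.reshape(c_out, -1)
    q = w_q.reshape(c_out, -1)

    std_ratio = w.std(dim=1, keepdim=True) / (q.std(dim=1, keepdim=True) + 1e-8)
    corrected = (q - q.mean(dim=1, keepdim=True)) * std_ratio + w.mean(dim=1, keepdim=True)
    return corrected.reshape_as(weight)
```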

Bi-real net

https://arxiv.org/abs/1808.00278

Bi-Real Net: Enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm. Key idea: combine the 1-bit convolution output with the original real-valued feature map via a shortcut to preserve information.
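To make the real-valued shortcut concrete, here is a hypothetical sketch of a Bi-Real-style block in PyTorch; the straight-through gradient below is a simplification of the paper's piecewise-polynomial approximation, and the block structure is illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryActivation(torch.autograd.Function):
    """Sign activation with a straight-through-style gradient (a simplification of
    Bi-Real Net's polynomial approximation of sign's derivative)."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * (x.abs() <= 1).float()   # clip gradient outside [-1, 1]

class BiRealBlock(nn.Module):
    """The 1-bit convolution output is added to the real-valued input feature map
    through an identity shortcut, so real-valued information is preserved."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        binary_act = BinaryActivation.apply(x)
        binary_w = torch.sign(self.conv.weight)       # 1-bit weights
        out = F.conv2d(binary_act, binary_w, padding=1)
        return self.bn(out) + x                       # real-valued shortcut
```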

FOCUSED QUANTIZATION FOR SPARSE CNN

https://arxiv.org/abs/1903.03046

  • Shift quantization of weights, which quantizes weight values in a model to powers of two or zero, i.e. {0, ±1, ±2, ±4, ...}, is of particular interest, as multiplications in convolutions become much simpler bit-shift operations.

  • Fine-grained pruning, however, is often in conflict with quantization, as pruning introduces various degrees of sparsity to different layers.

  • Linear quantization methods (integers) have uniform quantization levels, whereas non-linear quantization (logarithmic, floating-point and shift) has fine levels around zero but levels that grow further apart as values get larger in magnitude.

  • Both linear and nonlinear quantization thus provide precision where it is not actually required in the case of a pruned CNN

  • We address both issues by proposing a new approach to quantizing parameters in CNNs which we call focused quantization (FQ) that mixes shift and re-centralized quantization methods.

image_20240125_222800.png
image_20240126_10520.png
image_20240126_110500.png
image_20240126_111900.png
image_20240126_113300.png

The quantity shown above is used to select which quantized component better fits the original distribution.
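A minimal sketch of the shift-quantization component (weights mapped to zero or signed powers of two); the exponent range and the zero-threshold rule are my assumptions for illustration, and the re-centralized component of FQ is omitted:

```python
import torch

def shift_quantize(w, min_exp=-4, max_exp=0):
    """Map each weight to zero or to a signed power of two, so multiplications
    reduce to bit shifts. A sketch with an assumed exponent range."""
    sign = torch.sign(w)
    exp = torch.clamp(torch.round(torch.log2(w.abs().clamp(min=1e-12))),
                      min_exp, max_exp)
    q = sign * torch.pow(2.0, exp)
    # Values too small to represent at min_exp collapse to zero.
    q = torch.where(w.abs() < 2.0 ** (min_exp - 1), torch.zeros_like(q), q)
    return q
```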

Knowledge Distillation

Geometry-Aware Distillation for Indoor Semantic Segmentation

https://ieeexplore.ieee.org/document/8954087

image_20240126_181200.png
image_20240126_181300.png
image_20240126_181400.png

Structured Knowledge Distillation for Semantic Segmentation

https://ieeexplore.ieee.org/document/8954081

  • Teacher and Student

  • pair-wise distillation

    • learn the association between each pair of pixels

  • pixel-wise distillation

    • simply learn the classification result of each pixel

  • holistic distillation

    • Like GAN, with a discriminator network
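As a concrete illustration of the pixel-wise and pair-wise terms, here is a sketch in PyTorch; the temperature, the cosine similarity over pixel embeddings, and the loss choices are my assumptions rather than the paper's exact formulation (the holistic, GAN-based term is omitted):

```python
import torch
import torch.nn.functional as F

def pixelwise_distillation_loss(student_logits, teacher_logits, T=1.0):
    """Pixel-wise distillation (sketch): the student mimics the teacher's class
    distribution independently at every pixel. Logits: (N, C, H, W)."""
    p_t = F.softmax(teacher_logits / T, dim=1)
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)

def pairwise_distillation_loss(student_feat, teacher_feat):
    """Pair-wise distillation (sketch): the student mimics the teacher's pairwise
    similarities between pixel embeddings. Features: (N, C, H, W)."""
    def similarity(feat):
        n, c, h, w = feat.shape
        f = F.normalize(feat.reshape(n, c, h * w), dim=1)   # cosine similarity
        return torch.bmm(f.transpose(1, 2), f)              # (N, HW, HW)
    return F.mse_loss(similarity(student_feat), similarity(teacher_feat))
```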

Unifying Heterogeneous Classifiers with Distillation

https://arxiv.org/abs/1904.06062

  • Merge Heterogeneous Classifiers into a single one.

Typical Pruning Paradigm

image_20240127_151400.png

Auto-balanced filter pruning for efficient convolutional neural networks

https://ojs.aaai.org/index.php/AAAI/article/view/12262

  • The term "auto-balanced" carries two meanings.

  • On the one hand, according to Eq. 13, the intensity of stimulation on strong filters varies with the weak ones. When the weak filters are zeroed out, the stimulation automatically stops and the training converges.

  • On the other hand, as the weak filters in a certain layer are weakened and the strong ones are stimulated, the representational capacity of the weak part is gradually transferred to the strong part, keeping the whole layer’s representational capacity unharmed.
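A hedged sketch of what such an auto-balanced regularizer could look like, based only on the description above (the split into weak/strong filters, the coefficient, and the balancing rule are assumptions, not Eq. 13 itself):

```python
import torch

def auto_balanced_penalty(kernel, weak_idx, strong_idx, lam=1e-4, eps=1e-8):
    """Weak filters receive a positive L2 penalty while strong filters receive a
    negative one (a "stimulation"), scaled so the two terms balance. When the weak
    filters reach zero, the stimulation automatically stops.
    `kernel`: (C_out, C_in, k, k); weak_idx / strong_idx: index tensors."""
    weak_norm = (kernel[weak_idx] ** 2).sum()
    strong_norm = (kernel[strong_idx] ** 2).sum()
    # Scale the stimulation so its magnitude tracks the weak filters' penalty.
    balance = weak_norm.detach() / (strong_norm.detach() + eps)
    return lam * (weak_norm - balance * strong_norm)
```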

Resistance & Prunability

We evaluate a training-based pruning method from two aspects.

  • Resistance

    • We say a model has high resistance if its performance stays high during training.

  • Prunability

    • If the model endures a high pruning ratio with low performance drop, we say it has high prunability.

We desire both high resistance and prunability, but the traditional penalty-based paradigm naturally suffers from a resistance-prunability trade-off.

Two Key Components

ResRep comprises two key components: Convolutional Re-parameterization (Rep, the methodology of decoupling and the corresponding equivalent conversion) and Gradient Resetting (Res, the update rule for “forgetting”).

image_20240127_185100.png

The lottery ticket hypothesis: Finding sparse, trainable neural networks

This hypothesis suggests that when training a large neural network, successful training amounts to finding a small, sparse "winning" subnetwork (the winning ticket) within the large network; in such a subnetwork, only a small fraction of the parameters is non-zero.

Pruning-then-Fine-tuning

Some methods repeat pruning-finetuning iterations to measure the importance and prune progressively.

A major drawback is that the pruned models can easily be trapped in bad local minima, and sometimes cannot even reach the accuracy of a counterpart of the same structure trained from scratch.

This discovery highlights the significance of perfect pruning, which eliminates the need for fine-tuning.

ResRep for Lossless Channel Pruning

Convolutional Re-parameterization

For every conv layer we desire to prune, together with its following BN (if any), referred to as a target layer, we append a compactor (a 1 × 1 conv) with kernel Q, a D × D matrix where D is the number of output channels of the target layer.

Given a well-trained model W, we construct a re-parameterized model Ŵ by initializing the conv-BN with the original weights of W and Q as an identity matrix, so that the re-parameterized model produces outputs identical to the original.

image_20240128_124300.png

Details about how to calculate them can be found in the paper.
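To make the construction concrete, here is a hypothetical PyTorch sketch of a target layer with its compactor and of the equivalent conv-BN-compactor merge; the class and function names are mine, and details such as bias handling and pruned-row removal are simplified relative to the paper:

```python
import torch
import torch.nn as nn

class CompactorLayer(nn.Module):
    """A target layer after Convolutional Re-parameterization: the original conv-BN
    ("remembering" part) followed by a 1x1 compactor Q ("forgetting" part),
    initialized as the identity so the outputs are initially unchanged."""
    def __init__(self, conv: nn.Conv2d, bn: nn.BatchNorm2d):
        super().__init__()
        self.conv, self.bn = conv, bn
        d = conv.out_channels
        self.compactor = nn.Conv2d(d, d, kernel_size=1, bias=False)
        self.compactor.weight.data = torch.eye(d).reshape(d, d, 1, 1)

    def forward(self, x):
        return self.compactor(self.bn(self.conv(x)))

@torch.no_grad()
def merge(conv, bn, compactor_weight):
    """Equivalently fold BN into the conv, then absorb the compactor: rows of Q that
    were zeroed out (and removed) simply disappear, leaving a narrower conv."""
    gamma, beta = bn.weight, bn.bias
    std = (bn.running_var + bn.eps).sqrt()
    w = conv.weight * (gamma / std).reshape(-1, 1, 1, 1)           # BN-folded kernel
    b = beta - bn.running_mean * gamma / std                       # BN-folded bias
    q = compactor_weight.reshape(compactor_weight.shape[0], -1)    # (D', D)
    merged_w = torch.einsum('oc,cihw->oihw', q, w)                 # combine along channels
    merged_b = q @ b
    return merged_w, merged_b
```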

Gradient Resetting

We describe how to produce structured sparsity in the compactors while maintaining the accuracy, beginning with the traditional use of a penalty on a specific kernel K to make the magnitude of some of its channels smaller.

A specific channel of K is denoted as shown below.

competence-based importance evaluation

image_20240128_174200.png

Dilemma Inside

  • Problem A: The penalty deviates the parameters of every channel from the optima of the objective function.

    • Notably, a mild deviation may not bring negative effects; e.g., L2 regularization can also be viewed as a mild deviation.

    • However, with a strong penalty, though some channels are zeroed out for pruning, the remaining channels are also made too small to maintain the representational capacity, which is an undesired side effect.

  • Problem B: With mild penalty for the high resistance, we cannot achieve high prunability, because most of the channels merely become closer to 0 than they used to be, but not close enough for perfect pruning.

Solution

We propose to achieve high prunability with a mild penalty by resetting the gradients derived from the objective function.

image_20240128_175600.png
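A sketch of what such a resetting step could look like for one compactor, assuming a channel-wise group-lasso penalty on Q (an assumption consistent with the description); the mask construction and hyper-parameters are placeholders:

```python
import torch

def reset_compactor_grad(compactor_weight, mask, lam=1e-4, eps=1e-8):
    """Gradient Resetting on a compactor kernel Q of shape (D, D, 1, 1), sketched.
    `mask[j] = 1` for channels that keep remembering, 0 for channels selected for
    forgetting. The objective-related gradient (already in `.grad` after backprop)
    is zeroed for the forgetting channels and a group-lasso penalty gradient is
    added, so those channels are driven exactly to zero while the rest are only
    mildly penalized."""
    q = compactor_weight
    g_obj = q.grad                                           # gradient of the objective
    # Group-lasso gradient: each channel (row of Q) is one group.
    norms = q.reshape(q.shape[0], -1).norm(dim=1).clamp(min=eps)
    g_pen = q / norms.reshape(-1, 1, 1, 1)
    m = mask.reshape(-1, 1, 1, 1).to(q.dtype)
    q.grad = g_obj * m + lam * g_pen
```

The optimizer then performs its usual step on the reset gradients.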

The Remembering Parts Remember Always, the Forgetting Parts Forget Progressively

To combine Res with Rep, we need to decide which channels of Q should be zeroed out.

image_20240128_183300.png
image_20240128_183900.png
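A minimal sketch of one way this global selection could work, assuming the channels with the smallest norms across all compactors are marked for forgetting and that the forgetting set grows progressively toward the target compression ratio (the exact schedule and metric are the paper's, not reproduced here):

```python
import torch

def select_forgetting_channels(compactors, num_to_forget):
    """Mark the globally smallest-norm channels across all compactors as
    "forgetting" (mask = 0). `compactors`: list of kernels of shape (D_i, D_i, 1, 1).
    Returns one mask per compactor, to be fed into the gradient-resetting step."""
    norms = torch.cat([q.reshape(q.shape[0], -1).norm(dim=1) for q in compactors])
    threshold = torch.kthvalue(norms, num_to_forget).values
    return [(q.reshape(q.shape[0], -1).norm(dim=1) > threshold).float()
            for q in compactors]
```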
Last modified: 10 March 2024