ResRep: Lossless CNN Pruning via Decoupling Remembering and Forgetting

ResRep: Novel Method for Lossless Channel Pruning
Inspiration from Neurobiology
Re-parameterization into Remembering and Forgetting Parts
Independence of Remembering and Forgetting
Remembering Parts Maintain Performance
Forgetting Parts Learn to Prune
Training with Regular SGD
Novel Update Rule with Penalty Gradients
Realize Structured Sparsity
Merge Remembering and Forgetting Parts
Original Architecture with Narrower Layers
Structural Re-parameterization Application
Distinction from Traditional Learning-Based Pruning
Penalty on Parameters May Suppress Essential Remembering
Results: Slimmed Down ResNet-50
76.15% Accuracy on ImageNet
45% FLOPs Reduction
No Accuracy Drop
First to Achieve Lossless Pruning with High Compression Ratio


  • We propose to re-parameterize CNN into remembering parts and forgetting parts,

    • where the former learn to maintain the performance

    • and the latter learn to prune.

  • Via training with regular SGD on the former but a novel update rule with penalty gradients on the latter, we realize structured sparsity.

  • Then we equivalently merge the remembering and forgetting parts into the original architecture with narrower layers.

  • In this sense, ResRep can be viewed as a successful application of Structural Re-parameterization.



Global sparse momentum SGD for pruning very deep neural networks


To overcome the drawbacks of the two paradigms discussed above,

  • we intend to explicitly control the eventual compression ratio via end-to-end training

  • by directly altering the gradient flow of momentum SGD to deviate the training direction, so as to achieve a high compression ratio while maintaining accuracy.
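One way to read "altering the gradient flow of momentum SGD" is sketched below. It rests on assumptions that come from our reading of the GSM paper, not from these notes: each parameter is scored by |w · ∂L/∂w|, only the q highest-scoring parameters receive the objective gradient, and every parameter receives weight decay. The name `gsm_step` and its arguments are hypothetical.

```python
def gsm_step(w, grad, z, q, lr=0.1, momentum=0.9, weight_decay=1e-4):
    """One momentum-SGD step in which only the q top-scoring parameters
    see the objective gradient (hypothetical helper, flat lists of floats)."""
    # Assumed importance score: first-order estimate |w * dL/dw|.
    score = [abs(wi * gi) for wi, gi in zip(w, grad)]
    # Indices of the q highest-scoring parameters (requires q >= 1).
    top = set(sorted(range(len(w)), key=lambda i: score[i], reverse=True)[:q])
    mask = [1.0 if i in top else 0.0 for i in range(len(w))]
    # Momentum buffer: weight decay for every parameter,
    # objective gradient only where the mask is 1.
    z = [momentum * zi + weight_decay * wi + mi * gi
         for zi, wi, mi, gi in zip(z, w, mask, grad)]
    w = [wi - lr * zi for wi, zi in zip(w, z)]
    return w, z, mask


w = [1.0, -0.5, 0.2, 2.0]
grad = [0.3, 0.1, -0.4, 0.05]
w_new, z_new, mask = gsm_step(w, grad, z=[0.0] * 4, q=2)
```

Under these assumptions the unselected parameters are moved only by weight decay, so the number of surviving parameters is governed by q rather than by a penalty strength.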

Approximation metrics
Rewritten SGD Rule

Channel Pruning

  • Our HRank is inspired by the discovery that the average rank of multiple feature maps generated by a single filter is always the same, regardless of the number of image batches CNNs receive.

  • Based on HRank, we develop a method that is mathematically formulated to prune filters with low-rank feature maps.

  • The principle behind our pruning is that low-rank feature maps contain less information, and thus pruned results can be easily reproduced.
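The ranking idea above can be sketched with NumPy. This is a minimal illustration, not the HRank implementation: the function names, the `(N, C, H, W)` layout of the feature maps, and the "drop the `num_pruned` lowest-ranked filters" rule are assumptions made for the example.

```python
import numpy as np


def average_feature_map_rank(feature_maps):
    """feature_maps: array of shape (N, C, H, W). Returns C average ranks."""
    # matrix_rank treats the last two axes as matrices and works on the
    # whole stack at once, giving one rank per (image, filter) pair.
    ranks = np.linalg.matrix_rank(feature_maps)      # shape (N, C)
    return ranks.mean(axis=0)


def filters_to_keep(feature_maps, num_pruned):
    """Indices of the filters left after dropping the lowest-rank ones."""
    avg_rank = average_feature_map_rank(feature_maps)
    order = np.argsort(avg_rank, kind="stable")      # ascending rank
    return np.sort(order[num_pruned:])


# Toy feature maps with known ranks 0, 1 and 4 for filters 0, 1 and 2.
zero = np.zeros((4, 4))
rank_one = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 0.0, -1.0, 2.0])
full = np.eye(4)
feature_maps = np.stack([np.stack([zero, rank_one, full])] * 2)  # (2, 3, 4, 4)

avg_rank = average_feature_map_rank(feature_maps)
kept = filters_to_keep(feature_maps, num_pruned=1)
```

Here the all-zero filter has the lowest average rank and is the one removed.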


Post training 4-bit quantization of convolutional networks for rapid-deployment

We use ACIQ for activation quantization and bias-correction for quantizing weights.

Analytical Clipping for Integer Quantization (ACIQ)

Assuming bit-width M, we would like to quantize the values in the tensor uniformly to 2^M discrete values.
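A minimal sketch of uniform quantization to 2^M values follows. ACIQ derives the clipping threshold analytically; that derivation is not reproduced here, so the threshold `alpha` is simply an input. The symmetric range [-alpha, alpha] and the endpoint-inclusive grid are assumptions of this example, as is the function name.

```python
def uniform_quantize(values, num_bits, alpha):
    """Clip values to [-alpha, alpha] and round them onto
    2**num_bits evenly spaced levels (illustrative helper)."""
    levels = 2 ** num_bits
    step = 2.0 * alpha / (levels - 1)          # distance between levels
    out = []
    for v in values:
        clipped = max(-alpha, min(alpha, v))
        index = round((clipped + alpha) / step)   # integer in [0, levels - 1]
        out.append(index * step - alpha)
    return out


values = [i / 100.0 for i in range(-200, 201)]     # samples in [-2, 2]
quantized = uniform_quantize(values, num_bits=4, alpha=1.0)
num_levels_used = len(set(quantized))
```

With M = 4 the output takes at most 16 distinct values; inputs beyond ±alpha are clipped, which trades clipping error against a finer step inside the range.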

Per-channel bit allocation

Bi-real net

Bi-Real Net: Enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm. It combines 1-bit convolutions with a shortcut that carries the original feature map, so that information is preserved.


  • Shift quantization of weights, which quantizes weight values in a model to powers-of-two or zero, i.e. {0, ±1, ±2, ±4, ...}, is of particular interest, as multiplications in convolutions become much simpler bit-shift operations.

  • Fine-grained pruning, however, is often in conflict with quantization, as pruning introduces various degrees of sparsity to different layers.

  • Linear quantization methods (integers) have uniform quantization levels, whereas non-linear quantization methods (logarithmic, floating-point and shift) have fine levels around zero, with levels growing further apart as values get larger in magnitude.

  • Both linear and non-linear quantization thus provide precision where it is not actually required in the case of a pruned CNN.

  • We address both issues by proposing a new approach to quantizing parameters in CNNs which we call focused quantization (FQ) that mixes shift and re-centralized quantization methods.
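The shift quantization described in the first bullet can be sketched as follows. This is only the plain power-of-two step, not focused quantization itself. The rounding rule (nearest exponent in the log domain, magnitudes below 0.5 mapped to zero) and the `max_exp` cap are assumptions of the example.

```python
import math


def shift_quantize(weights, max_exp=2):
    """Map each weight to 0 or to +/- 2**k with integer k in [0, max_exp]."""
    out = []
    for w in weights:
        mag = abs(w)
        if mag < 0.5:
            # Below half of the smallest non-zero level: quantize to zero.
            out.append(0.0)
            continue
        exp = round(math.log2(mag))            # nearest exponent in log domain
        exp = min(max_exp, max(0, exp))        # keep k inside [0, max_exp]
        out.append(math.copysign(2.0 ** exp, w))
    return out


quantized = shift_quantize([0.1, 0.9, -1.6, 3.1, -5.0, 0.0])

# Multiplying an integer activation by 2**k is a left shift by k bits.
shifted = 13 << 2          # same value as 13 * 4
```

Note how the levels are dense near zero and sparse for large magnitudes, which is the property the later bullets contrast with the weight distribution of a pruned CNN.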


is used to select which quantized component could be better to fit the original distribution.

Knowledge Distillation

Geometry-Aware Distillation for Indoor Semantic Segmentation


Structured Knowledge Distillation for Semantic Segmentation

  • Teacher and Student

  • pair-wise distillation

    • learn association between each pair of pixels

  • pixel-wise distillation

    • simply learn classification results of each pixel

  • holistic distillation

    • Like GAN, with a discriminator network
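The pixel-wise and pair-wise terms above can be sketched in NumPy. The concrete loss forms are assumptions for illustration: KL divergence between per-pixel class distributions for the pixel-wise term, and a squared gap between cosine similarities of pixel pairs for the pair-wise term. The holistic (discriminator-based) term is omitted.

```python
import numpy as np


def log_softmax(logits, axis=0):
    z = logits - logits.max(axis=axis, keepdims=True)   # numerical stability
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def pixel_wise_loss(student_logits, teacher_logits):
    """Mean per-pixel KL(teacher || student); logits: (classes, H, W)."""
    log_t = log_softmax(teacher_logits)
    log_s = log_softmax(student_logits)
    kl = (np.exp(log_t) * (log_t - log_s)).sum(axis=0)  # one value per pixel
    return float(kl.mean())


def pair_wise_loss(student_feat, teacher_feat, eps=1e-12):
    """Mean squared gap between pixel-pair cosine similarities;
    features: (D, H, W)."""
    def similarity(feat):
        v = feat.reshape(feat.shape[0], -1).T            # (H*W, D)
        v = v / (np.linalg.norm(v, axis=1, keepdims=True) + eps)
        return v @ v.T                                   # (H*W, H*W)
    gap = similarity(student_feat) - similarity(teacher_feat)
    return float((gap ** 2).mean())


rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(5, 4, 4))
student_logits = rng.normal(size=(5, 4, 4))
teacher_feat = rng.normal(size=(8, 4, 4))
student_feat = rng.normal(size=(8, 4, 4))
```

Both terms vanish when the student reproduces the teacher. The pair-wise term is also unaffected by rescaling the student features, since it only compares the relations between pixels.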

Unifying Heterogeneous Classifiers with Distillation

  • Merge Heterogeneous Classifiers into a single one.

Typical Pruning Paradigm


Auto-balanced filter pruning for efficient convolutional neural networks

  • The word auto-balanced includes two meanings.

  • On the one hand, according to Eq. 13, the intensity of stimulation on strong filters varies with the weak ones. When the weak filters are zeroed out, the stimulation automatically stops and the training converges.

  • On the other hand, as the weak filters in a certain layer are weakened and the strong ones are stimulated, the representational capacity of the weak part is gradually transferred to the strong part, keeping the whole layer’s representational capacity unharmed.

Resistance & Prunability

We evaluate a training-based pruning method from two aspects.

  • Resistance

    • We say a model has high resistance if the performance remains high during training.

  • Prunability

    • If the model endures a high pruning ratio with low performance drop, we say it has high prunability.

We desire both high resistance and prunability, but the traditional penalty-based paradigm naturally suffers from a resistance-prunability trade-off.

Two Key Components

ResRep comprises two key components: Convolutional Re-parameterization (Rep, the methodology of decoupling and the corresponding equivalent conversion) and Gradient Resetting (Res, the update rule for “forgetting”).


The lottery ticket hypothesis: Finding sparse, trainable neural networks

This hypothesis suggests that, when training a large neural network, successful training is equivalent to finding a small "winning" subnetwork (termed the "lottery ticket") within the large network. These subnetworks are sparse, i.e., only a small fraction of their parameters is non-zero.


Some methods repeat pruning-finetuning iterations to measure the importance and prune progressively.

A major drawback is that the pruned models can easily be trapped in bad local minima, and sometimes cannot even reach accuracy comparable to that of a counterpart with the same structure trained from scratch.

This discovery highlights the significance of perfect pruning, which eliminates the need for fine-tuning.

ResRep for Lossless Channel Pruning

Convolutional Re-parameterization

For every conv layer we want to prune, together with its following BN (if any), which we refer to as a target layer, we append a compactor (a 1 × 1 conv) with kernel Q.

Given a well-trained model W, we construct a re-parameterized model Wˆ by initializing the conv-BN with the original weights of W and Q as an identity matrix, so that the re-parameterized model produces outputs identical to those of the original.


Details on how to calculate them can be found in the paper.
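The identity initialization and the equivalent conversion can be checked numerically with a small NumPy sketch. Several simplifications are assumed here: BN is taken as already fused into the conv's kernel and bias, the compactor has no bias, and a naive "valid" convolution stands in for a real conv layer. All function names are illustrative.

```python
import numpy as np


def conv2d(x, kernel, bias):
    """Naive 'valid' conv. x: (C, H, W), kernel: (D, C, kh, kw), bias: (D,)."""
    d, _, kh, kw = kernel.shape
    h_out, w_out = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.empty((d, h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[:, i:i + kh, j:j + kw]
            # Contract the (C, kh, kw) axes of the kernel with the patch.
            out[:, i, j] = np.tensordot(kernel, patch, axes=3) + bias
    return out


def compactor(y, q):
    """1 x 1 conv without bias: mixes channels with q of shape (D_out, D)."""
    return np.einsum("ed,dhw->ehw", q, y)


def merge(kernel, bias, q):
    """Fold the compactor into the preceding conv."""
    return np.einsum("ed,dckl->eckl", q, kernel), q @ bias


rng = np.random.default_rng(0)
x = rng.normal(size=(3, 6, 6))
kernel, bias = rng.normal(size=(4, 3, 3, 3)), rng.normal(size=4)
original_out = conv2d(x, kernel, bias)

# Identity initialization: the re-parameterized model matches the original.
same_as_original = np.allclose(compactor(original_out, np.eye(4)), original_out)

# Stand-in for a trained compactor whose channels 1 and 3 were driven to zero.
q = rng.normal(size=(4, 4))
q[[1, 3]] = 0.0
keep = np.linalg.norm(q, axis=1) > 1e-5
merged_kernel, merged_bias = merge(kernel, bias, q[keep])

narrow_out = conv2d(x, merged_kernel, merged_bias)     # single narrower conv
two_layer_out = compactor(original_out, q)[keep]       # conv + compactor
```

The merged conv has two output channels instead of four and reproduces the surviving channels of the conv-plus-compactor pair. In a full network the matching input channels of the next layer would have to be removed as well.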

Gradient Resetting

We describe how to produce structured sparsity in compactors while maintaining the accuracy, beginning by discussing the traditional usage of penalty on a specific kernel K to make the magnitude of some channels smaller, i.e.

we denote a specific channel in K by

competence-based importance evaluation


Dilemma Inside

  • Problem A: The penalty deviates the parameters of every channel from the optima of the objective function.

    • Notably, a mild deviation may not bring negative effects; e.g., L2 regularization can also be viewed as a mild deviation.

    • However, with a strong penalty, though some channels are zeroed out for pruning, the remaining channels are also made too small to maintain the representational capacity, which is an undesired side effect.

  • Problem B: With mild penalty for the high resistance, we cannot achieve high prunability, because most of the channels merely become closer to 0 than they used to be, but not close enough for perfect pruning.


We propose to achieve high prunability with a mild penalty by resetting the gradients derived from the objective function.
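A toy sketch of the idea follows. The penalty gradient is assumed to take the group-Lasso form channel / ||channel||, and resetting is modelled as a binary mask on the objective gradient. Both assumptions come from our reading of the ResRep paper, not from these notes. The quadratic objective and all hyperparameter values are invented for illustration.

```python
import math


def reset_gradient_step(channel, objective_grad, mask,
                        lr=0.1, penalty=0.1, eps=1e-12):
    """One SGD step on a single channel.
    mask = 1 keeps the objective gradient, mask = 0 resets it to zero.
    The penalty gradient (assumed group-Lasso form) is always applied."""
    norm = math.sqrt(sum(c * c for c in channel))
    return [c - lr * (mask * g + penalty * c / (norm + eps))
            for c, g in zip(channel, objective_grad)]


def train(mask, steps=500):
    """Toy objective 0.5 * ||channel - optimum||^2, started at the optimum."""
    optimum = [1.0, -2.0, 0.5]
    channel = list(optimum)
    for _ in range(steps):
        grad = [c - o for c, o in zip(channel, optimum)]
        channel = reset_gradient_step(channel, grad, mask)
    return math.sqrt(sum(c * c for c in channel))


remembered_norm = train(mask=1)   # objective gradient kept
forgotten_norm = train(mask=0)    # objective gradient reset to zero
```

In this toy run the channel with mask 0 shrinks to a norm close to zero, because nothing competes with the penalty. The channel with mask 1 stays near its optimum under the same mild penalty.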


The Remembering Parts Remember Always, the Forgetting Parts Forget Progressively

To combine Res with Rep, we need to decide which channels of Q to zero out.

