DepGraph: Towards Any Structural Pruning
Abstract
The most prominent obstacle towards any structural pruning lies in structural coupling, which not only forces different layers to be pruned simultaneously, but also expects all removed parameters to be consistently unimportant, so as to avoid structural issues and significant performance degradation after pruning.
Introduction
deep neural network compression
Train to find unnecessary connections, prune them, and re-train https://arxiv.org/abs/1506.02626
Knowledge distillation (teacher and student) https://arxiv.org/abs/1503.02531
GNN + Binary Representations + Two Kinds of Aggregators https://arxiv.org/abs/2109.12872
dataset condensation, e.g., via factorization https://arxiv.org/abs/2210.16774
CNN Quantization https://ieeexplore.ieee.org/document/7780890/
pruning schemes
structural pruning
Centripetal SGD: clustering filters in CNNs https://arxiv.org/pdf/1904.03837.pdf
Filter Pruning https://arxiv.org/abs/1608.08710
unstructured pruning
layer-wise pruning https://arxiv.org/abs/1705.07565
magnitude-based pruning https://arxiv.org/abs/2002.04809
Related Work
Structural and Unstructured Pruning
In practice, unstructured pruning is straightforward to implement and inherently adaptable to various networks.
However, it often necessitates specialized AI accelerators or software for model acceleration [15].
Conversely, structural pruning reduces the inference overhead by physically removing parameters from networks, thereby finding a wider domain of applications [29, 38].
In the literature, the design space of pruning algorithms encompasses a range of aspects, including pruning schemes [21, 39], parameter selection [20, 43, 44], layer sparsity [27, 49] and training techniques [47, 58].
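A minimal sketch of the contrast between the two schemes, assuming a single PyTorch Conv2d layer; the threshold, the kept-filter choice, and the manual construction of the smaller layer are illustrative only, and in a real network every layer consuming the pruned output would need to be adjusted consistently, which is exactly the coupling the paper addresses.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(16, 32, kernel_size=3)

# Unstructured pruning: zero out individual weights below a magnitude
# threshold. Shapes are unchanged, so speedups require sparse kernels
# or specialized accelerators.
threshold = 0.05  # illustrative value
mask = (conv.weight.detach().abs() > threshold).float()
conv.weight.data.mul_(mask)

# Structural pruning: physically remove whole output channels (filters).
# The resulting layer is smaller and faster on ordinary dense hardware.
keep = torch.tensor([i for i in range(32) if i % 2 == 0])  # keep half the filters
pruned = nn.Conv2d(16, len(keep), kernel_size=3)
pruned.weight.data = conv.weight.data[keep].clone()
pruned.bias.data = conv.bias.data[keep].clone()
```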
Pruning Grouped Parameters
Method
Dependency in Neural Networks
Four cases of dependency illustrated in the paper: (a) basic dependency, (b) residual dependency, (c) concatenation dependency, (d) reduction dependency.
Dependency Graph
Find a grouping matrix G ∈ {0, 1}^{L×L}, with G_ij = 1 indicating the presence of a dependency between the i-th layer and the j-th layer.
In this matrix, G_ij is not only determined by the i-th and j-th layers but is also affected by the intermediate layers between them.
This recursive process (inferred from w1 ⇔ w2 and w2 ⇔ w3) ultimately ends with a transitive relation, w1 ⇔ w2 ⇔ w3. In this case, we only need two dependencies to describe the relations in group g.
Thus, G can be compressed into a more compact form with fewer edges, the dependency graph D, while retaining the same information.
Formally, D is constructed such that, for all G_ij = 1, there exists a path in D between vertices i and j. Therefore, G_ij can be derived by examining the presence of a path between vertices i and j in D.
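A small sketch of this derivation, assuming D is stored as an undirected adjacency list over layer indices; the function name and the representation are hypothetical, not the paper's implementation.

```python
from collections import deque

def grouping_from_dependency_graph(dep_graph, num_layers):
    """Derive the dense grouping matrix G from a sparse dependency graph D.

    dep_graph: dict mapping a layer index to the layer indices it is directly
    coupled with (edges of D). G[i][j] = 1 iff a path connects i and j in D.
    """
    G = [[0] * num_layers for _ in range(num_layers)]
    for i in range(num_layers):
        # Breadth-first search collects every layer reachable from i,
        # i.e. the transitive closure of the direct dependencies.
        visited, queue = {i}, deque([i])
        while queue:
            u = queue.popleft()
            for v in dep_graph.get(u, []):
                if v not in visited:
                    visited.add(v)
                    queue.append(v)
        for j in visited:
            G[i][j] = 1
    return G

# Example: w1 <=> w2 and w2 <=> w3 (undirected edges stored both ways).
D = {0: [1], 1: [0, 2], 2: [1]}
G = grouping_from_dependency_graph(D, 3)
# The transitive relation w1 <=> w2 <=> w3 appears as G[0][2] == 1.
```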
network decomposition
To remedy these issues, we propose a new notation to decompose a network F(x; w) into finer and more basic components, denoted as F = {f1, f2, ..., fL}
where each component f refers to either a parameterized layer, such as a convolution, or a non-parameterized operation, such as a residual addition.
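A rough sketch of such a decomposition, assuming a torch.fx-traceable model (torchvision's resnet18 here); treating call_module nodes as parameterized layers and call_function/call_method nodes as non-parameterized operations is a simplification, not the paper's exact notation.

```python
import torch.fx as fx
from torchvision.models import resnet18

model = resnet18()
graph_module = fx.symbolic_trace(model)

components = []  # F = {f1, f2, ..., fL}
for node in graph_module.graph.nodes:
    if node.op == "call_module":
        # Parameterized (or stateful) layers: Conv2d, BatchNorm2d, Linear, ...
        components.append(("layer", graph_module.get_submodule(node.target)))
    elif node.op in ("call_function", "call_method"):
        # Non-parameterized operations: residual additions, flatten, ...
        components.append(("op", node.target))

print(len(components), "components found")
```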
Dependency Modeling
Group-level Pruning
Specifically, for each parameter w with K prunable dimensions indexed by w[k], we introduce a simple regularization term for sparse training, where g is a parameter group consisting of multiple coupled parameters and k indexes a prunable dimension.
After sparse training, we further use a simple relative score to identify and remove unimportant parameters.
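A minimal sketch of group-level sparse training and scoring under these definitions; the exact regularizer weighting and score normalization from the paper are not reproduced here, so the plain L2 penalty and the max-normalized relative score below are illustrative stand-ins.

```python
import torch

def group_importance(group):
    """Aggregate per-dimension importance over a coupled parameter group.

    group: list of 2-D tensors whose first axis indexes the K prunable
    dimensions (e.g. output channels, with remaining axes flattened).
    Returns I[k] = sum over parameters w in the group of ||w[k]||_2^2.
    """
    return sum(w.pow(2).sum(dim=1) for w in group)

def sparse_regularizer(group, strength=1e-4):
    # A plain L2 penalty on every prunable slice in the group, added to the
    # training loss; the paper's per-dimension weighting is omitted here.
    return strength * sum(w.pow(2).sum() for w in group)

def relative_score(group):
    # Normalize importance within the group so that scores are comparable
    # across groups of different sizes (an illustrative choice).
    imp = group_importance(group)
    return imp / imp.max().clamp(min=1e-12)

# Usage: two coupled parameters sharing K = 8 prunable dimensions.
g = [torch.randn(8, 27), torch.randn(8, 16)]
scores = relative_score(g)
prune_idx = torch.argsort(scores)[:2]  # remove the 2 least important dims
```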