Teachers Do More Than Teach: Compressing Image-to-Image Models
Qing Jin1*  Jian Ren2  Oliver J. Woodford*  Geng Yuan1  Jiazhuo Wang2  Yanzhi Wang1  Sergey Tulyakov2
1Northeastern University, USA   2Snap Inc.
*Work done while at Snap Inc.
arXiv:2103.03467v1 [cs.CV] 5 Mar 2021
Abstract
Generative Adversarial Networks (GANs) have achieved huge success in generating high-fidelity images; however, they suffer from low efficiency due to tremendous computational cost and bulky memory usage. Recent efforts on compressing GANs show noticeable progress in obtaining smaller generators, but at the cost of sacrificing image quality or involving a time-consuming searching process.
In this work, we aim to address these issues by introducing a teacher network that provides a search space in which efficient network architectures can be found, in addition to performing knowledge distillation.
First, we revisit the search space of generative models, introducing an inception-based residual block into generators.
Second, to achieve the target computation cost, we propose a one-step pruning algorithm that searches a student architecture from the teacher model and substantially reduces the searching cost. It requires no ℓ1 sparsity regularization and its associated hyper-parameters, simplifying the training procedure.
Finally, we propose to distill knowledge through maximizing feature similarity between teacher and student via an index named Global Kernel Alignment (GKA).
Our compressed networks achieve similar or even better image fidelity (FID, mIoU) than the original models with much-reduced computational cost, e.g., MACs. Code will be released at https://github.com/snap-research/CAT.
1. Introduction
Generative adversarial networks (GANs), which synthesize images by adversarial training [21], have witnessed tremendous progress in generating high-quality, high-resolution, and photo-realistic images and videos [5, 33, 68]. In the conditional setting [54], the generation process is controlled via additional input signals, such as segmentation information [8, 58, 60, 71, 72], class labels [83], and sketches [29, 85].
These techniques have seen applications in commercial image editing tools.
However, due to their massive computation complexity and bulky size, applying generative models at scale is less practical, especially on resource-constrained platforms where a low memory footprint is required.
To accelerate inference and save storage space for huge models without sacrificing performance, previous works propose to compress models with techniques including weight pruning [24], channel slimming [43, 44], layer skipping [4, 73], patterned or block pruning [17, 35, 40, 42, 49, 50, 51, 52, 56, 57, 82, 84], and network quantization [12, 18, 30, 31, 32, 38, 75].
Specifically, these studies elaborate on compressing discriminative models for image classification, detection, or segmentation tasks.
The problem of compressing generative models, on the other hand, is less investigated, even though typical generators are bulky in memory usage and inefficient during inference. Up till now, only a handful of attempts exist [20, 36, 64, 70], all of which degrade the quality of synthetic images compared to the original model (Fig. 1). The existing compression method [36] obtains an efficient student model by employing two additional networks, a teacher and a supernet, where the former is used for knowledge distillation and the latter for architecture search.
1. We introduce a new network design that can be applied to both encoder-decoder architectures such as Pix2pix [29] and decoder-style networks such as GauGAN [58]. It serves as both the teacher network design and the architecture search space for the student.
2. We directly prune the trained teacher network using an efficient, one-step technique that removes certain channels in its generators to achieve a target computation budget, e.g., the number of Multiply-Accumulate Operations (MACs).
Furthermore, our pruning method only involves one hyperparameter, making its application straightforward.
3. We introduce a knowledge distillation technique based on the similarity between teacher and student models’ feature spaces, which we call global kernel alignment (GKA).
GKA directly forces feature representations from the two models to be similar, and avoids extra learnable layers [36] to match the different dimensions of teacher and student feature spaces, which could otherwise lead to information leakage.
We name our method CAT, as we show the teacher model can and should do Compression And Teaching (distillation) jointly, which we find beneficial for finding generative networks with smaller MACs while using much lower computational resources than prior work. Although existing methods can compress the original models (e.g., CycleGAN [85]) to relatively small MACs, they all sacrifice performance. Applying these methods directly to generative models can lead to compressed models that perform worse than their original counterparts.
For example, Shu et al. [64] employ an evolutionary algorithm [59], Fu et al. [20] adopt differentiable network design [39], and Li et al. [36] train a supernet with a random sampling technique [6, 23, 79, 80] to select the optimal architecture.
The common key drawback of these methods is the slow searching process.
In contrast, directly pruning on a pre-trained model is much faster.
Following previous methods of network slimming [43, 44], Wang et al. [70] apply ℓ1 regularization to generative models for channel pruning.
However, they report performance degradation compared to the original network.
Besides, these pruning methods require tuning additional hyper-parameters for the ℓ1 regularization that encourages channel-wise sparsity [43, 44], and even more hyper-parameters to decide the number of channels to be pruned [53], making the process tedious. Recently, the lottery ticket hypothesis [19] has also been investigated for GANs [2], but the performance is not satisfactory.
Knowledge distillation [26] is a technique to transfer knowledge from a larger, teacher network to a smaller, student network, and has been used for model compression in various computer vision tasks [9, 10, 45, 48, 76].
A recent survey [22] categorizes knowledge distillation as response-based, feature-based, or relation-based.
Most GAN compression methods [1, 10, 20] use response-based distillation, enforcing the synthesized images from the teacher and student networks to be the same.
Li et al. [36] apply feature-based distillation by introducing extra layers to match feature sizes between the teacher and student, and minimizing the differences of these embeddings using a mean squared error (MSE) loss.
However, this has the potential problem that some information can be stored in those extra layers, without being passed on to the student.
To this end, we adapt the inception module, widely used in discriminative models [53, 65, 87], to image generators and propose the inception-based residual block (IncResBlock). A conventional residual block in generators only contains convolution layers with a single kernel size (e.g., 3 × 3), while in the IncResBlock, as shown in Fig. 2, we introduce convolution layers with different kernel sizes, including 1 × 1, 3 × 3, and 5 × 5. Additionally, we incorporate depth-wise blocks [27] into the IncResBlock, as depth-wise convolution layers typically require less computation without sacrificing performance and are particularly suitable for models deployed on mobile devices [62]. To achieve a similar total computation cost, we set the number of output channels for the first convolution layer of each operation to that of the original residual block divided by six, the number of different operations in the IncResBlock.
More details are illustrated in the supplementary materials.
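To make the block design concrete, the following is a minimal PyTorch sketch of an IncResBlock under the assumptions stated above: six parallel branches (conventional and depth-wise convolutions with kernel sizes 1, 3, and 5), each branch's first layer receiving one sixth of the block's channels, with the first normalization layer placed after that first convolution. The exact layer ordering follows our reading of Fig. 2 and may differ from the released implementation.

```python
import torch
import torch.nn as nn

class IncResBlock(nn.Module):
    """Inception-based residual block (illustrative sketch).

    Six parallel branches: conventional k x k convolutions and depth-wise
    separable convolutions, for k in {1, 3, 5}. Each branch's first layer
    gets channels // 6 output channels so the total cost roughly matches
    the original residual block.
    """

    def __init__(self, channels, norm=nn.InstanceNorm2d,
                 last_norm=True, residual=True):
        super().__init__()
        mid = channels // 6  # six operations share the channel budget
        self.branches = nn.ModuleList()
        for k in (1, 3, 5):
            p = k // 2
            # Conventional branch: k x k conv -> norm -> ReLU -> k x k conv.
            self.branches.append(nn.Sequential(
                nn.Conv2d(channels, mid, k, padding=p),
                norm(mid, affine=True), nn.ReLU(inplace=True),
                nn.Conv2d(mid, channels, k, padding=p)))
            # Depth-wise branch: 1x1 -> norm -> ReLU -> depth-wise k x k -> 1x1.
            self.branches.append(nn.Sequential(
                nn.Conv2d(channels, mid, 1),
                norm(mid, affine=True), nn.ReLU(inplace=True),
                nn.Conv2d(mid, mid, k, padding=p, groups=mid),
                nn.Conv2d(mid, channels, 1)))
        # Optional normalization after summing the six branches (see Fig. 2).
        self.last_norm = norm(channels, affine=True) if last_norm else None
        self.residual = residual

    def forward(self, x):
        out = sum(branch(x) for branch in self.branches)
        if self.last_norm is not None:
            out = self.last_norm(out)
        return out + x if self.residual else out
```

Pruning then operates on the affine weights of the normalization layers inside each branch, so an entire operation can disappear from a block when all of its channels fall below the threshold.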
3.2. Search from Teacher Generator via Pruning
With the teacher network introduced, we search for a compressed student network within it.
Our searching algorithm includes two parts.
The first decides a threshold based on the given computational budget; the second prunes channels whose scales fall below that threshold.
Automatic threshold searching. Following existing efforts [43, 44], we prune channels based on the magnitudes of the scaling factors in normalization layers, such as Batch Normalization (BN) [28] and Instance Normalization (IN) [69].
Figure 2: The IncResBlock includes three conventional convolution blocks and three depth-wise convolution blocks (dashed border), both with kernel sizes of 1, 3, and 5. A normalization layer inserted after summing the features from the six blocks, and the residual connection, are both optional.
Unless otherwise stated, both are applied by default.
We revisit the search space of generation models and introduce inception-based residual blocks (Sec. 3.1). The teacher model is built upon the proposed block design and can serve two purposes.
First, we show that the teacher model can be viewed as a large search space that enables one-shot neural architecture search without training an extra supernet.
Second, by maximizing the similarity between intermediate features of the teacher and student networks directly, where the features of the two networks contain different numbers of channels, we can effectively transfer knowledge from teacher to student (Sec. 3.3).
The optimization of a supernet, however, leads to extra training cost.
However, as we already have a teacher network at hand, searching for an efficient student within the teacher model should be more straightforward, as long as the teacher network contains a large search space. In this way, the teacher network can both perform knowledge distillation and provide the search space.
Therefore, the goal of obtaining a good supernet can be recast as designing a teacher generator that can synthesize high-fidelity images and that itself contains a reasonable search space. With this goal in mind, we design a new architecture for image generation tasks so that a pre-trained teacher generator with such an architecture can serve as a large search space.
We aim to search for a smaller student network that can achieve similar or even better image fidelity than the teacher at a much-reduced computational cost.
We find the scale threshold by binary search on the scaling factors of normalization layers from the pre-trained teacher model.
Specifically, we temporarily prune all channels with a scaling factor magnitude smaller than the threshold and measure the computational cost of the pruned model.
If it is smaller than the budget, the model is pruned too much and we search in the lower interval to get a smaller threshold; otherwise, we search in the upper interval to get a larger value.
During this process, we also keep the number of output channels for convolution layers outside the IncResBlock larger than a pre-defined value to avoid an invalid model.
Details of the algorithm are illustrated in Algorithm 1.
Channel pruning.
With the threshold decided, we perform network searching via pruning.
Given an IncResBlock, it is possible to change both the number of channels in each layer and the set of operations, such that, e.g., one IncResBlock may only include layers with kernel sizes 1 × 1 and 3 × 3.
Similar to Mei et al. [53], we prune channels of the normalization layers together with the corresponding convolution layers.
Specifically, we prune the first normalization layers for each operation in the IncResBlock, namely the ones after the first k × k convolution layers for conventional operations and the ones after the first 1 × 1 convolution layers for depth-wise operations.
Algorithm 1 Searching via One-Step Pruning.
Require: computational budget T_b; teacher model G_T; scaling factors γ_i^(l) (used for pruning) of the i-th channel in normalization layers N^(l) ∈ G_T; minimum number of output channels c_lb for convolution layers (outside the IncResBlock).
1:  Initialize scale lower bound: γ_lo ← min_{i,l} |γ_i^(l)|
2:  Initialize scale upper bound: γ_hi ← max_{i,l} |γ_i^(l)|
3:  while γ_lo < γ_hi do
4:      γ_th ← (γ_lo + γ_hi) / 2
5:      Prune channels satisfying |γ_i^(l)| < γ_th on G_T, while keeping c_lb, to get G_S
6:      T ← computational cost of G_S
7:      if T > T_b then
8:          γ_lo ← γ_th
9:      else
10:         γ_hi ← γ_th
11:     end if
12: end while
Discussion. Our searching algorithm differs from previous works on compressing generative models in three respects. First, the searching cost is significantly reduced, without introducing an extra network. Second, the scaling factors of the normalization layers in the pre-trained teacher network are sufficient for pruning; therefore, the ℓ1 weight regularization used for iterative pruning [53, 70] is not necessary for generation tasks, and removing it eases the searching process by eliminating a set of hyper-parameters that we find hard to tune in practice. Third, we have more flexibility: the teacher network can be compressed to several different architectures, so we can find a student network that satisfies an arbitrary type of computational cost, e.g., MACs, under any predefined budget, directly during the search.
3.3. Distillation from Teacher Generator
After obtaining a student network architecture, we train it from scratch, leveraging the teacher model for knowledge distillation.
In particular, we transfer knowledge between the two networks’ feature spaces, since this has been shown [36] to achieve better performance than reconstructing images synthesized by the teacher [20].
With different numbers of channels between teacher and student layers, Li et al. [36] introduce auxiliary, learnable layers that project the student features into the same dimensional space as the teacher's, as shown in Fig. 3.
Whilst equalizing the number of channels between the two networks, these layers can also impact the efficacy of distillation, since some information can be stored in these extra layers.
To avoid information loss, we propose to encourage similarity between the two feature spaces directly.
3.3.1 Similarity-based Knowledge Distillation
We develop our distillation method based on centered kernel alignment (CKA) [14, 15], a similarity index between two matrices $X \in \mathbb{R}^{n \times p_1}$ and $Y \in \mathbb{R}^{n \times p_2}$, where after centering the kernel alignment (KA) is calculated, defined as¹

$$\mathrm{KA}(X, Y) = \frac{\|Y^\top X\|_F^2}{\|X^\top X\|_F\,\|Y^\top Y\|_F}. \tag{1}$$

It is invariant to an orthogonal transform and isotropic scaling of the rows, but is sensitive to an invertible linear transform. Importantly, $p_1$ and $p_2$ can differ.
Kornblith et al. [34] use this index to compute the similarity between learned feature representations of varying lengths ($p_1 = hwc_1$ and $p_2 = hwc_2$, where $h$, $w$, and $c_\cdot$ are the height, width, and number of channels of the respective layer tensors; $n$ is the batch size).
¹The identity $\|Y^\top X\|_F^2 = \langle \mathrm{vec}(XX^\top), \mathrm{vec}(YY^\top) \rangle$ is used to achieve a computational complexity of $O(n^2 hw \max(c_1, c_2))$ [34].
Figure 3 (right): Our proposed GKA maximizes similarity between features directly.
Global-KA. To compare the similarity between teacher and student features, we introduce a similar metric called Global-KA (GKA). For the same two tensors $X$ and $Y$ defined in Eqn. (1), GKA is defined as

$$\mathrm{GKA}(X, Y) = \mathrm{KA}(\rho(X), \rho(Y)), \tag{2}$$

where $\rho : \mathbb{R}^{n \times hwc} \to \mathbb{R}^{nhw \times c}$ is a simple reshape operation on the input matrix.
Unlike CKA, which sums similarity between two batches of features over channels and spatial pixels and thus describes batch-wise similarity, GKA sums feature similarity over channels, characterizing both batch-wise and spatial-wise similarity. The computational complexity of this operation is $O(nhw \max(c_1, c_2)^2)$, which is lower than that of CKA when the batch size is much larger than the number of channels.
To perform distillation, we maximize the similarity between features of teacher and student networks by maximizing GKA.
Note that different from CKA, for GKA we do not center the two tensors X and Y .
However, we find that centering does not introduce much difference on the final performance.
3.3.2 Distillation Loss
We conduct distillation on the feature space.
Let $S_{KD}$ denote the set of layers for performing knowledge distillation, and let $X_t^{(l)}$ and $X_s^{(l)}$ denote the feature tensors of layer $l$ from the teacher and student networks, respectively. We minimize the distillation loss $\mathcal{L}_{dist}$ as follows:

$$\mathcal{L}_{dist} = -\sum_{l \in S_{KD}} \mathrm{GKA}\big(X_t^{(l)}, X_s^{(l)}\big), \tag{3}$$

where the minus sign is introduced as we intend to maximize feature similarity between the student and teacher models.
3.4. Learning
We train teacher networks using the original loss functions, which include an adversarial loss $\mathcal{L}_{adv}$:

$$\mathcal{L}_{adv} = \mathbb{E}_{x,y}\left[\log D(x, y)\right] + \mathbb{E}_{x}\left[\log(1 - D(x, G(x)))\right], \tag{4}$$

where $x$ and $y$ denote the input and real images, and $D$ and $G$ denote the discriminator and generator, respectively.

Full objective for student. For training the student generator of CycleGAN, we adopt the setting from [36]: we use data generated by the teacher network to form paired data and train the student in the same way as Pix2pix, with a reconstruction loss $\mathcal{L}_{recon}$. Therefore, for CycleGAN and Pix2pix, the overall loss function for student training is

$$\mathcal{L}_T = \lambda_{adv}\mathcal{L}_{adv} + \lambda_{recon}\mathcal{L}_{recon} + \lambda_{dist}\mathcal{L}_{dist}. \tag{5}$$

For training GauGAN, there is an additional feature-matching loss $\mathcal{L}_{fm}$ [72], and the overall loss function is

$$\mathcal{L}_T = \lambda_{adv}\mathcal{L}_{adv} + \lambda_{recon}\mathcal{L}_{recon} + \lambda_{fm}\mathcal{L}_{fm} + \lambda_{dist}\mathcal{L}_{dist}. \tag{6}$$

Following [36], we inherit the teacher discriminator by using the same architecture and the pre-trained weights, and fine-tune it with the student generator during student training.
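As a small illustration, the student objective of Eqns. (5) and (6) can be assembled as a weighted sum; the λ values below are placeholders, not the paper's tuned settings.

```python
def student_objective(l_adv, l_recon, l_dist, l_fm=None,
                      lam_adv=1.0, lam_recon=10.0, lam_fm=10.0, lam_dist=1.0):
    """Student loss of Eqn. (5); Eqn. (6) when a feature-matching term is
    supplied (GauGAN). The lambda defaults are placeholders only."""
    total = lam_adv * l_adv + lam_recon * l_recon + lam_dist * l_dist
    if l_fm is not None:  # GauGAN adds a feature-matching loss (Eqn. 6)
        total = total + lam_fm * l_fm
    return total
```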
[Table residue: the numeric columns were lost in extraction; only the method column survived (Original [85, 36], Shu et al. [64], AutoGAN Distiller [20], GAN Slimming [70], GAN Lottery [2], Li et al. [36], CAT (Ours), and CAT-A/CAT-B (Ours) for GauGAN).]
Table 2: Further quantitative comparison on KID between different compression techniques for Image-to-Image models, where lower KID indicates better performance.
For example, on CycleGAN, our method achieves a large compression ratio, reducing MACs from 56.8B to 2.55B (22.3×) or 2.59B (21.9×), while at the same time the model achieves better performance, with FID reduced from 61.53 to 60.18 for Horse→Zebra and from 148.8 to 142.7 for Zebra→Horse.
For the Cityscapes dataset with the Pix2pix model, we compress the model from 56.8B to 5.57B MACs, which is 10.2× smaller, while increasing the mIoU from 42.06 to 42.53. Again, for Pix2pix on the Map→Aerial photo dataset, MACs are reduced from 56.8B to 4.59B by our method, a compression ratio of 12.4×, while the FID improves from 47.76 to 44.94.
We choose 5.6B as it is similar to our compressed Pix2pix model on Cityscapes.
We find that with 30B MACs, which is 9.4× smaller than GauGAN, the mIoU of our model is better than the original's, increasing from 62.18 to 62.35. We further compress the model to less than 5.6B MACs with a compression ratio of 50.9×, where the mIoU drops to 54.71; however, this is still much better than that of the Pix2pix model.
Figure 4: Qualitative results on the Cityscapes dataset. Images generated by our compressed model (CAT-A, third row) have higher mIoU and lower FID than the original GauGAN model (fifth row), even with much-reduced computational cost. For our CAT-B model (fourth row, compressed 50.9× relative to GauGAN), although it has lower mIoU, it can synthesize higher-fidelity images (lower FID) than GauGAN.
For Horse→Zebra on CycleGAN, our method can synthesize better zebra images for challenging input horse images on which CycleGAN fails. The examples shown in Figs. 4 & 5 demonstrate that our compression technique is an effective method for saving the computational cost of generative models.
Besides, the compressed models can surpass the original models, even though they require much reduced computational cost and, thus, are more efficient during inference.
These results indicate significant redundancy in the original large generators, and it is worth further studying the extreme of these generative models in terms of performance-efficiency trade-off.
Analysis of searching cost. Here we analyze the cost of searching for a student network.
Our method can search for an architecture under a pre-defined computational budget with a much-reduced searching cost compared to the previous state-of-the-art compression method [36].
Tab. 3 provides the searching cost of the two methods on various datasets and models.
As can be seen, our method is at least 10,000× faster at searching.
The searching time for the previous method [36] is estimated by only including the time for training a supernet, which is designed for architecture search. We estimate it as 20 hours with 1 GPU for the CycleGAN and Pix2pix models and 40 hours with 8 GPUs for the GauGAN model, both of which are much shorter than required in practice and thus serve as a lower bound. For example, for Cityscapes with the Pix2pix model, the supernet includes more than 5,000 possible architectures, and each requires around 3 minutes with 1 GPU for evaluation, resulting in several days of architecture search.
Overall, these results demonstrate that our method is a sound technique for compressing image-to-image models and provides a state-of-the-art trade-off between computation complexity and image generation performance.
Our compressed model (CAT-A) achieves better quality (higher mIoU and lower FID) than GauGAN.
For example, for the leftmost image in Fig. 4, the back of the car synthesized by CAT-A is clearer than GauGAN's, and for the rightmost image, CAT-A generates less blurry human figures than GauGAN.
Table 3: Architecture search cost, measured in seconds of GPU computation, for our method vs. Li et al. [36], across different models.
[Table 3 residue: the numeric search costs were lost in extraction; only the model/dataset/method columns survived (CycleGAN: Horse→Zebra, Zebra→Horse; Pix2pix: Cityscapes, Map→Aerial photo; GauGAN: Cityscapes with CAT-A/CAT-B; each comparing Li et al. [36] against CAT (Ours)).]
5. Conclusion
In this work, we study the compression of generative models for image-to-image tasks. We show the problem can be tackled by using a powerful teacher model, which is not restricted to teaching a student through knowledge distillation, but can also serve as a supernet to search for efficient architectures (for the student) under pre-defined computational budgets.
We also introduce a similarity-based knowledge distillation technique to train the student network, where feature similarity between student and teacher is measured directly by the proposed GKA index.
With our method, we can obtain networks that have similar or even better performance than original Pix2pix, CycleGAN, and GauGAN models on various datasets.
More importantly, our networks have much reduced MACs than their original counterparts.
Our work demonstrates that there remains redundancy in existing generative models, and that we can achieve improved performance, e.g., synthesizing images with better fidelity, at a much-reduced computational cost.
It is worth further investigating the ability of generative models to synthesize images with high quality under an extremely constrained computational budget, which we leave for future study.
Figure 5: Compared with the original networks (Pix2pix and CycleGAN), our models have much-reduced MACs and can generate images with higher fidelity (lower FID) by synthesizing textures that are not well handled by the original large models.
Appendix S1. Implementation Details
In this section, we provide more implementation details of our work.
Training details. For CycleGAN and Pix2pix models, we use a batch size of 32 for the teacher and a batch size of 80 for the student, while for GauGAN, the batch size is set to 16 for both. More detailed training hyper-parameters are summarized in Table S1.
For the layers used for knowledge distillation between teacher and student networks, we follow the same strategy as Li et al. [36].
Specifically, for Pix2pix and CycleGAN models, the 9 residual blocks are divided into 3 groups, each with three consecutive layers, and knowledge is distilled upon the four activations from each end layer of these three groups.
For GauGAN models, knowledge distillation is applied on the output activations of 3 of the 7 SPADE blocks in total, namely the first, third, and fifth ones. We find that instance normalization [69] without tracking running statistics is critical for the Horse→Zebra dataset to achieve good performance on the student model, while for the other datasets, batch normalization [28] with tracked running statistics is better. Normalization layers without tracked running statistics introduce extra computation cost, and we take this into account in our calculation of MACs during pruning.
Moreover, for GauGAN, we use synchronized batch normalization as suggested by previous work [58, 67], and remove the spectral norm [55], as we find it does not have much impact on model performance. We also find it is sufficient for each SPADE residual block to keep only the first SPADE module in the main body, while replacing the second one, as well as the one in the shortcut, with a synchronized batch normalization layer.
Besides, we use learnable weights for the second synchronized batch normalization block for the purpose of pruning.
These weights do not introduce extra computation cost, as the running statistics are estimated from training data and not recalculated during inference, enabling fusing normalization layers into the convolution layers.
Further, we replace the three convolution layers in the SPADE module with our proposed inception-based residual block (IncResBlock), with normalization layers included for pruning. Note that the optional last normalization layer and residual connection are not applied in the Inception ResBlocks used in IncSPADE and IncSPADE ResBlk.
To prune the input channels of each model, we add an extra normalization layer (synchronized batch normalization) with learnable weights after the first fully-connected layer, and prune its channels together with the other normalization layers using our pruning algorithm described in Sec. 3.2 of the main paper. During pruning, we keep the ratio of input channels between different layers as in the original model, and the lower bound for the first layer (which has the largest number of channels) is determined by that of the last layer multiplied by the channel ratio, so that all channels are above the bound and the channel ratio is unchanged.

Appendix S2. Ablation Study on Knowledge Distillation
Here we show the ablation analysis for knowledge distillation methods. We use our searching method to find a student architecture for the Pix2pix task on the Cityscapes dataset, and compare student training without knowledge distillation, with MSE distillation as in [36], and with our proposed similarity-based distillation. Results are shown in Tab. S2, where w/o Distillation denotes training the student without distillation, and w/ MSE; Loss Weight 0.5 and w/ MSE; Loss Weight 1.0 denote MSE distillation with loss weights 0.5 and 1.0, respectively. We find that distillation indeed improves performance, and our distillation method, which employs GKA to maximize feature similarity, transfers knowledge from teacher to student via intermediate features better than MSE.

Table S2: Analysis of knowledge distillation methods on the Cityscapes dataset with the Pix2pix setting. Our method (GKA) achieves the best result.

Method                     mIoU↑
w/o Distillation           39.39
w/ MSE; Loss Weight 0.5    39.83
w/ MSE; Loss Weight 1.0    39.76
Ours (GKA)                 42.53
Appendix S3. More Qualitative Results
We show more qualitative results for CycleGAN on Horse→Zebra and Zebra→Horse, Pix2pix on Map→Aerial photo, as well as GauGAN on Cityscapes in Figs. S2–S5.
Figure S2: More results on the Horse→Zebra dataset. Compared with the original CycleGAN, our model has much-reduced MACs and can generate images with higher fidelity (lower FID).
Figure S5: More qualitative results on the Cityscapes dataset. Images generated by our compressed model (CAT-A, third row) have higher mIoU and lower FID than the original GauGAN model (fifth row), even with much-reduced computational cost. For our CAT-B model (fourth row, compressed 50.9× relative to GauGAN), although it has lower mIoU, it can synthesize higher-fidelity images (lower FID) than GauGAN.