TransKD: Transformer Knowledge Distillation for Efficient Semantic
Segmentation
- URL: http://arxiv.org/abs/2202.13393v3
- Date: Sun, 24 Dec 2023 07:59:29 GMT
- Title: TransKD: Transformer Knowledge Distillation for Efficient Semantic
Segmentation
- Authors: Ruiping Liu, Kailun Yang, Alina Roitberg, Jiaming Zhang, Kunyu Peng,
Huayao Liu, Yaonan Wang, Rainer Stiefelhagen
- Abstract summary: Transformer-based Knowledge Distillation (TransKD) framework learns compact student transformers by distilling both feature maps and patch embeddings of large teacher transformers.
Experiments on Cityscapes, ACDC, NYUv2, and Pascal VOC2012 datasets show that TransKD outperforms state-of-the-art distillation frameworks and rivals the time-consuming pre-training method.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semantic segmentation benchmarks in the realm of autonomous driving are
dominated by large pre-trained transformers, yet their widespread adoption is
impeded by substantial computational costs and prolonged training durations. To
lift this constraint, we look at efficient semantic segmentation from a
perspective of comprehensive knowledge distillation and consider bridging the
gap between multi-source knowledge extractions and transformer-specific patch
embeddings. We put forward the Transformer-based Knowledge Distillation
(TransKD) framework which learns compact student transformers by distilling
both feature maps and patch embeddings of large teacher transformers, bypassing
the long pre-training process and reducing the FLOPs by >85.0%. Specifically,
we propose two fundamental and two optimization modules: (1) Cross Selective
Fusion (CSF) enables knowledge transfer between cross-stage features via
channel attention and feature map distillation within hierarchical
transformers; (2) Patch Embedding Alignment (PEA) performs dimensional
transformation within the patchifying process to facilitate the patch embedding
distillation; (3) Global-Local Context Mixer (GL-Mixer) extracts both global
and local information of a representative embedding; (4) Embedding Assistant
(EA) acts as an embedding method to seamlessly bridge teacher and student
models with the teacher's number of channels. Experiments on Cityscapes, ACDC,
NYUv2, and Pascal VOC2012 datasets show that TransKD outperforms
state-of-the-art distillation frameworks and rivals the time-consuming
pre-training method. The source code is publicly available at
https://github.com/RuipingL/TransKD.
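The abstract's Patch Embedding Alignment (PEA) module performs a dimensional transformation so that student patch embeddings can be distilled against the teacher's. A minimal sketch of that idea, assuming a simple linear projection from the student's channel dimension up to the teacher's followed by a mean-squared-error penalty (an illustrative simplification, not the paper's actual implementation; see the repository above for the real code):

```python
# Hypothetical sketch of patch-embedding distillation in the spirit of PEA:
# project each student patch embedding (dim C_s) to the teacher's dimension
# (dim C_t) with a learned linear map, then penalize the mean squared error
# between the projected student embeddings and the teacher embeddings.

def project(embedding, weight):
    """Linear map: a (C_s,) vector times a (C_s x C_t) weight -> (C_t,) vector."""
    c_t = len(weight[0])
    return [sum(embedding[i] * weight[i][j] for i in range(len(embedding)))
            for j in range(c_t)]

def pea_loss(student_patches, teacher_patches, weight):
    """MSE between projected student patch embeddings and teacher embeddings."""
    total, count = 0.0, 0
    for s, t in zip(student_patches, teacher_patches):
        p = project(s, weight)
        total += sum((pj - tj) ** 2 for pj, tj in zip(p, t))
        count += len(t)
    return total / count

# Toy example: 2 patches, student dim C_s = 2, teacher dim C_t = 3.
student = [[1.0, 0.0], [0.0, 1.0]]
teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]  # maps student dims onto the first two teacher dims

loss = pea_loss(student, teacher, weight)
```

In practice the projection weight would be trained jointly with the student so that its patch embeddings, once lifted to the teacher's dimension, match the teacher's; the toy weight here happens to align the two exactly, giving zero loss.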
Related papers
- CT-MVSNet: Efficient Multi-View Stereo with Cross-scale Transformer [8.962657021133925]
Cross-scale transformer (CT) processes feature representations at different stages without additional computation.
We introduce an adaptive matching-aware transformer (AMT) that employs different interactive attention combinations at multiple scales.
We also present a dual-feature guided aggregation (DFGA) that embeds the coarse global semantic information into the finer cost volume construction.
arXiv Detail & Related papers (2023-12-14T01:33:18Z) - COMEDIAN: Self-Supervised Learning and Knowledge Distillation for Action
Spotting using Transformers [1.894259749028573]
We present COMEDIAN, a novel pipeline to initialize transformers for action spotting.
Our results highlight several advantages of our pretraining pipeline, including improved performance and faster convergence compared to non-pretrained models.
arXiv Detail & Related papers (2023-09-03T20:50:53Z) - kTrans: Knowledge-Aware Transformer for Binary Code Embedding [15.361622199889263]
We propose a novel Transformer-based approach, namely kTrans, to generate knowledge-aware binary code embedding.
We inspect the generated embeddings with outlier detection and visualization, and apply kTrans to three downstream tasks: Binary Code Similarity Detection (BCSD), Function Type Recovery (FTR), and Indirect Call Recognition (ICR).
Evaluation results show that kTrans can generate high-quality binary code embeddings, and outperforms state-of-the-art (SOTA) approaches on downstream tasks by 5.2%, 6.8%, and 12.6% respectively.
arXiv Detail & Related papers (2023-08-24T09:07:11Z) - PriorLane: A Prior Knowledge Enhanced Lane Detection Approach Based on
Transformer [10.55399679259444]
PriorLane enhances the segmentation performance of a pure vision transformer.
PriorLane utilizes an encoder-only transformer to fuse features extracted by a pre-trained segmentation model with prior knowledge embeddings.
Experiments on our Zjlab dataset show that PriorLane outperforms SOTA lane detection methods by 2.82% mIoU.
arXiv Detail & Related papers (2022-09-15T01:48:08Z) - Cross-Architecture Knowledge Distillation [32.689574589575244]
It is natural to distill complementary knowledge from a Transformer to a convolutional neural network (CNN).
To deal with this problem, a novel cross-architecture knowledge distillation method is proposed.
The proposed method outperforms 14 state-of-the-art methods on both small-scale and large-scale datasets.
arXiv Detail & Related papers (2022-07-12T02:50:48Z) - MISSU: 3D Medical Image Segmentation via Self-distilling TransUNet [55.16833099336073]
We propose to self-distill a Transformer-based UNet for medical image segmentation.
It simultaneously learns global semantic information and local spatial-detailed features.
Our MISSU achieves the best performance over previous state-of-the-art methods.
arXiv Detail & Related papers (2022-06-02T07:38:53Z) - DearKD: Data-Efficient Early Knowledge Distillation for Vision
Transformers [91.6129538027725]
We propose an early knowledge distillation framework, which is termed as DearKD, to improve the data efficiency required by transformers.
Our DearKD is a two-stage framework that first distills the inductive biases from the early intermediate layers of a CNN and then gives the transformer full play by training without distillation.
arXiv Detail & Related papers (2022-04-27T15:11:04Z) - XAI for Transformers: Better Explanations through Conservative
Propagation [60.67748036747221]
We show that the gradient in a Transformer reflects the function only locally, and thus fails to reliably identify the contribution of input features to the prediction.
Our proposal can be seen as a proper extension of the well-established LRP method to Transformers.
arXiv Detail & Related papers (2022-02-15T10:47:11Z) - Pixel Distillation: A New Knowledge Distillation Scheme for Low-Resolution Image Recognition [124.80263629921498]
We propose Pixel Distillation that extends knowledge distillation into the input level while simultaneously breaking architecture constraints.
Such a scheme can achieve flexible cost control for deployment, as it allows the system to adjust both network architecture and image quality according to the overall requirement of resources.
arXiv Detail & Related papers (2021-12-17T14:31:40Z) - Efficient Vision Transformers via Fine-Grained Manifold Distillation [96.50513363752836]
Vision transformer architectures have shown extraordinary performance on many computer vision tasks.
Although network performance is boosted, transformers often require more computational resources.
We propose to excavate useful information from the teacher transformer through the relationship between images and the divided patches.
arXiv Detail & Related papers (2021-07-03T08:28:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.