TransKD: Transformer Knowledge Distillation for Efficient Semantic
Segmentation
- URL: http://arxiv.org/abs/2202.13393v3
- Date: Sun, 24 Dec 2023 07:59:29 GMT
- Title: TransKD: Transformer Knowledge Distillation for Efficient Semantic
Segmentation
- Authors: Ruiping Liu, Kailun Yang, Alina Roitberg, Jiaming Zhang, Kunyu Peng,
Huayao Liu, Yaonan Wang, Rainer Stiefelhagen
- Abstract summary: Transformer-based Knowledge Distillation (TransKD) framework learns compact student transformers by distilling both feature maps and patch embeddings of large teacher transformers.
Experiments on Cityscapes, ACDC, NYUv2, and Pascal VOC2012 datasets show that TransKD outperforms state-of-the-art distillation frameworks and rivals the time-consuming pre-training method.
- Score: 51.93878604106518
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semantic segmentation benchmarks in the realm of autonomous driving are
dominated by large pre-trained transformers, yet their widespread adoption is
impeded by substantial computational costs and prolonged training durations. To
lift this constraint, we look at efficient semantic segmentation from the
perspective of comprehensive knowledge distillation and consider bridging the
gap between multi-source knowledge extraction and transformer-specific patch
embeddings. We put forward the Transformer-based Knowledge Distillation
(TransKD) framework which learns compact student transformers by distilling
both feature maps and patch embeddings of large teacher transformers, bypassing
the long pre-training process and reducing the FLOPs by >85.0%. Specifically,
we propose two fundamental and two optimization modules: (1) Cross Selective
Fusion (CSF) enables knowledge transfer between cross-stage features via
channel attention and feature map distillation within hierarchical
transformers; (2) Patch Embedding Alignment (PEA) performs dimensional
transformation within the patchifying process to facilitate the patch embedding
distillation; (3) Global-Local Context Mixer (GL-Mixer) extracts both global
and local information of a representative embedding; (4) Embedding Assistant
(EA) acts as an embedding method to seamlessly bridge teacher and student
models with the teacher's number of channels. Experiments on Cityscapes, ACDC,
NYUv2, and Pascal VOC2012 datasets show that TransKD outperforms
state-of-the-art distillation frameworks and rivals the time-consuming
pre-training method. The source code is publicly available at
https://github.com/RuipingL/TransKD.
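As a rough illustration of the two fundamental modules described in the abstract, the PyTorch sketch below pairs a patch-embedding alignment loss (in the spirit of PEA) with a channel-attention feature-map loss (in the spirit of CSF). The tensor shapes, the linear and 1x1-convolution projections, and the loss weighting are assumptions made purely for illustration; the reference implementation in the linked repository is authoritative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchEmbedAlign(nn.Module):
    """Minimal sketch of patch-embedding distillation (PEA-style):
    project student patch embeddings to the teacher's channel width,
    then match them with an MSE loss. Dimensions are illustrative."""

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_tokens, teacher_tokens):
        # student_tokens: (B, N, C_s), teacher_tokens: (B, N, C_t)
        return F.mse_loss(self.proj(student_tokens), teacher_tokens)


class FeatureMapDistill(nn.Module):
    """Minimal sketch of feature-map distillation (CSF-style): a 1x1 conv
    aligns channels, a channel-attention gate reweights the aligned student
    feature before an MSE match against the teacher stage."""

    def __init__(self, student_ch: int, teacher_ch: int):
        super().__init__()
        self.align = nn.Conv2d(student_ch, teacher_ch, kernel_size=1)
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(teacher_ch, teacher_ch, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, student_feat, teacher_feat):
        # student_feat: (B, C_s, H, W), teacher_feat: (B, C_t, H, W)
        aligned = self.align(student_feat)
        gated = aligned * self.attn(aligned)  # channel attention gate
        return F.mse_loss(gated, teacher_feat)


# Illustrative usage: the distillation terms are added to the ordinary
# segmentation loss; the 0.1 weight is an assumed, not published, value.
# task_loss = criterion(student_logits, labels)
# kd_loss = pea(stu_tokens, tea_tokens) + csf(stu_feat, tea_feat)
# loss = task_loss + 0.1 * kd_loss
```

In the full framework these terms would be computed per stage of the hierarchical teacher and student rather than once, but the per-stage wiring is omitted here for brevity.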
Related papers
- CLoCKDistill: Consistent Location-and-Context-aware Knowledge Distillation for DETRs [2.7624021966289605]
This paper proposes Consistent Location-and-Context-aware Knowledge Distillation (CLoCKDistill) for DETR detectors.
We distill the transformer encoder output (i.e., memory) that contains valuable global context and long-range dependencies.
Our method boosts student detector performance by 2.2% to 6.4%.
arXiv Detail & Related papers (2025-02-15T06:02:51Z) - BEExformer: A Fast Inferencing Binarized Transformer with Early Exits [2.7651063843287718]
We introduce the Binarized Early Exit Transformer (BEExformer), the first-ever selective-learning-based transformer integrating Binarization-Aware Training (BAT) with Early Exit (EE).
BAT employs a differentiable second-order approximation to the sign function, enabling gradients that capture both the sign and magnitude of the weights (a hedged sketch of one such approximation appears after this list).
The EE mechanism hinges on the fractional reduction in entropy across intermediate transformer blocks, combined with soft-routing loss estimation.
This accelerates inference by reducing FLOPs by 52.08% and even improves accuracy by 2.89% by resolving the "overthinking" problem inherent in deep networks.
arXiv Detail & Related papers (2024-12-06T17:58:14Z) - Weight Copy and Low-Rank Adaptation for Few-Shot Distillation of Vision Transformers [22.1372572833618]
We propose a novel few-shot feature distillation approach for vision transformers.
We first copy the weights from intermittent layers of existing vision transformers into shallower architectures (students).
Next, we employ an enhanced version of Low-Rank Adaptation (LoRA) to distill knowledge into the student in a few-shot scenario.
arXiv Detail & Related papers (2024-04-14T18:57:38Z) - COMEDIAN: Self-Supervised Learning and Knowledge Distillation for Action
Spotting using Transformers [1.894259749028573]
We present COMEDIAN, a novel pipeline to initialize transformers for action spotting.
Our results highlight several advantages of our pretraining pipeline, including improved performance and faster convergence compared to non-pretrained models.
arXiv Detail & Related papers (2023-09-03T20:50:53Z) - ERNIE-Search: Bridging Cross-Encoder with Dual-Encoder via Self
On-the-fly Distillation for Dense Passage Retrieval [54.54667085792404]
We propose a novel distillation method that significantly advances cross-architecture distillation for dual-encoders.
Our method 1) introduces a self on-the-fly distillation scheme that effectively distills late-interaction models (i.e., ColBERT) into a vanilla dual-encoder, and 2) incorporates a cascade distillation process to further improve performance with a cross-encoder teacher.
arXiv Detail & Related papers (2022-05-18T18:05:13Z) - DearKD: Data-Efficient Early Knowledge Distillation for Vision
Transformers [91.6129538027725]
We propose an early knowledge distillation framework, termed DearKD, to improve the data efficiency required by transformers.
Our DearKD is a two-stage framework that first distills the inductive biases from the early intermediate layers of a CNN and then gives the transformer full play by training without distillation.
arXiv Detail & Related papers (2022-04-27T15:11:04Z) - XAI for Transformers: Better Explanations through Conservative
Propagation [60.67748036747221]
We show that the gradient in a Transformer reflects the function only locally, and thus fails to reliably identify the contribution of input features to the prediction.
Our proposal can be seen as a proper extension of the well-established LRP method to Transformers.
arXiv Detail & Related papers (2022-02-15T10:47:11Z) - Pixel Distillation: A New Knowledge Distillation Scheme for Low-Resolution Image Recognition [124.80263629921498]
We propose Pixel Distillation, which extends knowledge distillation to the input level while simultaneously breaking architecture constraints.
Such a scheme can achieve flexible cost control for deployment, as it allows the system to adjust both network architecture and image quality according to the overall requirement of resources.
arXiv Detail & Related papers (2021-12-17T14:31:40Z) - Efficient Vision Transformers via Fine-Grained Manifold Distillation [96.50513363752836]
Vision transformer architectures have shown extraordinary performance on many computer vision tasks.
Although network performance is boosted, transformers often require more computational resources.
We propose to excavate useful information from the teacher transformer through the relationship between images and the divided patches.
arXiv Detail & Related papers (2021-07-03T08:28:34Z)
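The "differentiable second-order approximation to the sign function" mentioned in the BEExformer entry above can be sketched as follows. The sketch uses a piecewise-quadratic surrogate (as in Bi-Real Net's ApproxSign) purely as an assumed stand-in for such an approximation, not as that paper's exact formulation.

```python
import torch


class ApproxSign(torch.autograd.Function):
    """Sketch of binarization with a differentiable second-order (piecewise
    quadratic) surrogate for sign(): the forward pass uses the hard sign,
    the backward pass uses the surrogate's gradient so that both the sign
    and the magnitude of the input shape the weight update."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Derivative of the piecewise-quadratic approximation of sign on [-1, 1]:
        # 2 - 2|x| for |x| <= 1, and 0 elsewhere.
        grad = torch.where(x.abs() <= 1, 2 - 2 * x.abs(), torch.zeros_like(x))
        return grad_out * grad


def binarize(weight: torch.Tensor) -> torch.Tensor:
    # Scale the binary weights by the per-tensor mean magnitude,
    # a common (assumed) choice in binarization-aware training.
    return weight.abs().mean() * ApproxSign.apply(weight)
```

The forward pass keeps hard binary weights while the backward pass follows the smooth surrogate, which is what lets binarization-aware training proceed with standard gradient-based optimizers.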
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.