TransKD: Transformer Knowledge Distillation for Efficient Semantic Segmentation
- URL: http://arxiv.org/abs/2202.13393v4
- Date: Thu, 5 Sep 2024 00:18:40 GMT
- Title: TransKD: Transformer Knowledge Distillation for Efficient Semantic Segmentation
- Authors: Ruiping Liu, Kailun Yang, Alina Roitberg, Jiaming Zhang, Kunyu Peng, Huayao Liu, Yaonan Wang, Rainer Stiefelhagen
- Abstract summary: Transformer-based Knowledge Distillation (TransKD) framework learns compact student transformers by distilling both feature maps and patch embeddings of large teacher transformers.
Experiments on Cityscapes, ACDC, NYUv2, and Pascal VOC2012 datasets show that TransKD outperforms state-of-the-art distillation frameworks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semantic segmentation benchmarks in the realm of autonomous driving are dominated by large pre-trained transformers, yet their widespread adoption is impeded by substantial computational costs and prolonged training durations. To lift this constraint, we look at efficient semantic segmentation from a perspective of comprehensive knowledge distillation and aim to bridge the gap between multi-source knowledge extractions and transformer-specific patch embeddings. We put forward the Transformer-based Knowledge Distillation (TransKD) framework which learns compact student transformers by distilling both feature maps and patch embeddings of large teacher transformers, bypassing the long pre-training process and reducing the FLOPs by >85.0%. Specifically, we propose two fundamental modules to realize feature map distillation and patch embedding distillation, respectively: (1) Cross Selective Fusion (CSF) enables knowledge transfer between cross-stage features via channel attention and feature map distillation within hierarchical transformers; (2) Patch Embedding Alignment (PEA) performs dimensional transformation within the patchifying process to facilitate the patch embedding distillation. Furthermore, we introduce two optimization modules to enhance the patch embedding distillation from different perspectives: (1) Global-Local Context Mixer (GL-Mixer) extracts both global and local information of a representative embedding; (2) Embedding Assistant (EA) acts as an embedding method to seamlessly bridge teacher and student models with the teacher's number of channels. Experiments on Cityscapes, ACDC, NYUv2, and Pascal VOC2012 datasets show that TransKD outperforms state-of-the-art distillation frameworks and rivals the time-consuming pre-training method. The source code is publicly available at https://github.com/RuipingL/TransKD.
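The Patch Embedding Alignment (PEA) idea above can be illustrated with a minimal NumPy sketch (hypothetical shapes and variable names, not the authors' implementation, which lives in the linked repository): the student's patch embeddings are linearly projected into the teacher's channel dimension so a mean-squared-error distillation loss can be computed between them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: number of patches, student/teacher channel widths.
n_patches, c_student, c_teacher = 196, 64, 128

student_emb = rng.standard_normal((n_patches, c_student))
teacher_emb = rng.standard_normal((n_patches, c_teacher))

# PEA-style dimensional transformation: a learned linear projection
# lifting student patch embeddings into the teacher's channel dimension.
W_align = rng.standard_normal((c_student, c_teacher)) / np.sqrt(c_student)
aligned = student_emb @ W_align  # shape (n_patches, c_teacher)

# Patch-embedding distillation loss: mean squared error between the
# aligned student embeddings and the teacher embeddings.
pea_loss = np.mean((aligned - teacher_emb) ** 2)
print(aligned.shape, float(pea_loss))
```

In an actual training loop `W_align` would be a trainable layer and `pea_loss` would be added to the segmentation objective; the sketch only shows the shape bookkeeping that makes teacher and student embeddings comparable.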
Related papers
- Weight Copy and Low-Rank Adaptation for Few-Shot Distillation of Vision Transformers [22.1372572833618]
We propose a novel few-shot feature distillation approach for vision transformers.
We first copy the weights from intermittent layers of existing vision transformers into shallower architectures (students).
Next, we employ an enhanced version of Low-Rank Adaptation (LoRA) to distill knowledge into the student in a few-shot scenario.
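The low-rank adaptation step can be sketched as follows (a minimal NumPy illustration with made-up dimensions, not the paper's enhanced LoRA variant): a frozen weight matrix W is augmented by a low-rank product B·A, so only r·(d_in + d_out) parameters are trained.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 32, 64, 4  # hypothetical dimensions, rank r << min(d_in, d_out)
alpha = 8.0                 # LoRA scaling factor

W = rng.standard_normal((d_out, d_in))  # frozen pretrained weight
A = rng.standard_normal((r, d_in))      # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero-initialized

# Effective weight: frozen W plus the scaled low-rank update.
W_eff = W + (alpha / r) * (B @ A)

x = rng.standard_normal(d_in)
y = W_eff @ x
# With B zero-initialized, the adapted layer starts identical to the frozen one.
assert np.allclose(y, W @ x)
print(y.shape)
```

During distillation only A and B would receive gradients, which is what makes the few-shot setting tractable.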
arXiv Detail & Related papers (2024-04-14T18:57:38Z)
- COMEDIAN: Self-Supervised Learning and Knowledge Distillation for Action Spotting using Transformers [1.894259749028573]
We present COMEDIAN, a novel pipeline to initialize transformers for action spotting.
Our results highlight several advantages of our pretraining pipeline, including improved performance and faster convergence compared to non-pretrained models.
arXiv Detail & Related papers (2023-09-03T20:50:53Z)
- ERNIE-Search: Bridging Cross-Encoder with Dual-Encoder via Self On-the-fly Distillation for Dense Passage Retrieval [54.54667085792404]
We propose a novel distillation method that significantly advances cross-architecture distillation for dual-encoders.
Our method 1) introduces a self on-the-fly distillation method that can effectively distill late interaction (i.e., ColBERT) to vanilla dual-encoder, and 2) incorporates a cascade distillation process to further improve the performance with a cross-encoder teacher.
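Distillation objectives like the ones listed here typically build on the classic temperature-scaled soft-label loss; a generic sketch (not ERNIE-Search's actual objective) is:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in standard knowledge distillation."""
    p = softmax(teacher_logits / T)
    q = softmax(student_logits / T)
    return float(T * T * np.sum(p * (np.log(p) - np.log(q))))

teacher = np.array([4.0, 1.0, -2.0])
student = np.array([3.5, 1.5, -1.0])
loss = kd_loss(teacher, student)
print(loss)
```

The temperature T softens both distributions so the student also learns from the teacher's relative preferences among non-top classes, not just its argmax.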
arXiv Detail & Related papers (2022-05-18T18:05:13Z)
- DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers [91.6129538027725]
We propose an early knowledge distillation framework, which is termed as DearKD, to improve the data efficiency required by transformers.
Our DearKD is a two-stage framework that first distills the inductive biases from the early intermediate layers of a CNN and then gives the transformer full play by training without distillation.
arXiv Detail & Related papers (2022-04-27T15:11:04Z)
- XAI for Transformers: Better Explanations through Conservative Propagation [60.67748036747221]
We show that the gradient in a Transformer reflects the function only locally, and thus fails to reliably identify the contribution of input features to the prediction.
Our proposal can be seen as a proper extension of the well-established LRP method to Transformers.
arXiv Detail & Related papers (2022-02-15T10:47:11Z)
- Pixel Distillation: A New Knowledge Distillation Scheme for Low-Resolution Image Recognition [124.80263629921498]
We propose Pixel Distillation that extends knowledge distillation into the input level while simultaneously breaking architecture constraints.
Such a scheme can achieve flexible cost control for deployment, as it allows the system to adjust both network architecture and image quality according to the overall requirement of resources.
arXiv Detail & Related papers (2021-12-17T14:31:40Z)
- Efficient Vision Transformers via Fine-Grained Manifold Distillation [96.50513363752836]
Vision transformer architectures have shown extraordinary performance on many computer vision tasks.
Although the network performance is boosted, transformers often require more computational resources.
We propose to excavate useful information from the teacher transformer through the relationship between images and the divided patches.
arXiv Detail & Related papers (2021-07-03T08:28:34Z)
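Distilling knowledge through relationships between patches, as in the manifold-distillation entry above, can be sketched with a minimal NumPy example (hypothetical shapes and a simple cosine-similarity relation, not the paper's fine-grained manifold loss): relation matrices live in patch-by-patch space, so they can be matched even when teacher and student channel widths differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patches, c_teacher, c_student = 16, 96, 48  # hypothetical patch count / widths

teacher_feats = rng.standard_normal((n_patches, c_teacher))
student_feats = rng.standard_normal((n_patches, c_student))

def patch_relation(feats):
    """Cosine-similarity matrix between patch features: a simple stand-in
    for the patch-level structure the teacher's knowledge flows through."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return normed @ normed.T  # shape (n_patches, n_patches)

# Both relation matrices are (n_patches x n_patches) regardless of channel
# width, so a distillation loss can compare them directly.
rel_loss = np.mean((patch_relation(teacher_feats) - patch_relation(student_feats)) ** 2)
print(rel_loss)
```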
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.