Knowledge Amalgamation for Object Detection with Transformers
- URL: http://arxiv.org/abs/2203.03187v1
- Date: Mon, 7 Mar 2022 07:45:22 GMT
- Title: Knowledge Amalgamation for Object Detection with Transformers
- Authors: Haofei Zhang, Feng Mao, Mengqi Xue, Gongfan Fang, Zunlei Feng, Jie
Song, Mingli Song
- Abstract summary: Knowledge amalgamation (KA) is a novel deep model reusing task aiming to transfer knowledge from several well-trained teachers to a compact student.
We propose to dissolve the KA into two aspects: sequence-level amalgamation (SA) and task-level amalgamation (TA).
In particular, a hint is generated within the sequence-level amalgamation by concatenating teacher sequences instead of redundantly aggregating them to a fixed-size one.
- Score: 36.7897364648987
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge amalgamation (KA) is a novel deep model reusing task aiming to
transfer knowledge from several well-trained teachers to a multi-talented and
compact student. Currently, most of these approaches are tailored for
convolutional neural networks (CNNs). However, transformers, built on a
completely different architecture, have begun to challenge the dominance of
CNNs in many computer vision tasks. Nevertheless,
directly applying the previous KA methods to transformers leads to severe
performance degradation. In this work, we explore a more effective KA scheme
for transformer-based object detection models. Specifically, considering the
architecture characteristics of transformers, we propose to dissolve the KA
into two aspects: sequence-level amalgamation (SA) and task-level amalgamation
(TA). In particular, within the sequence-level amalgamation, a hint is
generated by concatenating the teacher sequences instead of redundantly
aggregating them into a fixed-size one, as previous KA works do. In addition,
in the task-level amalgamation, the student efficiently learns heterogeneous
detection tasks through soft targets. Extensive experiments on PASCAL VOC and
COCO show that the sequence-level amalgamation significantly boosts the
performance of students, whereas the previous methods impair them. Moreover,
the transformer-based students excel at learning the amalgamated knowledge:
they master heterogeneous detection tasks rapidly and achieve performance
superior or at least comparable to that of the teachers in their
specializations.
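To make the two amalgamation aspects concrete, here is a minimal PyTorch-style sketch, not the authors' implementation: the hint for sequence-level amalgamation is formed by concatenating the teacher token sequences, and task-level amalgamation is written as standard soft-target distillation. All names (`sequence_level_hint`, `proj`, `temperature`) and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def sequence_level_hint(teacher_tokens_a, teacher_tokens_b):
    """Form the SA hint by concatenating teacher sequences along the token axis.

    Shapes are assumed to be (batch, num_tokens, dim). Unlike earlier KA methods
    that aggregate teachers into a single fixed-size feature, the hint keeps the
    two token sequences side by side.
    """
    return torch.cat([teacher_tokens_a, teacher_tokens_b], dim=1)


def sequence_amalgamation_loss(student_tokens, hint, proj):
    """Match projected student tokens to the concatenated teacher hint.

    Assumes the student sequence already has as many tokens as the hint and
    that `proj` is a learnable linear layer aligning channel dimensions; this
    alignment scheme is an assumption of the sketch, not the paper's recipe.
    """
    return F.mse_loss(proj(student_tokens), hint)


def task_amalgamation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target (task-level) distillation from one teacher's detection head."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)
```

A training objective would presumably combine the usual detection loss with weighted SA and TA terms (e.g., `det_loss + w_sa * sa_loss + w_ta * ta_loss`); the weighting is likewise an assumption, not taken from the paper.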
Related papers
- Weight Copy and Low-Rank Adaptation for Few-Shot Distillation of Vision Transformers [22.1372572833618]
We propose a novel few-shot feature distillation approach for vision transformers.
We first copy the weights from intermittent layers of existing vision transformers into shallower architectures (students).
Next, we employ an enhanced version of Low-Rank Adaptation (LoRA) to distill knowledge into the student in a few-shot scenario.
arXiv Detail & Related papers (2024-04-14T18:57:38Z)
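The low-rank adaptation referenced in this entry can be illustrated with a plain LoRA wrapper around a linear layer. This is generic LoRA, not the paper's "enhanced version", and the class name, rank, and scaling defaults below are assumptions.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank update (plain LoRA).

    y = W x + (alpha / r) * B A x, where W is copied from the teacher and kept
    frozen, while A (r x in_features) and B (out_features x r) are the only
    trainable parameters. Names and defaults are illustrative.
    """

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # teacher-copied weights stay frozen
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T) @ self.lora_b.T
```

In a few-shot distillation setting like the one described, layers copied from the teacher would be wrapped this way so that only the low-rank factors are trained on the small transfer set.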
- Remembering Transformer for Continual Learning [9.879896956915598]
We propose Remembering Transformer, inspired by the brain's Complementary Learning Systems.
Remembering Transformer employs a mixture-of-adapters architecture and a generative model-based novelty detection mechanism.
We conducted extensive experiments, including ablation studies on the novelty detection mechanism and model capacity of the mixture-of-adapters.
arXiv Detail & Related papers (2024-04-11T07:22:14Z)
- Associative Transformer [26.967506484952214]
We propose Associative Transformer (AiT) to enhance the association among sparsely attended input patches.
AiT requires significantly fewer parameters and attention layers while outperforming Vision Transformers and a broad range of sparse Transformers.
arXiv Detail & Related papers (2023-09-22T13:37:10Z)
- MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders are Better Dense Retrievers [140.0479479231558]
In this work, we aim to unify a variety of pre-training tasks into a multi-task pre-trained model, namely MASTER.
MASTER utilizes a shared-encoder multi-decoder architecture that can construct a representation bottleneck to compress the abundant semantic information across tasks into dense vectors.
arXiv Detail & Related papers (2022-12-15T13:57:07Z)
- A Neural ODE Interpretation of Transformer Layers [8.839601328192957]
Transformer layers, which use an alternating pattern of multi-head attention and multi-layer perceptron (MLP) layers, provide an effective tool for a variety of machine learning problems.
We build upon this connection and propose a modification of the internal architecture of a transformer layer.
Our experiments show that this simple modification improves the performance of transformer networks in multiple tasks.
arXiv Detail & Related papers (2022-12-12T16:18:58Z)
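The connection exploited in this entry rests on a standard observation: a residual update x <- x + f(x) is one explicit Euler step of dx/dt = f(x). Below is a minimal pre-norm transformer block written with an explicit step size to make that reading visible; it illustrates the ODE view only and is not the modified layer proposed in the paper (the `dt` parameter and module sizes are assumptions).

```python
import torch.nn as nn


class EulerTransformerBlock(nn.Module):
    """Pre-norm transformer block written as explicit Euler steps.

    Each residual update x <- x + dt * f(x) is one Euler step of dx/dt = f(x);
    a standard block corresponds to dt = 1. Purely illustrative of the ODE
    view, not the paper's proposed modification.
    """

    def __init__(self, dim: int, heads: int = 8, dt: float = 1.0):
        super().__init__()
        self.dt = dt
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.dt * self.attn(h, h, h, need_weights=False)[0]  # attention step
        x = x + self.dt * self.mlp(self.norm2(x))  # MLP step
        return x
```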
- E2-AEN: End-to-End Incremental Learning with Adaptively Expandable Network [57.87240860624937]
We propose an end-to-end trainable adaptively expandable network named E2-AEN.
It dynamically generates lightweight structures for new tasks without any accuracy drop in previous tasks.
E2-AEN reduces cost and can be built upon any feed-forward architectures in an end-to-end manner.
arXiv Detail & Related papers (2022-07-14T09:04:51Z)
- Continual Object Detection via Prototypical Task Correlation Guided Gating Mechanism [120.1998866178014]
We present a flexible framework for continual object detection via pRotOtypical taSk corrElaTion guided gaTing mechAnism (ROSETTA).
Concretely, a unified framework is shared by all tasks while task-aware gates are introduced to automatically select sub-models for specific tasks.
Experiments on COCO-VOC, KITTI-Kitchen, class-incremental detection on VOC and sequential learning of four tasks show that ROSETTA yields state-of-the-art performance.
arXiv Detail & Related papers (2022-05-06T07:31:28Z)
- CvT-ASSD: Convolutional vision-Transformer Based Attentive Single Shot MultiBox Detector [15.656374849760734]
We present a novel object detection architecture, named Convolutional vision Transformer Based Attentive Single Shot MultiBox Detector (CvT-ASSD).
Our model CvT-ASSD can lead to good system efficiency and performance while being pretrained on large-scale detection datasets such as PASCAL VOC and MS COCO.
arXiv Detail & Related papers (2021-10-24T06:45:33Z)
- Distilling Knowledge via Knowledge Review [69.15050871776552]
We study the factor of connection path cross levels between teacher and student networks, and reveal its great importance.
For the first time in knowledge distillation, cross-stage connection paths are proposed.
Our finally designed nested and compact framework requires negligible overhead, and outperforms other methods on a variety of tasks.
arXiv Detail & Related papers (2021-04-19T04:36:24Z)
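The cross-stage connection paths mentioned in this entry can be pictured with a schematic feature-distillation loss in which student stage i is matched against teacher stages i, i-1, ..., 0. This is only a sketch of the general idea under assumed shapes and adapters; the paper's nested fusion and loss design is more elaborate.

```python
import torch.nn.functional as F


def cross_stage_kd_loss(student_feats, teacher_feats, adapters):
    """Schematic cross-stage feature distillation.

    `student_feats` and `teacher_feats` are lists of (batch, C, H, W) feature
    maps ordered from shallow to deep. `adapters` is assumed to be an
    nn.ModuleDict of 1x1 convs keyed by "i->j" that map student stage i
    channels to teacher stage j channels. Illustrative only.
    """
    loss = 0.0
    for i, s in enumerate(student_feats):
        for j in range(i + 1):  # connection paths crossing stages
            t = teacher_feats[j]
            s_ij = adapters[f"{i}->{j}"](s)
            # align spatial resolution before matching features
            s_ij = F.interpolate(
                s_ij, size=t.shape[-2:], mode="bilinear", align_corners=False
            )
            loss = loss + F.mse_loss(s_ij, t)
    return loss
```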
- UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers [108.92194081987967]
We make the first attempt to explore a universal multi-agent reinforcement learning pipeline, designing a single architecture to fit different tasks.
Unlike previous RNN-based models, we utilize a transformer-based model to generate a flexible policy.
The proposed model, named as Universal Policy Decoupling Transformer (UPDeT), further relaxes the action restriction and makes the multi-agent task's decision process more explainable.
arXiv Detail & Related papers (2021-01-20T07:24:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.