Efficient Temporal Sentence Grounding in Videos with Multi-Teacher Knowledge Distillation
- URL: http://arxiv.org/abs/2308.03725v2
- Date: Wed, 24 Jul 2024 05:49:27 GMT
- Title: Efficient Temporal Sentence Grounding in Videos with Multi-Teacher Knowledge Distillation
- Authors: Renjie Liang, Yiming Yang, Hui Lu, Li Li,
- Abstract summary: Temporal Sentence Grounding in Videos (TSGV) aims to detect the event timestamps described by the natural language query from untrimmed videos.
This paper discusses the challenge of achieving efficient computation in TSGV models while maintaining high performance.
- Score: 29.952771954087602
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal Sentence Grounding in Videos (TSGV) aims to detect the event timestamps described by the natural language query from untrimmed videos. This paper discusses the challenge of achieving efficient computation in TSGV models while maintaining high performance. Most existing approaches exquisitely design complex architectures to improve accuracy with extra layers and loss, suffering from inefficiency and heaviness. Although some works have noticed that, they only make an issue of feature fusion layers, which can hardly enjoy the highspeed merit in the whole clunky network. To tackle this problem, we propose a novel efficient multi-teacher model (EMTM) based on knowledge distillation to transfer diverse knowledge from both heterogeneous and isomorphic networks. Specifically, We first unify different outputs of the heterogeneous models into one single form. Next, a Knowledge Aggregation Unit (KAU) is built to acquire high-quality integrated soft labels from multiple teachers. After that, the KAU module leverages the multi-scale video and global query information to adaptively determine the weights of different teachers. A Shared Encoder strategy is then proposed to solve the problem that the student shallow layers hardly benefit from teachers, in which an isomorphic teacher is collaboratively trained with the student to align their hidden states. Extensive experimental results on three popular TSGV benchmarks demonstrate that our method is both effective and efficient without bells and whistles.
Related papers
- CFTS-GAN: Continual Few-Shot Teacher Student for Generative Adversarial Networks [0.5024983453990064]
Few-shot and continual learning face two well-known challenges in GANs: overfitting and catastrophic forgetting.
This paper proposes a Continual Few-shot Teacher-Student technique for the generative adversarial network (CFTS-GAN) that considers both challenges together.
arXiv Detail & Related papers (2024-10-17T20:49:08Z) - VideoAdviser: Video Knowledge Distillation for Multimodal Transfer
Learning [6.379202839994046]
Multimodal transfer learning aims to transform pretrained representations of diverse modalities into a common domain space for effective multimodal fusion.
We propose VideoAdviser, a video knowledge distillation method to transfer multimodal knowledge of video-enhanced prompts from a multimodal fundamental model to a specific modal fundamental model.
We evaluate our method in two challenging multimodal tasks: video-level sentiment analysis and audio-visual retrieval.
arXiv Detail & Related papers (2023-09-27T08:44:04Z) - Unmasked Teacher: Towards Training-Efficient Video Foundation Models [50.19560876891811]
Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity.
This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing methods.
Our model can handle various tasks including scene-related, temporal-related, and complex video-language understanding.
arXiv Detail & Related papers (2023-03-28T15:39:28Z) - Masked Video Distillation: Rethinking Masked Feature Modeling for
Self-supervised Video Representation Learning [123.63301596019522]
Masked video distillation (MVD) is a simple yet effective two-stage masked feature modeling framework for video representation learning.
For the choice of teacher models, we observe that students taught by video teachers perform better on temporally-heavy video tasks.
We design a spatial-temporal co-teaching method for MVD to leverage the advantage of different teachers.
arXiv Detail & Related papers (2022-12-08T18:59:59Z) - Weakly Supervised Semantic Segmentation via Alternative Self-Dual
Teaching [82.71578668091914]
This paper establishes a compact learning framework that embeds the classification and mask-refinement components into a unified deep model.
We propose a novel alternative self-dual teaching (ASDT) mechanism to encourage high-quality knowledge interaction.
arXiv Detail & Related papers (2021-12-17T11:56:56Z) - DnS: Distill-and-Select for Efficient and Accurate Video Indexing and
Retrieval [23.42790810694723]
We propose a Knowledge Distillation framework, which we call Distill-and-Select (DnS)
We train several students with different architectures and arrive at different trade-offs of performance and efficiency.
Importantly, the proposed scheme allows Knowledge Distillation in large, unlabelled datasets -- this leads to good students.
arXiv Detail & Related papers (2021-06-24T18:34:24Z) - Distilling a Powerful Student Model via Online Knowledge Distillation [158.68873654990895]
Existing online knowledge distillation approaches either adopt the student with the best performance or construct an ensemble model for better holistic performance.
We propose a novel method for online knowledge distillation, termed FFSD, which comprises two key components: Feature Fusion and Self-Distillation.
arXiv Detail & Related papers (2021-03-26T13:54:24Z) - Differentiable Feature Aggregation Search for Knowledge Distillation [47.94874193183427]
We introduce the feature aggregation to imitate the multi-teacher distillation in the single-teacher distillation framework.
DFA is a two-stage Differentiable Feature Aggregation search method motivated by DARTS in neural architecture search.
Experimental results show that DFA outperforms existing methods on CIFAR-100 and CINIC-10 datasets.
arXiv Detail & Related papers (2020-08-02T15:42:29Z) - Efficient Crowd Counting via Structured Knowledge Transfer [122.30417437707759]
Crowd counting is an application-oriented task and its inference efficiency is crucial for real-world applications.
We propose a novel Structured Knowledge Transfer framework to generate a lightweight but still highly effective student network.
Our models obtain at least 6.5$times$ speed-up on an Nvidia 1080 GPU and even achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-03-23T08:05:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.