TeachCLIP: Multi-Grained Teaching for Efficient Text-to-Video Retrieval
- URL: http://arxiv.org/abs/2308.01217v1
- Date: Wed, 2 Aug 2023 15:22:00 GMT
- Title: TeachCLIP: Multi-Grained Teaching for Efficient Text-to-Video Retrieval
- Authors: Kaibin Tian, Ruixiang Zhao, Hu Hu, Runquan Xie, Fengzong Lian, Zhanhui
Kang and Xirong Li
- Abstract summary: We propose TeachCLIP with multi-grained teaching to let a CLIP4Clip-based student network learn from more advanced yet computationally heavy models.
AFA provides a fine-grained learning (teaching) channel for the student (teacher).
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For text-to-video retrieval (T2VR), which aims to retrieve unlabeled videos
by ad-hoc textual queries, CLIP-based methods are dominating. Compared to
CLIP4Clip which is efficient and compact, the state-of-the-art models tend to
compute video-text similarity by fine-grained cross-modal feature interaction
and matching, putting their scalability for large-scale T2VR into doubt. For
efficient T2VR, we propose TeachCLIP with multi-grained teaching to let a
CLIP4Clip-based student network learn from more advanced yet computationally
heavy models such as X-CLIP, TS2-Net and X-Pool. To improve the student's
learning capability, we add an Attentional frame-Feature Aggregation (AFA)
block, which by design adds no extra storage/computation overhead at the
retrieval stage. While attentive weights produced by AFA are commonly used for
combining frame-level features, we propose a novel use of the weights to let
them imitate frame-text relevance estimated by the teacher network. As such,
AFA provides a fine-grained learning (teaching) channel for the student
(teacher). Extensive experiments on multiple public datasets justify the
viability of the proposed method.
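The sketch below is a minimal, illustrative PyTorch rendering of the two ideas named in the abstract: an attention block that pools frame features into a single video vector, and a multi-grained distillation loss in which the student's video-text similarities imitate the teacher's (coarse-grained) while the AFA attention weights imitate the teacher's frame-text relevance (fine-grained). All tensor shapes, the temperature, the loss weighting, and the teacher interface are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AFA(nn.Module):
    """Attentional frame-Feature Aggregation (illustrative sketch)."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # one scalar score per frame

    def forward(self, frame_feats: torch.Tensor):
        # frame_feats: (batch, num_frames, dim) CLIP frame embeddings
        weights = self.scorer(frame_feats).squeeze(-1)   # (B, F)
        attn = weights.softmax(dim=-1)                   # per-frame weights
        video_feat = torch.einsum("bf,bfd->bd", attn, frame_feats)
        return video_feat, attn


def multi_grained_teaching_loss(student_sim, teacher_sim,
                                student_attn, teacher_frame_relevance,
                                tau: float = 0.05):
    # Coarse-grained channel: student text-video similarities imitate the teacher's.
    coarse = F.kl_div((student_sim / tau).log_softmax(dim=-1),
                      (teacher_sim / tau).softmax(dim=-1),
                      reduction="batchmean")
    # Fine-grained channel: AFA attention weights imitate the teacher's
    # frame-text relevance, giving the student frame-level supervision.
    fine = F.kl_div(student_attn.clamp_min(1e-8).log(),
                    teacher_frame_relevance.softmax(dim=-1),
                    reduction="batchmean")
    return coarse + fine


# Toy usage with random tensors standing in for CLIP features.
B, NF, D = 4, 12, 512
frame_feats = torch.randn(B, NF, D)
afa = AFA(D)
video_feat, attn = afa(frame_feats)
text_feat = torch.randn(B, D)
student_sim = F.normalize(text_feat, dim=-1) @ F.normalize(video_feat, dim=-1).t()
teacher_sim = torch.randn(B, B)    # e.g. from X-CLIP / TS2-Net / X-Pool
teacher_rel = torch.randn(B, NF)   # teacher-estimated frame-text relevance
loss = multi_grained_teaching_loss(student_sim, teacher_sim, attn, teacher_rel)
print(loss.item())
```

At retrieval time only the pooled video vector and the text vector are compared, so a block like AFA adds nothing beyond plain CLIP4Clip-style matching, which is the efficiency argument made in the abstract.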
Related papers
- VLAB: Enhancing Video Language Pre-training by Feature Adapting and
Blending [78.1399386935455]
Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations.
We propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending.
VLAB transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks.
arXiv Detail & Related papers (2023-05-22T15:54:22Z)
- Fine-tuned CLIP Models are Efficient Video Learners [54.96069171726668]
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model.
Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos.
arXiv Detail & Related papers (2022-12-06T18:59:58Z)
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment [146.3128011522151]
We propose an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
arXiv Detail & Related papers (2022-09-14T05:47:02Z)
- Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which can take into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z)
- TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment [68.08689660963468]
A new algorithm called Token-Aware Cascade contrastive learning (TACo) improves contrastive learning using two novel techniques.
We set new state-of-the-art on three public text-video retrieval benchmarks of YouCook2, MSR-VTT and ActivityNet.
arXiv Detail & Related papers (2021-08-23T07:24:57Z)
- Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) aims to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
Experimental results show that, although ReLoCLNet encodes text and video separately for efficiency, its retrieval accuracy is comparable with baselines adopting cross-modal interaction learning.
arXiv Detail & Related papers (2021-05-13T12:54:39Z)
- CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval [31.7091206926183]
CLIP (Contrastive Language-Image Pre-training) has demonstrated the power of learning visual concepts from web-collected image-text datasets.
We propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner (a minimal sketch of this style of retrieval follows this list).
arXiv Detail & Related papers (2021-04-18T13:59:50Z)
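For context, the simplest CLIP4Clip variant ("mean pooling") reduces retrieval to comparing a text embedding with the average of a video's frame embeddings. The sketch below illustrates that baseline only; the tensor shapes and random inputs standing in for the actual CLIP encoders are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def mean_pool_similarity(text_emb: torch.Tensor, frame_embs: torch.Tensor) -> torch.Tensor:
    """text_emb: (num_texts, dim); frame_embs: (num_videos, num_frames, dim)."""
    video_emb = frame_embs.mean(dim=1)          # average frame features
    text_emb = F.normalize(text_emb, dim=-1)    # cosine similarity via
    video_emb = F.normalize(video_emb, dim=-1)  # normalized dot products
    return text_emb @ video_emb.t()             # (num_texts, num_videos)


# Toy usage: rank 100 videos for 5 text queries.
sims = mean_pool_similarity(torch.randn(5, 512), torch.randn(100, 12, 512))
top3 = sims.argsort(dim=-1, descending=True)[:, :3]
print(top3)
```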