TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval
- URL: http://arxiv.org/abs/2104.08271v1
- Date: Fri, 16 Apr 2021 17:55:28 GMT
- Title: TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval
- Authors: Ioana Croitoru, Simion-Vlad Bogolin, Yang Liu, Samuel Albanie, Marius
Leordeanu, Hailin Jin, Andrew Zisserman
- Abstract summary: We propose a novel generalized distillation method, TeachText, for exploiting large-scale language pretraining.
We extend our method to video side modalities and show that we can effectively reduce the number of modalities used at test time.
Our approach advances the state of the art on several video retrieval benchmarks by a significant margin and adds no computational overhead at test time.
- Score: 103.85002875155551
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, considerable progress on the task of text-video retrieval
has been achieved by leveraging large-scale pretraining on visual and audio
datasets to construct powerful video encoders. By contrast, despite the natural
symmetry, the design of effective algorithms for exploiting large-scale
language pretraining remains under-explored. In this work, we are the first to
investigate the design of such algorithms and propose a novel generalized
distillation method, TeachText, which leverages complementary cues from
multiple text encoders to provide an enhanced supervisory signal to the
retrieval model. Moreover, we extend our method to video side modalities and
show that we can effectively reduce the number of used modalities at test time
without compromising performance. Our approach advances the state of the art on
several video retrieval benchmarks by a significant margin and adds no
computational overhead at test time. Last but not least, we show an effective
application of our method for eliminating noise from retrieval datasets. Code
and data can be found at https://www.robots.ox.ac.uk/~vgg/research/teachtext/.
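The core of the method lends itself to a compact sketch: several teacher retrieval models, each built on a different pretrained text encoder, produce text-video similarity matrices whose aggregate serves as an enhanced training target for the student. Below is a minimal PyTorch sketch; the mean aggregation, the max-margin ranking term, and the MSE distillation term are illustrative assumptions, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def teachtext_loss(student_sim, teacher_sims, margin=0.2, distill_weight=1.0):
    """Contrastive retrieval loss plus a TeachText-style distillation term.

    student_sim:  (B, B) text-video similarity matrix from the student model.
    teacher_sims: list of (B, B) similarity matrices, one per teacher
                  (each teacher uses a different pretrained text encoder).
    The mean aggregation and the MSE distillation term are illustrative
    assumptions; see the paper for the exact formulation.
    """
    b = student_sim.size(0)
    # Standard bidirectional max-margin ranking loss; matched pairs sit on
    # the diagonal of the similarity matrix.
    pos = student_sim.diag().view(b, 1)
    cost_t2v = (margin + student_sim - pos).clamp(min=0)      # rows: text queries
    cost_v2t = (margin + student_sim - pos.t()).clamp(min=0)  # cols: video queries
    mask = ~torch.eye(b, dtype=torch.bool, device=student_sim.device)
    ranking = cost_t2v[mask].mean() + cost_v2t[mask].mean()

    # Distillation: regress the student similarities toward the aggregated
    # teacher similarities (the enhanced supervisory signal).
    target = torch.stack(teacher_sims).mean(dim=0).detach()
    distill = F.mse_loss(student_sim, target)
    return ranking + distill_weight * distill
```

Because the distillation term acts only on training-time similarity matrices, the student's architecture is unchanged, consistent with the abstract's claim of zero added computational overhead at test time.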
Related papers
- DeTeCtive: Detecting AI-generated Text via Multi-Level Contrastive Learning [24.99797253885887]
We argue that the key to accomplishing this task lies in distinguishing writing styles of different authors.
We propose DeTeCtive, a multi-task auxiliary, multi-level contrastive learning framework.
Our method is compatible with a range of text encoders.
arXiv Detail & Related papers (2024-10-28T12:34:49Z)
- CLIP-VAD: Exploiting Vision-Language Models for Voice Activity Detection [2.110168344647122]
Voice Activity Detection (VAD) is the process of automatically determining whether a person is speaking and identifying the timing of their speech.
We introduce a novel approach leveraging Contrastive Language-Image Pretraining (CLIP) models.
Our approach outperforms several audio-visual methods despite its simplicity, and without requiring pre-training on extensive audio-visual datasets.
arXiv Detail & Related papers (2024-10-18T14:43:34Z)
- GoMatching: A Simple Baseline for Video Text Spotting via Long and Short Term Matching [77.0306273129475]
Video text spotting presents an augmented challenge with the inclusion of tracking.
GoMatching focuses the training efforts on tracking while maintaining strong recognition performance.
GoMatching delivers new records on ICDAR15-video, DSText, BOVText, and our proposed novel test with arbitrary-shaped text termed ArTVideo.
arXiv Detail & Related papers (2024-01-13T13:59:15Z)
- Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning [15.998149438353133]
We propose a two-stage retrieval architecture for text-to-video retrieval.
In the training phase, we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning.
In the retrieval phase, we use coarse-grained video representations for fast recall of top-k candidates, which are then reranked by fine-grained video representations.
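A minimal sketch of this recall-then-rerank pattern follows; the function names (`two_stage_retrieval`, `fine_score_fn`) are hypothetical stand-ins for the paper's coarse-grained recall and TIB-based reranking.

```python
import numpy as np

def two_stage_retrieval(text_emb, coarse_video_embs, fine_score_fn, k=50):
    """Illustrative recall-then-rerank pipeline (names and shapes are assumptions).

    text_emb:          (D,) query embedding.
    coarse_video_embs: (N, D) cheap, coarse-grained video embeddings.
    fine_score_fn:     callable(query, candidate_index) -> float, the expensive
                       fine-grained scorer applied only to the k survivors.
    """
    # Stage 1: fast recall of top-k candidates by coarse cosine similarity.
    norms = np.linalg.norm(coarse_video_embs, axis=1) * np.linalg.norm(text_emb)
    coarse_scores = coarse_video_embs @ text_emb / np.maximum(norms, 1e-8)
    candidates = np.argsort(-coarse_scores)[:k]

    # Stage 2: rerank only the recalled candidates with the fine-grained scorer.
    fine_scores = [fine_score_fn(text_emb, int(i)) for i in candidates]
    order = np.argsort(-np.asarray(fine_scores))
    return candidates[order]
```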
arXiv Detail & Related papers (2024-01-01T08:54:18Z)
- Enhancing Diffusion Models with Text-Encoder Reinforcement Learning [63.41513909279474]
Text-to-image diffusion models are typically trained to optimize the log-likelihood objective.
Recent research addresses this issue by refining the diffusion U-Net using human rewards through reinforcement learning or direct backpropagation.
We demonstrate that by finetuning the text encoder through reinforcement learning, we can enhance the text-image alignment of the results.
arXiv Detail & Related papers (2023-11-27T09:39:45Z)
- TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale [59.01246141215051]
We analyze the factor that leads to degradation from the perspective of language supervision.
We propose a degradation-free pre-training strategy to retain the generalization ability of the text encoder.
We produce a series of models, dubbed TVTSv2, with up to one billion parameters.
arXiv Detail & Related papers (2023-05-23T15:44:56Z)
- Tram: A Token-level Retrieval-augmented Mechanism for Source Code Summarization [76.57699934689468]
We propose a fine-grained Token-level retrieval-augmented mechanism (Tram) on the decoder side to enhance the performance of neural models.
To overcome the challenge of token-level retrieval in capturing contextual code semantics, we also propose integrating code semantics into individual summary tokens.
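For illustration, here is a kNN-LM-style stand-in for token-level retrieval augmentation at decoding time; Tram's actual mechanism (including its integration of code semantics into summary tokens) differs, so the names and the interpolation scheme below are assumptions.

```python
import torch
import torch.nn.functional as F

def retrieval_augmented_step(hidden, vocab_logits, keys, values,
                             k=8, lam=0.5, temperature=10.0):
    """One decoding step with token-level retrieval augmentation (a sketch).

    hidden:       (D,) decoder hidden state at the current step.
    vocab_logits: (V,) logits from the base summarization model.
    keys:         (N, D) stored decoder states from the retrieval corpus.
    values:       (N,) int64 token ids that followed each stored state.
    """
    # Retrieve the k nearest stored states by L2 distance.
    dists = torch.cdist(hidden.unsqueeze(0), keys).squeeze(0)  # (N,)
    knn_d, knn_i = dists.topk(k, largest=False)

    # Turn neighbour distances into a distribution over their next tokens.
    knn_p = F.softmax(-knn_d / temperature, dim=0)
    retrieved = torch.zeros_like(vocab_logits)
    retrieved.scatter_add_(0, values[knn_i], knn_p)

    # Interpolate the model's distribution with the retrieval distribution.
    return lam * F.softmax(vocab_logits, dim=0) + (1 - lam) * retrieved
```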
arXiv Detail & Related papers (2023-05-18T16:02:04Z)
- A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval [16.548016892117083]
Text-video retrieval methods have received increased attention over the past few years.
Data augmentation techniques were introduced to improve performance on unseen test examples.
We propose a multimodal data augmentation technique which works in the feature space and creates new videos and captions by mixing semantically similar samples.
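A minimal sketch of the idea, assuming cosine similarity for pairing samples and a mixup-style Beta-distributed coefficient (both illustrative choices, not necessarily the paper's):

```python
import torch
import torch.nn.functional as F

def feature_space_mix(video_feats, text_feats, alpha=0.4):
    """Create new (video, caption) feature pairs by interpolating semantically
    similar samples in the shared feature space.

    video_feats, text_feats: (B, D) paired embeddings. The similarity-based
    pairing and Beta-distributed mixing coefficient are assumptions.
    """
    # Pair each sample with its most similar *other* sample in the batch,
    # using text-side cosine similarity as a proxy for semantic closeness.
    t = F.normalize(text_feats, dim=1)
    sim = t @ t.t()
    sim.fill_diagonal_(-float("inf"))  # exclude self-pairs
    partner = sim.argmax(dim=1)

    # Mixup-style interpolation applied consistently to both modalities,
    # so each new video stays aligned with its new caption.
    lam = torch.distributions.Beta(alpha, alpha).sample((video_feats.size(0), 1))
    lam = lam.to(video_feats.device)
    mixed_video = lam * video_feats + (1 - lam) * video_feats[partner]
    mixed_text = lam * text_feats + (1 - lam) * text_feats[partner]
    return mixed_video, mixed_text
```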
arXiv Detail & Related papers (2022-08-03T14:05:20Z)
- TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment [68.08689660963468]
A new algorithm called Token-Aware Cascade contrastive learning (TACo) improves contrastive learning using two novel techniques.
We set a new state of the art on three public text-video retrieval benchmarks: YouCook2, MSR-VTT, and ActivityNet.
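As a rough illustration of the token-aware ingredient, the sketch below weights per-token text-video similarities before a symmetric InfoNCE loss; the weighting scheme and aggregation are assumptions, and TACo's cascade sampling is omitted.

```python
import torch
import torch.nn.functional as F

def token_aware_contrastive(token_feats, token_weights, video_feats,
                            temperature=0.05):
    """Token-aware contrastive loss sketch (weighting scheme is an assumption).

    token_feats:   (B, T, D) per-token text embeddings.
    token_weights: (B, T) non-negative weights, higher for content words
                   (e.g. nouns and verbs), zero for padding.
    video_feats:   (B, D) video embeddings.
    """
    token_feats = F.normalize(token_feats, dim=-1)
    video_feats = F.normalize(video_feats, dim=-1)

    # Per-token similarity of every caption against every video: (B, T, B).
    tok_sim = torch.einsum("btd,kd->btk", token_feats, video_feats)

    # Aggregate tokens into sentence-video similarities, emphasizing the
    # informative tokens via normalized weights.
    w = token_weights / token_weights.sum(dim=1, keepdim=True).clamp(min=1e-8)
    sim = torch.einsum("btk,bt->bk", tok_sim, w) / temperature

    # Symmetric InfoNCE with matched pairs on the diagonal.
    labels = torch.arange(sim.size(0), device=sim.device)
    return 0.5 * (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels))
```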
arXiv Detail & Related papers (2021-08-23T07:24:57Z)
- Text2Video: Text-driven Talking-head Video Synthesis with Phonetic Dictionary [10.590649169151055]
We present a novel approach to synthesizing video from text.
The method builds a phoneme-pose dictionary and trains a generative adversarial network (GAN) to generate video.
Compared to audio-driven video generation algorithms, our approach has a number of advantages.
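The dictionary step can be sketched as a simple lookup that turns text into a key-pose sequence for the GAN to render; all names below are hypothetical.

```python
def poses_from_text(text, g2p, phoneme_pose_dict, default_pose):
    """Map a sentence to a key-pose sequence via a phoneme-pose dictionary.

    g2p:               callable mapping a word to a list of phonemes.
    phoneme_pose_dict: dict from phoneme string to a mouth/face key pose.
    The resulting key poses would then condition a GAN that renders frames.
    """
    poses = []
    for word in text.lower().split():
        for phoneme in g2p(word):
            # Fall back to a neutral pose for phonemes missing from the dictionary.
            poses.append(phoneme_pose_dict.get(phoneme, default_pose))
    return poses
```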
arXiv Detail & Related papers (2021-04-29T19:54:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.