RaP: Redundancy-aware Video-language Pre-training for Text-Video
Retrieval
- URL: http://arxiv.org/abs/2210.06881v1
- Date: Thu, 13 Oct 2022 10:11:41 GMT
- Title: RaP: Redundancy-aware Video-language Pre-training for Text-Video
Retrieval
- Authors: Xing Wu, Chaochen Gao, Zijia Lin, Zhongyuan Wang, Jizhong Han, Songlin
Hu
- Abstract summary: We propose Redundancy-aware Video-language Pre-training.
We design a redundancy measurement of video patches and text tokens by calculating the cross-modal minimum dissimilarity.
We evaluate our method on four benchmark datasets, MSRVTT, MSVD, DiDeMo, and LSMDC.
- Score: 61.77760317554826
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video language pre-training methods have mainly adopted sparse sampling
techniques to alleviate the temporal redundancy of videos. Though effective,
sparse sampling still suffers from inter-modal redundancy: visual redundancy and
textual redundancy. Compared with highly generalized text, sparsely sampled
frames usually contain text-independent portions, called visual redundancy.
Sparse sampling is also likely to miss important frames corresponding to some
text portions, resulting in textual redundancy. Inter-modal redundancy leads to
a mismatch of video and text information, hindering the model from better
learning the shared semantics across modalities. To alleviate it, we propose
Redundancy-aware Video-language Pre-training. We design a redundancy
measurement of video patches and text tokens by calculating the cross-modal
minimum dissimilarity. Then, we penalize the highly redundant video patches and
text tokens through a proposed redundancy-aware contrastive learning. We
evaluate our method on four benchmark datasets, MSRVTT, MSVD, DiDeMo, and
LSMDC, achieving a significant improvement over the previous state-of-the-art
results. Our code is available at
https://github.com/caskcsg/VLP/tree/main/RaP.
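The redundancy measurement described in the abstract can be illustrated with a short sketch. The PyTorch snippet below computes, for each video patch and text token, its minimum dissimilarity to the other modality and uses it to down-weight highly redundant elements before a standard symmetric contrastive loss. The tensor shapes, the softmax-based weighting, and all function names are illustrative assumptions rather than the actual RaP implementation (see the linked repository for the authors' code).

```python
# Hypothetical sketch of redundancy-aware weighting for contrastive pre-training.
# Shapes, names, and the weighting scheme are assumptions, not the RaP code.
import torch
import torch.nn.functional as F


def redundancy_scores(patch_emb: torch.Tensor, token_emb: torch.Tensor):
    """Cross-modal minimum dissimilarity.

    patch_emb: (B, Np, D) video patch embeddings
    token_emb: (B, Nt, D) text token embeddings
    A patch (token) whose closest token (patch) is still dissimilar is treated
    as redundant, i.e. weakly grounded in the other modality.
    """
    p = F.normalize(patch_emb, dim=-1)
    t = F.normalize(token_emb, dim=-1)
    dissim = 1.0 - torch.bmm(p, t.transpose(1, 2))      # (B, Np, Nt)
    patch_red = dissim.min(dim=2).values                 # (B, Np)
    token_red = dissim.min(dim=1).values                 # (B, Nt)
    return patch_red, token_red


def redundancy_aware_pool(emb: torch.Tensor, red: torch.Tensor, tau: float = 0.1):
    """Down-weight high-redundancy elements before pooling (one possible penalty)."""
    w = torch.softmax(-red / tau, dim=1).unsqueeze(-1)   # low redundancy -> high weight
    return (w * emb).sum(dim=1)                          # (B, D)


def contrastive_loss(video_emb: torch.Tensor, text_emb: torch.Tensor, temp: float = 0.05):
    """Symmetric InfoNCE over the redundancy-weighted global embeddings."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temp
    labels = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```

In this sketch, a patch whose nearest text token is still dissimilar receives a low pooling weight, which is one simple way to penalize high-redundancy elements; the paper's redundancy-aware contrastive learning may apply the penalty differently.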
Related papers
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent foreground video content, with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of the visual information unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
- Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation [92.55296042611886]
We propose a framework called "Reuse and Diffuse", dubbed VidRD, to produce more frames following the frames already generated by an LDM.
We also propose a set of strategies for composing video-text data that involve diverse content from multiple existing datasets.
arXiv Detail & Related papers (2023-09-07T08:12:58Z)
- Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models [58.17315970207874]
We propose a zero-shot method for adapting generalisable visual-textual priors from an arbitrary VLM to facilitate moment-text alignment.
Experiments conducted on three VMR benchmark datasets demonstrate the notable performance advantages of our zero-shot algorithm.
arXiv Detail & Related papers (2023-09-01T13:06:50Z)
- TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale [59.01246141215051]
We analyze the factor that leads to degradation from the perspective of language supervision.
We propose a tunable-free pre-training strategy to retain the generalization ability of the text encoder.
We produce a series of models, dubbed TVTSv2, with up to one billion parameters.
arXiv Detail & Related papers (2023-05-23T15:44:56Z)
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z)
- Video-Text Retrieval by Supervised Sparse Multi-Grained Learning [22.17732989393653]
We present a novel multi-grained sparse learning framework, S3MA, to learn a sparse space shared between video and text for video-text retrieval.
With the text data at hand, we learn and update the shared sparse space in a supervised manner using the proposed similarity and alignment losses.
Benefiting from the learned shared sparse space and multi-grained similarities, experiments on several video-text retrieval benchmarks demonstrate the superiority of S3MA over existing methods.
arXiv Detail & Related papers (2023-02-19T04:03:22Z)
- Masked Contrastive Pre-Training for Efficient Video-Text Retrieval [37.05164804180039]
We present a simple yet effective end-to-end Video-language Pre-training (VidLP) framework, Masked Contrastive Video-language Pretraining (MAC).
MAC aims to reduce the spatial and temporal redundancy of the video representation in the VidLP model.
Coupling these designs enables efficient end-to-end pre-training: FLOPs are reduced by 60%, pre-training is accelerated by 3x, and performance improves.
arXiv Detail & Related papers (2022-12-02T05:44:23Z)
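The MAC summary above centers on cutting the spatial and temporal redundancy of the video input for efficient end-to-end pre-training. The snippet below is a minimal, hypothetical sketch of random patch masking in that spirit; the mask ratio, tensor shapes, and function name are assumptions, not MAC's actual design.

```python
# Hypothetical sketch of masking video patches to cut spatial/temporal redundancy
# before contrastive alignment. Mask ratio, shapes, and sampling are illustrative.
import torch


def mask_video_patches(patches: torch.Tensor, mask_ratio: float = 0.6):
    """Randomly keep a subset of patch tokens.

    patches: (B, N, D) flattened spatio-temporal patch embeddings
    Returns the kept patches (B, n_keep, D) and the kept indices, so later
    stages only pay compute for the retained tokens.
    """
    B, N, D = patches.shape
    n_keep = max(1, int(N * (1.0 - mask_ratio)))
    noise = torch.rand(B, N, device=patches.device)           # random score per patch
    keep_idx = noise.argsort(dim=1)[:, :n_keep]                # lowest-noise patches kept
    keep_idx_exp = keep_idx.unsqueeze(-1).expand(-1, -1, D)
    kept = torch.gather(patches, dim=1, index=keep_idx_exp)    # (B, n_keep, D)
    return kept, keep_idx
```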
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.