TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment
- URL: http://arxiv.org/abs/2108.09980v1
- Date: Mon, 23 Aug 2021 07:24:57 GMT
- Title: TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment
- Authors: Jianwei Yang, Yonatan Bisk, Jianfeng Gao
- Abstract summary: A new algorithm called Token-Aware Cascade contrastive learning (TACo) improves contrastive learning using two novel techniques.
We set new state-of-the-art on three public text-video retrieval benchmarks of YouCook2, MSR-VTT and ActivityNet.
- Score: 68.08689660963468
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive learning has been widely used to train transformer-based
vision-language models for video-text alignment and multi-modal representation
learning. This paper presents a new algorithm called Token-Aware Cascade
contrastive learning (TACo) that improves contrastive learning using two novel
techniques. The first is the token-aware contrastive loss which is computed by
taking into account the syntactic classes of words. This is motivated by the
observation that for a video-text pair, the content words in the text, such as
nouns and verbs, are more likely to be aligned with the visual contents in the
video than the function words. Second, a cascade sampling method is applied to
generate a small set of hard negative examples for efficient loss estimation
for multi-modal fusion layers. To validate the effectiveness of TACo, in our
experiments we finetune pretrained models for a set of downstream tasks
including text-video retrieval (YouCook2, MSR-VTT and ActivityNet), video
action step localization (CrossTask), and video action segmentation (COIN). The
results show that our models attain consistent improvements across different
experimental settings over previous methods, setting new state-of-the-art on
three public text-video retrieval benchmarks of YouCook2, MSR-VTT and
ActivityNet.
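
The abstract describes the two techniques only at a high level. As a rough illustration of the first idea (up-weighting content words such as nouns and verbs in the contrastive loss) and the second (using cheap similarity scores to keep only a few hard negatives for the expensive multi-modal fusion layers), here is a minimal PyTorch-style sketch. All names (`content_mask`, `token_aware_contrastive_loss`, `cascade_hard_negatives`, the default temperature, etc.) are hypothetical and not taken from the paper or its released code; this is a sketch of the general technique, not the authors' implementation.

```python
# Hypothetical sketch of the two ideas described in the abstract:
# (1) a token-aware contrastive loss that weights content words (nouns/verbs),
# (2) cascade sampling: cheap dot-product scores select a few hard negatives
#     before the expensive multi-modal fusion scorer is run.
import torch
import torch.nn.functional as F

def token_aware_contrastive_loss(video_emb, word_emb, content_mask, temperature=0.07):
    """
    video_emb:    (B, D)    pooled video features, one per clip
    word_emb:     (B, T, D) per-token text features
    content_mask: (B, T)    1.0 for content words (nouns/verbs), 0.0 for function words
    """
    video_emb = F.normalize(video_emb, dim=-1)
    word_emb = F.normalize(word_emb, dim=-1)

    # Token-to-video similarities for every (text_i, video_j) pair: (B, T, B)
    sim = torch.einsum('itd,jd->itj', word_emb, video_emb) / temperature

    # Average token-level similarities, keeping only content words.
    weights = content_mask / content_mask.sum(dim=1, keepdim=True).clamp(min=1.0)
    pair_sim = (sim * weights.unsqueeze(-1)).sum(dim=1)  # (B, B)

    targets = torch.arange(pair_sim.size(0), device=pair_sim.device)
    # Symmetric InfoNCE over the token-aware pair scores.
    return 0.5 * (F.cross_entropy(pair_sim, targets) +
                  F.cross_entropy(pair_sim.t(), targets))

def cascade_hard_negatives(video_emb, text_emb, k=4):
    """Pick the k hardest negative videos per text using cheap dot-product scores."""
    scores = F.normalize(text_emb, dim=-1) @ F.normalize(video_emb, dim=-1).t()  # (B, B)
    scores.fill_diagonal_(float('-inf'))   # exclude the positive pair
    return scores.topk(k, dim=1).indices   # (B, k) indices into the batch
```

The point of the second function is only the selection step: the returned indices would let the fusion layers score each text against its positive video plus k hard negatives, rather than against the full batch.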
Related papers
- Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning [35.404100473539195]
Text-video retrieval aims to rank relevant text/video higher than irrelevant ones.
Recent contrastive learning methods have shown promising results for text-video retrieval.
This paper improves contrastive learning using two novel techniques.
arXiv Detail & Related papers (2023-09-20T06:08:11Z)
- Contrastive Graph Multimodal Model for Text Classification in Videos [9.218562155255233]
We are the first to address this new task of video text classification by fusing multimodal information.
We tailor a specific module called CorrelationNet to reinforce feature representation by explicitly extracting layout information.
We construct a new well-defined industrial dataset from the news domain, called TI-News, which is dedicated to building and evaluating video text recognition and classification applications.
arXiv Detail & Related papers (2022-06-06T04:06:21Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is not a clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods and comparable performance to the latest single-stream methods while being 10,800X faster in inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding new state-of-the-art on the widely-used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- CLIP4Caption: CLIP for Video Caption [9.470254059503862]
We propose a CLIP4Caption framework that improves video captioning based on a CLIP-enhanced video-text matching network (VTM).
This framework takes full advantage of the information from both vision and language and forces the model to learn strongly text-correlated video features for text generation.
arXiv Detail & Related papers (2021-10-13T10:17:06Z)
- Video-aided Unsupervised Grammar Induction [108.53765268059425]
We investigate video-aided grammar induction, which learns a constituency parser from both unlabeled text and its corresponding video.
Video provides even richer information, including not only static objects but also actions and state changes useful for inducing verb phrases.
We propose a Multi-Modal Compound PCFG model (MMC-PCFG) to effectively aggregate these rich features from different modalities.
arXiv Detail & Related papers (2021-04-09T14:01:36Z)
- Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.