Support-set bottlenecks for video-text representation learning
- URL: http://arxiv.org/abs/2010.02824v2
- Date: Thu, 14 Jan 2021 10:34:56 GMT
- Title: Support-set bottlenecks for video-text representation learning
- Authors: Mandela Patrick, Po-Yao Huang, Yuki Asano, Florian Metze, Alexander
Hauptmann, João Henriques, Andrea Vedaldi
- Abstract summary: The dominant paradigm for learning video-text representations -- noise contrastive learning -- is too strict.
We propose a novel method that alleviates this by leveraging a generative model to naturally push these related samples together.
Our proposed method outperforms others by a large margin on MSR-VTT, VATEX, ActivityNet, and MSVD for video-to-text and text-to-video retrieval.
- Score: 131.4161071785107
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The dominant paradigm for learning video-text representations -- noise
contrastive learning -- increases the similarity of the representations of
pairs of samples that are known to be related, such as text and video from the
same sample, and pushes away the representations of all other pairs. We posit
that this last behaviour is too strict, enforcing dissimilar representations
even for samples that are semantically-related -- for example, visually similar
videos or ones that share the same depicted action. In this paper, we propose a
novel method that alleviates this by leveraging a generative model to naturally
push these related samples together: each sample's caption must be
reconstructed as a weighted combination of other support samples' visual
representations. This simple idea ensures that representations are not
overly specialized to individual samples, are reusable across the dataset, and
explicitly encode semantics shared between samples, unlike noise contrastive
learning. Our proposed method outperforms others by a large margin on MSR-VTT,
VATEX, ActivityNet, and MSVD for video-to-text and text-to-video retrieval.
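A minimal sketch of the two ingredients described in the abstract, written as PyTorch-style Python: a standard noise contrastive (InfoNCE) term, and the support-set step in which each caption forms a weighted combination of support samples' visual representations that a generative caption decoder would then reconstruct the caption from. The function names, shapes, dot-product attention, and temperatures are illustrative assumptions, not the authors' released implementation; the caption decoder itself is omitted.

```python
import torch
import torch.nn.functional as F

def nce_loss(text_emb, video_emb, temperature=0.07):
    # Symmetric InfoNCE: matched text/video pairs (the diagonal) are pulled
    # together, while every other pair in the batch is pushed apart -- the
    # behaviour the paper argues is too strict for semantically related samples.
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = t @ v.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(t.size(0), device=t.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def support_set_feature(text_emb, support_video_emb, temperature=1.0):
    # Weighted combination over the support set: each caption attends over other
    # samples' visual representations, and the pooled feature is what a caption
    # decoder would have to reconstruct the caption from. Because the caption must
    # be explained by reused, shared visual features, representations cannot
    # over-specialize to individual samples.
    scores = text_emb @ support_video_emb.t() / temperature   # (B, S) attention scores
    weights = scores.softmax(dim=-1)                           # (B, S) mixture weights
    return weights @ support_video_emb                         # (B, D) pooled visual feature

# Toy usage with random features (batch of 8, support set of 32, dim 256).
if __name__ == "__main__":
    text, video = torch.randn(8, 256), torch.randn(8, 256)
    support = torch.randn(32, 256)
    loss = nce_loss(text, video)
    pooled = support_set_feature(text, support)   # would be fed to a caption decoder
    print(loss.item(), pooled.shape)
```

In the paper's framing, a captioning (reconstruction) loss computed from this pooled feature is combined with the contrastive term; the decoder architecture and the exact composition of the support set are not shown here.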
Related papers
- Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial
Margin Contrastive Learning [35.404100473539195]
Text-video retrieval aims to rank relevant text/video higher than irrelevant ones.
Recent contrastive learning methods have shown promising results for text-video retrieval.
This paper improves contrastive learning using two novel techniques.
arXiv Detail & Related papers (2023-09-20T06:08:11Z) - UATVR: Uncertainty-Adaptive Text-Video Retrieval [90.8952122146241]
A common practice is to transfer text-video pairs to the same embedding space and craft cross-modal interactions with certain entities.
We propose an Uncertainty-Adaptive Text-Video Retrieval approach, termed UATVR, which models each look-up as a distribution matching procedure.
arXiv Detail & Related papers (2023-01-16T08:43:17Z) - Partitioning Image Representation in Contrastive Learning [0.0]
We introduce a new representation, partitioned representation, which can learn both common and unique features of the anchor and positive samples in contrastive learning.
We show that our approach can separate the two types of information in the VAE framework and outperforms the conventional BYOL on linear separability and few-shot learning downstream tasks.
arXiv Detail & Related papers (2022-03-20T04:55:39Z) - CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z) - Composable Augmentation Encoding for Video Representation Learning [94.2358972764708]
We focus on contrastive methods for self-supervised video representation learning.
A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives.
We propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations.
We show that our method encodes valuable information about the specified spatial or temporal augmentations, and in doing so also achieves state-of-the-art performance on a number of video benchmarks.
arXiv Detail & Related papers (2021-04-01T16:48:53Z) - An Unsupervised Sampling Approach for Image-Sentence Matching Using
Document-Level Structural Information [64.66785523187845]
We focus on the problem of unsupervised image-sentence matching.
Existing research explores utilizing document-level structural information to sample positive and negative instances for model training.
We propose a new sampling strategy to select additional intra-document image-sentence pairs as positive or negative samples.
arXiv Detail & Related papers (2021-03-21T05:43:29Z) - Active Contrastive Learning of Audio-Visual Video Representations [35.59750167222663]
We propose an active contrastive learning approach that builds an actively sampled dictionary with diverse and informative items.
Our model achieves state-of-the-art performance on challenging audio and visual downstream benchmarks.
arXiv Detail & Related papers (2020-08-31T21:18:30Z) - Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)