Normalized Contrastive Learning for Text-Video Retrieval
- URL: http://arxiv.org/abs/2212.11790v1
- Date: Wed, 30 Nov 2022 19:20:29 GMT
- Title: Normalized Contrastive Learning for Text-Video Retrieval
- Authors: Yookoon Park, Mahmoud Azab, Bo Xiong, Seungwhan Moon, Florian Metze,
Gourab Kundu, Kirmani Ahmed
- Abstract summary: Cross-modal contrastive learning has led the recent advances in multimodal retrieval with its simplicity and effectiveness.
We show that cross-modal contrastive learning suffers from incorrect normalization of the sum retrieval probabilities of each text or video instance.
We propose Normalized Contrastive Learning which computes the instance-wise biases that properly normalize the sum retrieval probabilities of each instance.
- Score: 40.56493140306364
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cross-modal contrastive learning has led the recent advances in multimodal
retrieval with its simplicity and effectiveness. In this work, however, we
reveal that cross-modal contrastive learning suffers from incorrect
normalization of the sum retrieval probabilities of each text or video
instance. Specifically, we show that many test instances are either over- or
under-represented during retrieval, significantly hurting the retrieval
performance. To address this problem, we propose Normalized Contrastive
Learning (NCL) which utilizes the Sinkhorn-Knopp algorithm to compute the
instance-wise biases that properly normalize the sum retrieval probabilities of
each instance so that every text and video instance is fairly represented
during cross-modal retrieval. Empirical study shows that NCL brings consistent
and significant gains in text-video retrieval on different model architectures,
with new state-of-the-art multimodal retrieval metrics on the ActivityNet,
MSVD, and MSR-VTT datasets without any architecture engineering.
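To make the normalization idea concrete, below is a minimal NumPy sketch of how instance-wise biases could be derived with the Sinkhorn-Knopp algorithm from a text-video similarity matrix. The function and parameter names (`sinkhorn_biases`, `tau`, `n_iters`) are illustrative, the target marginals are assumed uniform, and the paper's exact formulation and training integration may differ.

```python
import numpy as np

def sinkhorn_biases(sim, tau=0.05, n_iters=50):
    """Illustrative Sinkhorn-Knopp sketch (not the paper's exact code).

    sim: (N, N) matrix of text-video similarities.
    Returns additive biases (a for texts, b for videos) such that
    exp((sim + a[:, None] + b[None, :]) / tau) is approximately
    doubly stochastic, i.e. each text and each video receives the
    same total retrieval probability mass.
    """
    K = np.exp(sim / tau)        # unnormalized retrieval kernel
    u = np.ones(K.shape[0])      # row (text) scaling factors
    v = np.ones(K.shape[1])      # column (video) scaling factors
    for _ in range(n_iters):     # alternate row / column normalization
        u = 1.0 / (K @ v)
        v = 1.0 / (K.T @ u)
    # fold the multiplicative scalings back into additive logit biases
    return tau * np.log(u), tau * np.log(v)
```

Adding such biases to the similarity logits would re-balance instances that are otherwise over- or under-retrieved, without any change to the model architecture.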
Related papers
- Generalized Contrastive Learning for Universal Multimodal Retrieval [53.70202081784898]
Cross-modal retrieval models (e.g., CLIP) show degraded performance when retrieving keys composed of a fused image-text modality. This paper proposes Generalized Contrastive Learning (GCL), a novel loss formulation that improves multimodal retrieval performance without the need for new dataset curation.
arXiv Detail & Related papers (2025-09-30T01:25:04Z) - IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval [29.05476868272228]
Instance-Driven Multimodal Image Retrieval (IDMR) is a novel task that requires models to retrieve images containing the same instance as a query image while matching a text-described scenario.
To benchmark this capability, we develop IDMR-bench using real-world object tracking and first-person video data.
Our Multimodal Large Language Model (MLLM) based retrieval model, trained on 1.2M samples, outperforms state-of-the-art approaches on both traditional benchmarks and our zero-shot IDMR-bench.
arXiv Detail & Related papers (2025-04-01T16:47:20Z) - Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs [78.5013630951288]
This paper introduces techniques for advancing information retrieval with multimodal large language models (MLLMs).
We first study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16 retrieval tasks.
We propose modality-aware hard negative mining to mitigate the modality bias exhibited by MLLM retrievers.
arXiv Detail & Related papers (2024-11-04T20:06:34Z) - The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval [36.516226519328015]
Video-language tasks necessitate spatial and temporal comprehension and require significant compute.
This work demonstrates the surprising effectiveness of leveraging image-text pretrained MLLMs for moment retrieval.
We achieve a new state-of-the-art in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions.
arXiv Detail & Related papers (2024-06-26T06:59:09Z) - Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models [58.17315970207874]
We propose a zero-shot method for adapting generalisable visual-textual priors from arbitrary VLMs to facilitate moment-text alignment.
Experiments conducted on three VMR benchmark datasets demonstrate the notable performance advantages of our zero-shot algorithm.
arXiv Detail & Related papers (2023-09-01T13:06:50Z) - UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval
and Highlight Detection [46.25856560381347]
We present the first unified framework, named Unified Multi-modal Transformers (UMT), for joint moment retrieval and highlight detection.
UMT realizes such joint optimization and can also be easily degenerated to solve either problem individually.
As far as we are aware, this is the first scheme to integrate multi-modal (visual-audio) learning for either joint optimization or the individual moment retrieval task.
arXiv Detail & Related papers (2022-03-23T22:11:43Z) - Single-Modal Entropy based Active Learning for Visual Question Answering [75.1682163844354]
We address Active Learning in the multi-modal setting of Visual Question Answering (VQA).
In light of the multi-modal inputs, image and question, we propose a novel method for effective sample acquisition.
Our novel idea is simple to implement, cost-efficient, and readily adaptable to other multi-modal tasks.
arXiv Detail & Related papers (2021-10-21T05:38:45Z) - See, Hear, Read: Leveraging Multimodality with Guided Attention for
Abstractive Text Summarization [14.881597737762316]
We introduce the first large-scale dataset for abstractive text summarization with videos of diverse duration, compiled from presentations in well-known academic conferences like NDSS, ICML, NeurIPS, etc.
We then propose a factorized multi-modal Transformer-based decoder-only language model, which inherently captures the intra-modal and inter-modal dynamics within the various input modalities for the text summarization task.
arXiv Detail & Related papers (2021-05-20T08:56:33Z) - TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval [103.85002875155551]
We propose a novel generalized distillation method, TeachText, for exploiting large-scale language pretraining.
We extend our method to video side modalities and show that we can effectively reduce the number of used modalities at test time.
Our approach advances the state of the art on several video retrieval benchmarks by a significant margin and adds no computational overhead at test time.
arXiv Detail & Related papers (2021-04-16T17:55:28Z) - Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state of the art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.