Learn to Understand Negation in Video Retrieval
- URL: http://arxiv.org/abs/2205.00132v1
- Date: Sat, 30 Apr 2022 02:22:18 GMT
- Title: Learn to Understand Negation in Video Retrieval
- Authors: Ziyue Wang, Aozhu Chen, Fan Hu and Xirong Li
- Abstract summary: Negation is a common linguistic skill that allows humans to express what we do NOT want.
Deep learning based video retrieval models are typically trained on video description datasets that lack negated descriptions.
We present the first study on learning to understand negation in video retrieval.
- Score: 9.929121517850204
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Negation is a common linguistic skill that allows humans to express what we do
NOT want. Naturally, one might expect video retrieval to support
natural-language queries with negation, e.g., finding shots of kids sitting on
the floor and not playing with the dog. However, state-of-the-art deep
learning based video retrieval models lack such ability, as they are typically
trained on video description datasets such as MSR-VTT and VATEX that lack
negated descriptions. Their retrieved results largely ignore the negator in
the sample query, incorrectly returning videos showing kids playing with the
dog. In this paper, we present the first study on learning to understand
negation in video retrieval and make contributions as follows. First, by
re-purposing two existing datasets, i.e., MSR-VTT and VATEX, we propose a new
evaluation protocol for testing video retrieval with negation. Second, we
propose a learning based method for training a negation-aware video retrieval
model. The key idea is to first construct a soft negative caption for a
specific training video by partially negating its original caption, and then
compute a bidirectionally constrained loss on the triplet. This auxiliary loss
is then added, with a weight, to a standard retrieval loss. Experiments on the
re-purposed benchmarks show that re-training the CLIP (Contrastive
Language-Image Pre-Training) model with the proposed method clearly improves
its ability to handle queries with negation. In addition, its performance on
the original benchmarks is also improved. Data and source code will be released.
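As a rough illustration of the training recipe described above, the sketch below shows one plausible way to combine a standard bidirectional retrieval loss with a weighted auxiliary term over the (video, original caption, soft-negative caption) triplet. The negation heuristic, the margin-based loss form, and the names (soft_negative_caption, negation_aware_loss, margin, alpha) are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

def soft_negative_caption(caption: str) -> str:
    """Hypothetical heuristic: partially negate a caption by inserting 'not'
    before its last '-ing' token, e.g. 'kids sitting on the floor and playing
    with the dog' -> 'kids sitting on the floor and not playing with the dog'.
    The paper's actual construction rules may differ."""
    words = caption.split()
    for i in range(len(words) - 1, -1, -1):        # naive verb heuristic
        if words[i].endswith("ing"):
            return " ".join(words[:i] + ["not"] + words[i:])
    return "not " + caption                        # fallback: negate everything

def negation_aware_loss(v, c_pos, c_neg, margin=0.2, alpha=0.5):
    """Minimal sketch of a negation-aware objective (assumed form).
    v, c_pos, c_neg: (B, D) L2-normalised embeddings of the videos, their
    original captions, and their soft-negative captions, respectively."""
    B = v.size(0)
    eye = torch.eye(B, dtype=torch.bool, device=v.device)
    sim = v @ c_pos.t()                            # (B, B) video-caption similarities
    pos = sim.diag().unsqueeze(1)                  # matched-pair similarities
    # standard bidirectional max-margin retrieval loss over in-batch negatives
    loss_ret = (F.relu(margin + sim - pos).masked_fill(eye, 0.0).mean()
                + F.relu(margin + sim.t() - pos).masked_fill(eye, 0.0).mean())
    # auxiliary, bidirectionally constrained term on the triplet: the negated
    # caption should score below the original caption yet above captions of
    # other videos in the batch
    s_neg = (v * c_neg).sum(dim=1, keepdim=True)   # (B, 1)
    hardest_other = sim.masked_fill(eye, float("-inf")).max(dim=1, keepdim=True).values
    loss_aux = (F.relu(margin + s_neg - pos)
                + F.relu(margin + hardest_other - s_neg)).mean()
    return loss_ret + alpha * loss_aux             # alpha weights the auxiliary loss
```

In practice the embeddings v, c_pos, and c_neg would come from a video-text encoder such as CLIP, which the abstract reports re-training with the combined objective.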
Related papers
- Vision-Language Models Do Not Understand Negation [50.27667000027403]
NegBench is a benchmark designed to evaluate negation understanding across 18 task variations and 79k examples spanning image, video, and medical datasets.
We show that this approach can result in a 10% increase in recall on negated queries and a 40% boost in accuracy on multiple-choice questions with negated captions.
arXiv Detail & Related papers (2025-01-16T09:55:42Z) - Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning [35.404100473539195]
Text-video retrieval aims to rank relevant text/video higher than irrelevant ones.
Recent contrastive learning methods have shown promising results for text-video retrieval; a common symmetric contrastive formulation is sketched after this list.
This paper improves contrastive learning using two novel techniques.
arXiv Detail & Related papers (2023-09-20T06:08:11Z) - ICSVR: Investigating Compositional and Syntactic Understanding in Video Retrieval Models [6.073813559982129]
Video retrieval involves retrieving the ground truth video from the video database given a text caption or vice-versa.
We evaluate the compositional and syntactic understanding of video retrieval models on standard benchmarks such as MSRVTT, MSVD and DIDEMO.
Our experiments reveal that actions and syntax play a minor role compared to objects & attributes in video understanding.
arXiv Detail & Related papers (2023-06-28T20:06:36Z) - Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks [6.540440003084223]
Video captioning datasets have been re-purposed to evaluate models.
Many alternate videos also match the caption, which introduces false-negative caption-video pairs.
We show that when these false negatives are corrected, a recent state-of-the-art model gains 25% recall points.
arXiv Detail & Related papers (2022-10-10T22:45:06Z) - Learning Transferable Spatiotemporal Representations from Natural Script Knowledge [65.40899722211726]
We introduce a new pretext task, Turning to Video Transcript for ASR (TVTS), which sorts scripts by attending to learned video representations.
This enables our model to contextualize what is happening, much as humans do, and to be applied seamlessly to large-scale uncurated video data in the real world.
arXiv Detail & Related papers (2022-09-30T07:39:48Z) - TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment [68.08689660963468]
A new algorithm called Token-Aware Cascade contrastive learning (TACo) improves contrastive learning using two novel techniques.
We set new state-of-the-art on three public text-video retrieval benchmarks of YouCook2, MSR-VTT and ActivityNet.
arXiv Detail & Related papers (2021-08-23T07:24:57Z) - Boosting Video Captioning with Dynamic Loss Network [0.0]
This paper addresses the drawback by introducing a dynamic loss network (DLN).
Our results on Microsoft Research Video Description Corpus (MSVD) and MSR-Video to Text (MSRVTT) datasets outperform previous methods.
arXiv Detail & Related papers (2021-07-25T01:32:02Z) - CUPID: Adaptive Curation of Pre-training Data for Video-and-Language Representation Learning [49.18591896085498]
We propose CUPID to bridge the domain gap between source and target data.
CUPID yields new state-of-the-art performance across multiple video-language and video tasks.
arXiv Detail & Related papers (2021-04-01T06:42:16Z) - Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z) - Learning to Discretely Compose Reasoning Module Networks for Video Captioning [81.81394228898591]
We propose a novel visual reasoning approach for video captioning, named Reasoning Module Networks (RMN).
RMN employs 1) three sophisticated spatio-temporal reasoning modules, and 2) a dynamic and discrete module selector trained by a linguistic loss with a Gumbel approximation.
arXiv Detail & Related papers (2020-07-17T15:27:37Z) - Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)
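Several of the entries above build on contrastive text-video training. For background only, the following is a minimal sketch of the common symmetric InfoNCE objective referenced in the Dual-Modal entry; it is a generic baseline form under assumed names (bidirectional_infonce, temperature), not the specific loss of any paper listed.

```python
import torch
import torch.nn.functional as F

def bidirectional_infonce(video_emb, text_emb, temperature=0.07):
    """Generic symmetric contrastive (InfoNCE) loss for text-video retrieval.
    video_emb, text_emb: (B, D) L2-normalised embeddings of matched pairs."""
    logits = video_emb @ text_emb.t() / temperature     # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)         # retrieve the right caption per video
    loss_t2v = F.cross_entropy(logits.t(), targets)     # retrieve the right video per caption
    return 0.5 * (loss_v2t + loss_t2v)
```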
This list is automatically generated from the titles and abstracts of the papers on this site.