Deconfounded Video Moment Retrieval with Causal Intervention
- URL: http://arxiv.org/abs/2106.01534v1
- Date: Thu, 3 Jun 2021 01:33:26 GMT
- Title: Deconfounded Video Moment Retrieval with Causal Intervention
- Authors: Xun Yang, Fuli Feng, Wei Ji, Meng Wang, Tat-Seng Chua
- Abstract summary: We tackle the task of video moment retrieval (VMR), which aims to localize a specific moment in a video according to a textual query.
Existing methods primarily model the matching relationship between query and moment by complex cross-modal interactions.
We propose a causality-inspired VMR framework that builds a structural causal model to capture the true effect of query and video content on the prediction.
- Score: 80.90604360072831
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We tackle the task of video moment retrieval (VMR), which aims to localize a
specific moment in a video according to a textual query. Existing methods
primarily model the matching relationship between query and moment by complex
cross-modal interactions. Despite their effectiveness, current models mostly
exploit dataset biases while ignoring the video content, thus leading to poor
generalizability. We argue that the issue is caused by the hidden confounder in
VMR, i.e., the temporal location of moments, that spuriously correlates the model
input and prediction. How to design robust matching models against the temporal
location biases is crucial but, as far as we know, has not been studied yet for
VMR.
To fill the research gap, we propose a causality-inspired VMR framework that
builds a structural causal model to capture the true effect of query and video
content on the prediction. Specifically, we develop a Deconfounded Cross-modal
Matching (DCM) method to remove the confounding effects of moment location. It
first disentangles moment representation to infer the core feature of visual
content, and then applies causal intervention on the disentangled multimodal
input based on backdoor adjustment, which forces the model to fairly
incorporate each possible location of the target into consideration. Extensive
experiments clearly show that our approach can achieve significant improvement
over the state-of-the-art methods in terms of both accuracy and generalization
(Codes: https://github.com/Xun-Yang/Causal_Video_Moment_Retrieval).
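The following is a minimal, hedged sketch of the backdoor-adjustment idea described in the abstract, not the authors' released code (see the repository above for that): the confounder, i.e., the temporal location of the target moment, is stratified into discrete bins, and location-conditioned matching scores are averaged under a prior over locations, approximating P(Y | do(Q, V)) = sum_l P(Y | Q, V, l) P(l). All names here (match_score, deconfounded_score, NUM_LOCATION_BINS) and the toy scoring function are illustrative assumptions, not the DCM implementation.

# Hedged sketch of backdoor adjustment for deconfounded cross-modal matching.
import numpy as np

NUM_LOCATION_BINS = 10  # illustrative discretization of normalized moment locations

def match_score(query_feat, moment_feat, location_bin):
    # Placeholder for a location-conditioned matcher P(Y | Q, V, l).
    # A real model would fuse a location embedding with the disentangled
    # visual-content feature; a toy sigmoid score stands in for it here.
    loc_embedding = np.eye(NUM_LOCATION_BINS)[location_bin]
    fused = np.concatenate([query_feat, moment_feat, loc_embedding])
    return 1.0 / (1.0 + np.exp(-fused.mean()))

def deconfounded_score(query_feat, moment_feat, location_prior=None):
    # Backdoor adjustment: P(Y | do(Q, V)) ~= sum_l P(Y | Q, V, l) * P(l).
    # A uniform prior forces every candidate location to be weighed fairly
    # instead of letting the model latch onto frequent training locations.
    if location_prior is None:
        location_prior = np.full(NUM_LOCATION_BINS, 1.0 / NUM_LOCATION_BINS)
    scores = [match_score(query_feat, moment_feat, l) for l in range(NUM_LOCATION_BINS)]
    return float(np.dot(scores, location_prior))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    query, moment = rng.normal(size=128), rng.normal(size=128)
    print(deconfounded_score(query, moment))

In the paper, the location prior and the location-conditioned predictions come from the learned model rather than a fixed toy scorer; the sketch only conveys the stratify-and-average structure of the intervention.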
Related papers
- Does SpatioTemporal information benefit Two video summarization benchmarks? [2.8558008379151882]
We ask if similar spurious relationships might influence the task of video summarization.
We first estimate a baseline with temporally invariant models to see how well such models rank on benchmark datasets.
We then disrupt the temporal order of the videos to investigate the impact it has on existing state-of-the-art models.
arXiv Detail & Related papers (2024-10-04T11:20:04Z)
- Towards Debiasing Frame Length Bias in Text-Video Retrieval via Causal Intervention [72.12974259966592]
We present a unique and systematic study of a temporal bias due to frame length discrepancy between training and test sets of trimmed video clips.
We propose a causal debiasing approach and perform extensive experiments and ablation studies on the Epic-Kitchens-100, YouCook2, and MSR-VTT datasets.
arXiv Detail & Related papers (2023-09-17T15:58:27Z)
- Counterfactual Cross-modality Reasoning for Weakly Supervised Video Moment Localization [67.88493779080882]
Video moment localization aims to retrieve the target segment of an untrimmed video according to the natural language query.
Recent works contrast the cross-modality similarities driven by reconstructing masked queries.
We propose a novel counterfactual cross-modality reasoning method.
arXiv Detail & Related papers (2023-08-10T15:45:45Z)
- Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has received increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos via cross-modal queries.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z)
- MomentDiff: Generative Video Moment Retrieval from Random to Real [71.40038773943638]
We provide a generative diffusion-based framework called MomentDiff.
MomentDiff simulates a typical human retrieval process from random browsing to gradual localization.
We show that MomentDiff consistently outperforms state-of-the-art methods on three public benchmarks.
arXiv Detail & Related papers (2023-07-06T09:12:13Z)
- Multi-Contextual Predictions with Vision Transformer for Video Anomaly Detection [22.098399083491937]
Understanding the temporal context of a video plays a vital role in anomaly detection.
We design a transformer model with three different contextual prediction streams: masked, whole and partial.
By learning to predict the missing frames of consecutive normal frames, our model can effectively learn various normality patterns in the video.
arXiv Detail & Related papers (2022-06-17T05:54:31Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflated evaluation caused by biased datasets (a hedged sketch of this discounting idea appears after this list).
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
- Learning Sample Importance for Cross-Scenario Video Temporal Grounding [30.82619216537177]
The paper investigates some superficial biases specific to the temporal grounding task.
We propose a novel method called Debiased Temporal Language Localizer (DebiasTLL) to prevent the model from naively memorizing the biases.
We evaluate the proposed model in cross-scenario temporal grounding, where the train / test data are heterogeneously sourced.
arXiv Detail & Related papers (2022-01-08T15:41:38Z)
- Interventional Video Grounding with Dual Contrastive Learning [16.0734337895897]
Video grounding aims to localize a moment from an untrimmed video for a given textual query.
We propose a novel paradigm from the perspective of causal inference to uncover the causality behind the model and data.
We also introduce a dual contrastive learning approach to better align the text and video.
arXiv Detail & Related papers (2021-06-21T12:11:28Z)
- Frame-wise Cross-modal Matching for Video Moment Retrieval [32.68921139236391]
Video moment retrieval aims at retrieving a moment in a video for a given language query.
The challenges of this task include 1) the requirement of localizing the relevant moment in an untrimmed video, and 2) bridging the semantic gap between textual query and video contents.
We propose an Attentive Cross-modal Relevance Matching model which predicts the temporal boundaries based on interaction modeling.
arXiv Detail & Related papers (2020-09-22T10:25:41Z)
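Below is a rough, hedged sketch of the discounted-recall idea behind the "dR@n,IoU@m" metric mentioned in the "A Closer Look at Debiased Temporal Sentence Grounding in Videos" entry above. It assumes the discount is a product of two factors measuring how far the predicted start and end boundaries drift from the ground truth, normalized by video duration; the exact definition should be taken from that paper, and the function names here are illustrative only.

# Hedged sketch of a discounted recall metric in the spirit of "dR@n,IoU@m" (n=1).
# Assumption: recall hits are down-weighted by the normalized distance between
# predicted and ground-truth start/end boundaries.
def temporal_iou(pred, gt):
    # Intersection over union of two (start, end) intervals in seconds.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def discounted_recall_at_1(predictions, ground_truths, durations, iou_thresh=0.5):
    # predictions / ground_truths: lists of (start, end) pairs, one prediction per query.
    total = 0.0
    for (ps, pe), (gs, ge), dur in zip(predictions, ground_truths, durations):
        hit = 1.0 if temporal_iou((ps, pe), (gs, ge)) >= iou_thresh else 0.0
        alpha_s = max(1.0 - abs(ps - gs) / dur, 0.0)  # start-boundary discount (assumed form)
        alpha_e = max(1.0 - abs(pe - ge) / dur, 0.0)  # end-boundary discount (assumed form)
        total += hit * alpha_s * alpha_e
    return total / len(predictions)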