Interventional Video Grounding with Dual Contrastive Learning
- URL: http://arxiv.org/abs/2106.11013v1
- Date: Mon, 21 Jun 2021 12:11:28 GMT
- Title: Interventional Video Grounding with Dual Contrastive Learning
- Authors: Guoshun Nan, Rui Qiao, Yao Xiao, Jun Liu, Sicong Leng, Hao Zhang, Wei Lu
- Abstract summary: Video grounding aims to localize a moment from an untrimmed video for a given textual query.
We propose a novel paradigm from the perspective of causal inference to uncover the causality behind the model and data.
We also introduce a dual contrastive learning approach to better align the text and video.
- Score: 16.0734337895897
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video grounding aims to localize a moment from an untrimmed video for a given
textual query. Existing approaches focus more on the alignment of visual and
language stimuli with various likelihood-based matching or regression
strategies, i.e., P(Y|X). Consequently, these models may suffer from spurious
correlations between the language and video features due to the selection bias
of the dataset. 1) To uncover the causality behind the model and data, we first
propose a novel paradigm from the perspective of causal inference, i.e.,
interventional video grounding (IVG), which leverages backdoor adjustment to
deconfound the selection bias based on a structural causal model (SCM) and
do-calculus P(Y|do(X)). Then, we present a simple yet effective method to
approximate the unobserved confounder as it cannot be directly sampled from the
dataset. 2) Meanwhile, we introduce a dual contrastive learning approach (DCL)
to better align the text and video by maximizing the mutual information (MI)
between query and video clips, and the MI between start/end frames of a target
moment and the others within a video to learn more informative visual
representations. Experiments on three standard benchmarks show the
effectiveness of our approaches.
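For readers parsing the notation above, a brief illustration of the backdoor adjustment invoked by IVG; the identity below is the standard do-calculus formula with a generic confounder Z (standing in for the selection bias), not necessarily the paper's exact formulation:

```latex
% Backdoor adjustment: intervening on the multimodal input X blocks the
% confounder Z (e.g., dataset selection bias) by marginalizing it out.
P(Y \mid \mathrm{do}(X)) = \sum_{z} P(Y \mid X, Z=z)\, P(Z=z)
% Contrast with the confounded likelihood criticized above:
% P(Y \mid X) = \sum_{z} P(Y \mid X, Z=z)\, P(Z=z \mid X)
```

Likewise, a minimal sketch of the InfoNCE-style mutual-information lower bound that dual contrastive learning objectives of this kind maximize; the function name, the use of in-batch negatives, and the temperature value are illustrative assumptions rather than the authors' implementation:

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb: torch.Tensor, clip_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE lower bound on MI between paired query/clip embeddings.

    query_emb, clip_emb: (batch, dim) tensors; row i of each forms a positive
    pair, while the other rows in the batch act as negatives (an assumption
    made for this sketch).
    """
    q = F.normalize(query_emb, dim=-1)
    v = F.normalize(clip_emb, dim=-1)
    logits = q @ v.t() / temperature               # (batch, batch) similarities
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)        # positives on the diagonal
```

The same construction can be applied to the second MI term, contrasting the start/end frames of the target moment against the remaining frames of the video.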
Related papers
- Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding [108.79026216923984]
Video grounding aims to localize a spatio-temporal section in a video corresponding to an input text query.
This paper addresses a critical limitation in current video grounding methodologies by introducing an Open-Vocabulary Spatio-Temporal Video Grounding task.
arXiv Detail & Related papers (2023-12-31T13:53:37Z)
- Towards Debiasing Frame Length Bias in Text-Video Retrieval via Causal Intervention [72.12974259966592]
We present a unique and systematic study of a temporal bias due to frame length discrepancy between training and test sets of trimmed video clips.
We propose a causal debiasing approach and perform extensive experiments and ablation studies on the Epic-Kitchens-100, YouCook2, and MSR-VTT datasets.
arXiv Detail & Related papers (2023-09-17T15:58:27Z)
- Simplifying Open-Set Video Domain Adaptation with Contrastive Learning [16.72734794723157]
Unsupervised video domain adaptation methods have been proposed to adapt a predictive model from a labelled dataset to an unlabelled one.
We address a more realistic scenario, called open-set unsupervised video domain adaptation (OUVDA), where the target dataset contains "unknown" semantic categories that are not shared with the source.
We propose a video-oriented temporal contrastive loss that enables our method to better cluster the feature space by exploiting the freely available temporal information in video data.
arXiv Detail & Related papers (2023-01-09T13:16:50Z)
- Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks [76.35271072704384]
Deep learning models perform poorly when applied to videos with rare scenes or objects.
We tackle this problem from two different angles: algorithm and dataset.
We show that the debiased representation can generalize better when transferred to other datasets and tasks.
arXiv Detail & Related papers (2022-09-20T00:30:35Z)
- Multi-Contextual Predictions with Vision Transformer for Video Anomaly Detection [22.098399083491937]
Understanding the temporal context of a video plays a vital role in anomaly detection.
We design a transformer model with three different contextual prediction streams: masked, whole and partial.
By learning to predict the missing frames of consecutive normal frames, our model can effectively learn various normality patterns in the video.
arXiv Detail & Related papers (2022-06-17T05:54:31Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric, "dR@n,IoU@m", that discounts the basic recall scores to alleviate the inflated evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
- Towards Debiasing Temporal Sentence Grounding in Video [59.42702544312366]
The temporal sentence grounding in video (TSGV) task is to locate a temporal moment from an untrimmed video that matches a language query.
Without considering bias in moment annotations, many models tend to capture statistical regularities of the moment annotations.
We propose two debiasing strategies, data debiasing and model debiasing, to "force" a TSGV model to capture cross-modal interactions.
arXiv Detail & Related papers (2021-11-08T08:18:25Z)
- Deconfounded Video Moment Retrieval with Causal Intervention [80.90604360072831]
We tackle the task of video moment retrieval (VMR), which aims to localize a specific moment in a video according to a textual query.
Existing methods primarily model the matching relationship between query and moment by complex cross-modal interactions.
We propose a causality-inspired VMR framework that builds a structural causal model to capture the true effect of the query and video content on the prediction.
arXiv Detail & Related papers (2021-06-03T01:33:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.