Counterfactual Cross-modality Reasoning for Weakly Supervised Video
Moment Localization
- URL: http://arxiv.org/abs/2308.05648v2
- Date: Sat, 14 Oct 2023 16:16:31 GMT
- Title: Counterfactual Cross-modality Reasoning for Weakly Supervised Video
Moment Localization
- Authors: Zezhong Lv, Bing Su, Ji-Rong Wen
- Abstract summary: Video moment localization aims to retrieve the target segment of an untrimmed video according to the natural language query.
Recent works contrast the cross-modality similarities driven by reconstructing masked queries.
We propose a novel counterfactual cross-modality reasoning method.
- Score: 67.88493779080882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video moment localization aims to retrieve the target segment of an untrimmed
video according to the natural language query. Weakly supervised methods have
gained attention recently, as the precise temporal location of the target
segment is not always available. However, one of the greatest challenges for
weakly supervised methods lies in the mismatch between video and language
induced by the coarse temporal annotations. To refine the
vision-language alignment, recent works contrast the cross-modality
similarities driven by reconstructing masked queries between positive and
negative video proposals. However, the reconstruction may be influenced by the
latent spurious correlation between the unmasked and the masked parts, which
distorts the restoring process and further degrades the efficacy of contrastive
learning since the masked words are not completely reconstructed from the
cross-modality knowledge. In this paper, we discover and mitigate this spurious
correlation through a novel counterfactual cross-modality reasoning
method. Specifically, we first formulate query reconstruction as an aggregated
causal effect of cross-modality and query knowledge. Then by introducing
counterfactual cross-modality knowledge into this aggregation, the spurious
impact of the unmasked part contributing to the reconstruction is explicitly
modeled. Finally, by suppressing the unimodal effect of the masked query, we can
rectify the reconstructions of video proposals to perform reasonable
contrastive learning. Extensive experimental evaluations demonstrate the
effectiveness of our proposed method. The code is available at
\href{https://github.com/sLdZ0306/CCR}{https://github.com/sLdZ0306/CCR}.
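To make the counterfactual step above concrete, the following is a minimal sketch of the general recipe the abstract describes: score the masked-word reconstruction once with cross-modality knowledge (factual pass) and once with the query alone (counterfactual pass in which cross-modality knowledge is withheld), then subtract the unimodal term before contrasting positive and negative proposals. The function names, tensor shapes, and margin value are illustrative assumptions; the exact aggregation of causal effects may differ in the released CCR code, and this is not the authors' implementation.

```python
# Minimal illustrative sketch (not the official CCR implementation):
# counterfactual debiasing of masked-query reconstruction, followed by
# proposal-level contrastive learning.
import torch
import torch.nn.functional as F

def debiased_reconstruction_score(cross_modal_logits, query_only_logits, masked_ids):
    """Score how well masked words are reconstructed after removing the
    unimodal (query-only) effect exposed by a counterfactual pass.

    cross_modal_logits: [L, V] logits from video proposal + unmasked query (factual)
    query_only_logits:  [L, V] logits from the unmasked query alone
                        (counterfactual: cross-modality knowledge withheld)
    masked_ids:         [L]    ground-truth ids of the masked words
    """
    factual = F.log_softmax(cross_modal_logits, dim=-1)
    unimodal = F.log_softmax(query_only_logits, dim=-1)
    # Keep only the part of the reconstruction that relies on the video proposal.
    debiased = factual - unimodal
    return debiased.gather(-1, masked_ids.unsqueeze(-1)).mean()

def proposal_contrastive_loss(pos_score, neg_score, margin=0.1):
    # The positive proposal should reconstruct the masked query better than the negative.
    return F.relu(margin - (pos_score - neg_score))

# Example usage with random tensors (L=3 masked words, V=100 vocabulary):
# pos = debiased_reconstruction_score(torch.randn(3, 100), torch.randn(3, 100),
#                                     torch.tensor([5, 17, 42]))
```

The subtraction in log-probability space plays the role of removing the query-only (direct) effect, so a proposal is rewarded only for reconstruction ability that actually comes from the video content.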
Related papers
- Cycle-Consistency Uncertainty Estimation for Visual Prompting based One-Shot Defect Segmentation [0.0]
Industrial defect detection traditionally relies on supervised learning models trained on fixed datasets of known defect types.
Recent advances in visual prompting offer a solution by allowing models to adaptively infer novel categories based on provided visual cues.
We propose a solution that estimates the uncertainty of the visual prompting process via cycle-consistency.
arXiv Detail & Related papers (2024-09-21T02:25:32Z)
- SHINE: Saliency-aware HIerarchical NEgative Ranking for Compositional Temporal Grounding [52.98133831401225]
Temporal grounding, also known as video moment retrieval, aims at locating video segments corresponding to a given query sentence.
We propose a large language model-driven method for negative query construction, utilizing GPT-3.5-Turbo.
We introduce a coarse-to-fine saliency ranking strategy, which encourages the model to learn the multi-granularity semantic relationships between videos and hierarchical negative queries.
arXiv Detail & Related papers (2024-07-06T16:08:17Z)
- DiffusionVMR: Diffusion Model for Joint Video Moment Retrieval and Highlight Detection [38.12212015133935]
A novel framework, DiffusionVMR, is proposed to redefine the two tasks as a unified conditional denoising generation process.
Experiments conducted on five widely-used benchmarks demonstrate the effectiveness and flexibility of the proposed DiffusionVMR.
arXiv Detail & Related papers (2023-08-29T08:20:23Z)
- On the Importance of Spatial Relations for Few-shot Action Recognition [109.2312001355221]
In this paper, we investigate the importance of spatial relations and propose a more accurate few-shot action recognition method.
A novel Spatial Alignment Cross Transformer (SA-CT) learns to re-adjust the spatial relations and incorporates the temporal information.
Experiments reveal that, even without using any temporal information, the performance of SA-CT is comparable to temporal-based methods on 3 of 4 benchmarks.
arXiv Detail & Related papers (2023-08-14T12:58:02Z)
- Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation [102.25240608024063]
Referring image segmentation segments an image from a language expression.
We develop an algorithm that shifts from being localization-centric toward segmentation-language alignment.
Compared to its counterparts, our method is more versatile yet effective.
arXiv Detail & Related papers (2023-03-11T08:42:40Z)
- Deconfounded Video Moment Retrieval with Causal Intervention [80.90604360072831]
We tackle the task of video moment retrieval (VMR), which aims to localize a specific moment in a video according to a textual query.
Existing methods primarily model the matching relationship between query and moment by complex cross-modal interactions.
We propose a causality-inspired VMR framework that builds a structural causal model to capture the true effect of query and video content on the prediction.
arXiv Detail & Related papers (2021-06-03T01:33:26Z)
- Dynamic Facial Expression Recognition under Partial Occlusion with Optical Flow Reconstruction [20.28462460359439]
We propose a new solution based on an auto-encoder with skip connections to reconstruct the occluded part of the face in the optical flow domain.
Our experiments show that the proposed method significantly reduces the gap, in terms of recognition accuracy, between occluded and non-occluded situations.
arXiv Detail & Related papers (2020-12-24T12:28:47Z)
- Regularized Two-Branch Proposal Networks for Weakly-Supervised Moment Retrieval in Videos [108.55320735031721]
Video moment retrieval aims to localize the target moment in a video according to the given sentence.
Most existing weakly-supervised methods apply an MIL-based framework to develop inter-sample confrontment.
We propose a novel Regularized Two-Branch Proposal Network to simultaneously consider the inter-sample and intra-sample confrontments.
arXiv Detail & Related papers (2020-08-19T04:42:46Z)
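As a small aside on the last entry, the toy sketch below shows one common way to express inter-sample and intra-sample confrontment as hinge losses over proposal-query matching scores. The function name, score inputs, and margin values are assumptions for illustration, not the Regularized Two-Branch Proposal Network described in that paper.

```python
# Toy sketch (not the paper's code): inter- and intra-sample confrontment
# expressed as hinge losses for weakly supervised moment retrieval.
import torch.nn.functional as F

def confrontment_loss(pos_score, other_video_score, same_video_neg_score,
                      inter_margin=0.2, intra_margin=0.1):
    """Each score measures how well a proposal matches the query sentence.

    pos_score:            best proposal of the paired (positive) video
    other_video_score:    best proposal of an unpaired video (inter-sample negative)
    same_video_neg_score: a weaker proposal from the same video (intra-sample negative)
    """
    inter = F.relu(inter_margin - (pos_score - other_video_score))
    intra = F.relu(intra_margin - (pos_score - same_video_neg_score))
    return inter + intra
```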