The Devil is in the Spurious Correlation: Boosting Moment Retrieval via Temporal Dynamic Learning
- URL: http://arxiv.org/abs/2501.07305v1
- Date: Mon, 13 Jan 2025 13:13:06 GMT
- Title: The Devil is in the Spurious Correlation: Boosting Moment Retrieval via Temporal Dynamic Learning
- Authors: Xinyang Zhou, Fanyue Wei, Lixin Duan, Wen Li
- Abstract summary: We propose a temporal dynamic learning approach for moment retrieval, where two strategies are designed to mitigate the spurious correlation.
Our method establishes a new state-of-the-art performance on two popular benchmarks of moment retrieval, i.e., QVHighlights and Charades-STA.
- Score: 23.357772759438806
- Abstract: Given a textual query and a corresponding video, moment retrieval aims to localize the moments within the video that are relevant to the query. While existing transformer-based approaches have demonstrated commendable results, accurately predicting the temporal span of the target moment remains a major challenge. In this paper, we reveal that a crucial reason stems from the spurious correlation between text queries and the moment context: the model may associate the textual query with the background frames rather than the target moment. To address this issue, we propose a temporal dynamic learning approach for moment retrieval, with two strategies designed to mitigate the spurious correlation. First, we introduce a novel video synthesis approach that constructs a dynamic context for the relevant moment: by mixing separate yet similar videos, it trains our model to attend to the target moment of the corresponding query under varied dynamic contexts. Second, we enhance the representation by learning temporal dynamics. Beyond the visual representation, text queries are aligned with temporal dynamic representations, which enables our model to establish a non-spurious correlation between the query-related moment and its context. Together, these strategies largely alleviate the spurious correlation issue in moment retrieval. Our method establishes new state-of-the-art performance on two popular moment retrieval benchmarks, i.e., QVHighlights and Charades-STA, and detailed ablation analyses demonstrate the effectiveness of the proposed strategies. Our code will be publicly available.
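The abstract describes the two strategies only at a high level. As an illustration, the sketch below shows one plausible reading of them in PyTorch: splicing the ground-truth moment of one video into the context of another (the "dynamic context" synthesis), and aligning the query with frame-difference features as a stand-in for "temporal dynamic representations". All function names, the frame-difference proxy, and the InfoNCE-style loss are our assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def synthesize_dynamic_context(video_a, span_a, video_b):
    """Splice the ground-truth moment of video_a into video_b's context.

    video_a, video_b: (T, D) clip-level feature sequences.
    span_a: (start, end) clip indices of the target moment in video_a.
    Returns the synthesized sequence and the shifted ground-truth span.
    """
    start, end = span_a
    moment = video_a[start:end]  # clips of the target moment
    # Drop the moment at a random position inside the other video's context,
    # so the same query must be grounded under a different dynamic context.
    insert_at = torch.randint(0, video_b.size(0) + 1, (1,)).item()
    mixed = torch.cat([video_b[:insert_at], moment, video_b[insert_at:]], dim=0)
    return mixed, (insert_at, insert_at + moment.size(0))

def temporal_dynamics(video):
    """First-order frame differences as a crude proxy for temporal dynamics."""
    return video[1:] - video[:-1]  # (T-1, D)

def dynamic_alignment_loss(video, query_emb, span, tau=0.07):
    """Pull the query toward dynamics inside the moment, away from the context.

    Assumes the moment spans at least two clips, so that it contains
    at least one dynamic step.
    """
    dyn = F.normalize(temporal_dynamics(video), dim=-1)  # (T-1, D)
    q = F.normalize(query_emb, dim=-1)                   # (D,)
    sim = (dyn @ q) / tau                                # (T-1,)
    start, end = span
    in_moment = torch.zeros(sim.size(0), dtype=torch.bool)
    in_moment[start:end - 1] = True  # dynamic steps fully inside the moment
    # Soft InfoNCE: -log( sum_pos exp(sim) / sum_all exp(sim) ).
    return torch.logsumexp(sim, 0) - torch.logsumexp(sim[in_moment], 0)
```

A training step would presumably synthesize a mixed video per query and add the alignment loss to the usual span-prediction losses; the abstract does not specify how the terms are weighted.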
Related papers
- Frame Order Matters: A Temporal Sequence-Aware Model for Few-Shot Action Recognition [14.97527336050901]
We propose a novel Temporal Sequence-Aware Model (TSAM) for few-shot action recognition (FSAR).
It incorporates a sequential perceiver adapter into the pre-training framework, to integrate both the spatial information and the sequential temporal dynamics into the feature embeddings.
Experimental results on five FSAR datasets demonstrate that our method sets a new benchmark, outperforming the second-best competitors by large margins.
arXiv Detail & Related papers (2024-08-22T15:13:27Z) - Disentangle and denoise: Tackling context misalignment for video moment retrieval [16.939535169282262]
Video Moment Retrieval aims to locate in-context video moments according to a natural language query.
This paper proposes a cross-modal Context Denoising Network (CDNet) for accurate moment retrieval.
arXiv Detail & Related papers (2024-08-14T15:00:27Z) - Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement [72.7576395034068]
Video Corpus Moment Retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a text query.
We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task.
For video retrieval, we introduce a multi-modal collaborative video retriever, generating different query representations for the two modalities.
For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content.
arXiv Detail & Related papers (2024-02-21T07:16:06Z) - Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models [58.17315970207874]
We propose a zero-shot method for adapting generalisable visual-textual priors from an arbitrary VLM to facilitate moment-text alignment.
Experiments conducted on three VMR benchmark datasets demonstrate the notable performance advantages of our zero-shot algorithm.
arXiv Detail & Related papers (2023-09-01T13:06:50Z) - Temporal Relevance Analysis for Video Action Models [70.39411261685963]
We first propose a new approach to quantify the temporal relationships between frames captured by CNN-based action models.
We then conduct comprehensive experiments and in-depth analysis to provide a better understanding of how temporal modeling is affected.
arXiv Detail & Related papers (2022-04-25T19:06:48Z) - Deconfounded Video Moment Retrieval with Causal Intervention [80.90604360072831]
We tackle the task of video moment retrieval (VMR), which aims to localize a specific moment in a video according to a textual query.
Existing methods primarily model the matching relationship between query and moment by complex cross-modal interactions.
We propose a causality-inspired VMR framework that builds a structural causal model to capture the true effect of query and video content on the prediction.
arXiv Detail & Related papers (2021-06-03T01:33:26Z) - DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video [98.54696229182335]
We study the task of temporal moment localization in a long untrimmed video using a natural language query.
Our key innovation is to learn a video feature embedding through a language-conditioned message-passing algorithm.
A temporal sub-graph captures the activities within the video through time.
arXiv Detail & Related papers (2020-10-13T09:50:29Z) - Frame-wise Cross-modal Matching for Video Moment Retrieval [32.68921139236391]
Video moment retrieval aims at retrieving a moment in a video for a given language query.
The challenges of this task include 1) the requirement of localizing the relevant moment in an untrimmed video, and 2) bridging the semantic gap between textual query and video contents.
We propose an Attentive Cross-modal Relevance Matching model which predicts the temporal boundaries based on interaction modeling.
arXiv Detail & Related papers (2020-09-22T10:25:41Z) - Local-Global Video-Text Interactions for Temporal Grounding [77.5114709695216]
This paper addresses the problem of text-to-video temporal grounding, which aims to identify the time interval in a video semantically relevant to a text query.
We tackle this problem using a novel regression-based model that learns to extract a collection of mid-level features for semantic phrases in a text query.
The proposed method effectively predicts the target time interval by exploiting contextual information from local to global.
arXiv Detail & Related papers (2020-04-16T08:10:41Z)