Related papers: Disentangle and denoise: Tackling context misalignment for video moment retrieval

URL: http://arxiv.org/abs/2408.07600v1
Date: Wed, 14 Aug 2024 15:00:27 GMT
Title: Disentangle and denoise: Tackling context misalignment for video moment retrieval
Authors: Kaijing Ma, Han Fang, Xianghao Zang, Chao Ban, Lanxiang Zhou, Zhongjiang He, Yongxiang Li, Hao Sun, Zerun Feng, Xingsong Hou
Abstract summary: Video Moment Retrieval aims to locate in-context video moments according to a natural language query. This paper proposes a cross-modal Context Denoising Network (CDNet) for accurate moment retrieval.
Score: 16.939535169282262
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video Moment Retrieval, which aims to locate in-context video moments according to a natural language query, is an essential task for cross-modal grounding. Existing methods focus on enhancing the cross-modal interactions between all moments and the textual description for video understanding. However, constantly interacting with all locations is unreasonable because of uneven semantic distribution across the timeline and noisy visual backgrounds. This paper proposes a cross-modal Context Denoising Network (CDNet) for accurate moment retrieval by disentangling complex correlations and denoising irrelevant dynamics. Specifically, we propose a query-guided semantic disentanglement (QSD) to decouple video moments by estimating alignment levels according to the global and fine-grained correlation. A Context-aware Dynamic Denoisement (CDD) is proposed to enhance understanding of aligned spatial-temporal details by learning a group of query-relevant offsets. Extensive experiments on public benchmarks demonstrate that the proposed CDNet achieves state-of-the-art performance.

Related papers

Context-aware TFL: A Universal Context-aware Contrastive Learning Framework for Temporal Forgery Localization [60.73623588349311]
We propose a universal context-aware contrastive learning framework (UniCaCLF) for temporal forgery localization. Our approach leverages supervised contrastive learning to discover and identify forged instants by means of anomaly detection. An efficient context-aware contrastive coding is introduced to further push the limit of instant feature distinguishability between genuine and forged instants.
arXiv Detail & Related papers (2025-06-10T06:40:43Z)
Collaborative Temporal Consistency Learning for Point-supervised Natural Language Video Localization [129.43937834515688]
We propose a new COllaborative Temporal consistEncy Learning (COTEL) framework to strengthen the video-language alignment. Specifically, we first design a frame- and a segment-level Temporal Consistency Learning (TCL) module that models semantic alignment across frame saliencies and sentence-moment pairs.
arXiv Detail & Related papers (2025-03-22T05:04:12Z)
The Devil is in the Spurious Correlations: Boosting Moment Retrieval with Dynamic Learning [49.40254251698784]
We propose a dynamic learning approach for moment retrieval, where two strategies are designed to mitigate the spurious correlation. First, we introduce a novel video synthesis approach to construct a dynamic context for the queried moment. Second, to alleviate the over-association with backgrounds, we enhance representations temporally by incorporating text-dynamics interaction.
arXiv Detail & Related papers (2025-01-13T13:13:06Z)
SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation [35.063881868130075]
This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment. We propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment. We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin.
arXiv Detail & Related papers (2023-05-26T15:13:44Z)
Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding [78.71529237748018]
Grounding temporal video segments described in natural language queries effectively and efficiently is a crucial capability needed in vision-and-language fields. Most existing approaches adopt elaborately designed cross-modal interaction modules to improve the grounding performance. We propose a commonsense-aware cross-modal alignment framework, which incorporates commonsense-guided visual and text representations into a complementary common space.
arXiv Detail & Related papers (2022-04-04T13:07:05Z)
Hierarchical Deep Residual Reasoning for Temporal Moment Localization [48.108468456043994]
We propose a Hierarchical Deep Residual Reasoning (HDRR) model, which decomposes the video and sentence into multi-level representations with different semantics. We also design the simple yet effective Res-BiGRUs for feature fusion, which is able to grasp the useful information in a self-adapting manner.
arXiv Detail & Related papers (2021-10-31T07:13:34Z)
Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos. It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z)
Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing. Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video. We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
arXiv Detail & Related papers (2021-03-25T15:39:12Z)
Context-aware Biaffine Localizing Network for Temporal Sentence Grounding [61.18824806906945]
This paper addresses the problem of temporal sentence grounding (TSG). TSG aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query. We propose a novel localization framework that scores all pairs of start and end indices within the video simultaneously with a biaffine mechanism.
arXiv Detail & Related papers (2021-03-22T03:13:05Z)
Local-Global Video-Text Interactions for Temporal Grounding [77.5114709695216]
This paper addresses the problem of text-to-video temporal grounding, which aims to identify the time interval in a video semantically relevant to a text query. We tackle this problem using a novel regression-based model that learns to extract a collection of mid-level features for semantic phrases in a text query. The proposed method effectively predicts the target time interval by exploiting contextual information from local to global.
arXiv Detail & Related papers (2020-04-16T08:10:41Z)
Co-Saliency Spatio-Temporal Interaction Network for Person Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos. It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions. Multiple spatial-temporal interaction modules within CSTNet are proposed, which exploit the spatial and temporal long-range context interdependencies on such features and spatial-temporal information correlation.
arXiv Detail & Related papers (2020-04-10T10:23:58Z)

This list is automatically generated from the titles and abstracts of the papers on this site.