MS-DETR: Towards Effective Video Moment Retrieval and Highlight Detection by Joint Motion-Semantic Learning
- URL: http://arxiv.org/abs/2507.12062v1
- Date: Wed, 16 Jul 2025 09:18:18 GMT
- Title: MS-DETR: Towards Effective Video Moment Retrieval and Highlight Detection by Joint Motion-Semantic Learning
- Authors: Hongxu Ma, Guanshuo Wang, Fufu Yu, Qiong Jia, Shouhong Ding,
- Abstract summary: Video Moment Retrieval (MR) and Highlight Detection (HD) aim to pinpoint specific moments and assess clip-wise relevance based on the text query.<n>We propose the Motion-Semantics DETR (MS-DETR), a framework that captures rich motion-semantics features through unified learning for MR/HD tasks.<n>Our method outperforms existing state-of-the-art models by a margin.
- Score: 26.58473269689558
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Moment Retrieval (MR) and Highlight Detection (HD) aim to pinpoint specific moments and assess clip-wise relevance based on the text query. While DETR-based joint frameworks have made significant strides, there remains untapped potential in harnessing the intricate relationships between temporal motion and spatial semantics within video content. In this paper, we propose the Motion-Semantics DETR (MS-DETR), a framework that captures rich motion-semantics features through unified learning for MR/HD tasks. The encoder first explicitly models disentangled intra-modal correlations within motion and semantics dimensions, guided by the given text queries. Subsequently, the decoder utilizes the task-wise correlation across temporal motion and spatial semantics dimensions to enable precise query-guided localization for MR and refined highlight boundary delineation for HD. Furthermore, we observe the inherent sparsity dilemma within the motion and semantics dimensions of MR/HD datasets. To address this issue, we enrich the corpus from both dimensions by generation strategies and propose contrastive denoising learning to ensure the above components learn robustly and effectively. Extensive experiments on four MR/HD benchmarks demonstrate that our method outperforms existing state-of-the-art models by a margin. Our code is available at https://github.com/snailma0229/MS-DETR.git.
Related papers
- Multi-Timescale Motion-Decoupled Spiking Transformer for Audio-Visual Zero-Shot Learning [73.7808110878037]
This paper proposes a novel dual-stream Multi-Timescale Motion-Decoupled Spiking Transformer (MDST++)<n>By converting RGB images to events, our method captures motion information more accurately and mitigates background scene biases.<n>Our experiments validate the effectiveness of MDST++, demonstrating their consistent superiority over state-of-the-art methods on mainstream benchmarks.
arXiv Detail & Related papers (2025-05-26T13:06:01Z) - Bridging Vision and Language: Modeling Causality and Temporality in Video Narratives [0.0]
We propose an enhanced framework that integrates a Causal-Temporal Reasoning Module into state-of-the-art LVLMs.<n>CTRM comprises two key components: the Causal Dynamics (CDE) and the Temporal Learner (TRL)<n>We design a multi-stage learning strategy to optimize the model, combining pre-training on large-scale video-text datasets.
arXiv Detail & Related papers (2024-12-14T07:28:38Z) - Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Temporal Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Our method ranks $1st$ on the Semantic KITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z) - TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and
Highlight Detection [9.032057312774564]
Video moment retrieval (MR) and highlight detection (HD) based on natural language queries are two highly related tasks.
Several methods have been devoted to building DETR-based networks to solve both MR and HD jointly.
We propose a task-reciprocal transformer based on DETR (TR-DETR) that focuses on exploring the inherent reciprocity between MR and HD.
arXiv Detail & Related papers (2024-01-04T14:55:57Z) - Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models [58.17315970207874]
We propose a zero-shot method for adapting generalisable visual-textual priors from arbitrary VLM to facilitate moment-text alignment.
Experiments conducted on three VMR benchmark datasets demonstrate the notable performance advantages of our zero-shot algorithm.
arXiv Detail & Related papers (2023-09-01T13:06:50Z) - Self-Supervised Video Representation Learning via Latent Time Navigation [12.721647696921865]
Self-supervised video representation learning aims at maximizing similarity between different temporal segments of one video.
We propose Latent Time Navigation (LTN) to capture fine-grained motions.
Our experimental analysis suggests that learning video representations by LTN consistently improves performance of action classification.
arXiv Detail & Related papers (2023-05-10T20:06:17Z) - Implicit Motion Handling for Video Camouflaged Object Detection [60.98467179649398]
We propose a new video camouflaged object detection (VCOD) framework.
It can exploit both short-term and long-term temporal consistency to detect camouflaged objects from video frames.
arXiv Detail & Related papers (2022-03-14T17:55:41Z) - Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for
Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z) - Exploring Motion and Appearance Information for Temporal Sentence
Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN significantly outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z) - Hierarchical Deep Residual Reasoning for Temporal Moment Localization [48.108468456043994]
We propose a Hierarchical Deep Residual Reasoning (HDRR) model, which decomposes the video and sentence into multi-level representations with different semantics.
We also design the simple yet effective Res-BiGRUs for feature fusion, which is able to grasp the useful information in a self-adapting manner.
arXiv Detail & Related papers (2021-10-31T07:13:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.