Grounding-MD: Grounded Video-language Pre-training for Open-World Moment Detection
- URL: http://arxiv.org/abs/2504.14553v1
- Date: Sun, 20 Apr 2025 09:54:25 GMT
- Title: Grounding-MD: Grounded Video-language Pre-training for Open-World Moment Detection
- Authors: Weijun Zhuang, Qizhang Li, Xin Li, Ming Liu, Xiaopeng Hong, Feng Gao, Fan Yang, Wangmeng Zuo,
- Abstract summary: Grounding-MD is an innovative, grounded video-language pre-training framework tailored for open-world moment detection.<n>Our framework incorporates an arbitrary number of open-ended natural language queries through a structured prompt mechanism.<n>Grounding-MD demonstrates exceptional semantic representation learning capabilities, effectively handling diverse and complex query conditions.
- Score: 67.70328796057466
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal Action Detection and Moment Retrieval constitute two pivotal tasks in video understanding, focusing on precisely localizing temporal segments corresponding to specific actions or events. Recent advancements introduced Moment Detection to unify these two tasks, yet existing approaches remain confined to closed-set scenarios, limiting their applicability in open-world contexts. To bridge this gap, we present Grounding-MD, an innovative, grounded video-language pre-training framework tailored for open-world moment detection. Our framework incorporates an arbitrary number of open-ended natural language queries through a structured prompt mechanism, enabling flexible and scalable moment detection. Grounding-MD leverages a Cross-Modality Fusion Encoder and a Text-Guided Fusion Decoder to facilitate comprehensive video-text alignment and enable effective cross-task collaboration. Through large-scale pre-training on temporal action detection and moment retrieval datasets, Grounding-MD demonstrates exceptional semantic representation learning capabilities, effectively handling diverse and complex query conditions. Comprehensive evaluations across four benchmark datasets including ActivityNet, THUMOS14, ActivityNet-Captions, and Charades-STA demonstrate that Grounding-MD establishes new state-of-the-art performance in zero-shot and supervised settings in open-world moment detection scenarios. All source code and trained models will be released.
Related papers
- Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction.<n>To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism.<n>Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z) - Open-Vocabulary Spatio-Temporal Action Detection [59.91046192096296]
Open-vocabulary-temporal action detection (OV-STAD) is an important fine-grained video understanding task.
OV-STAD requires training a model on a limited set of base classes with box and label supervision.
To better adapt the holistic VLM for the fine-grained action detection task, we carefully fine-tune it on the localized video region-text pairs.
arXiv Detail & Related papers (2024-05-17T14:52:47Z) - UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection [19.595956464166548]
Temporal Action Detection (TAD) focuses on detecting pre-defined actions, while Moment Retrieval (MR) aims to identify the events described by open-ended natural language within untrimmed videos.
We propose a unified architecture, termed Unified Moment Detection (UniMD), for both TAD and MR.
It transforms the inputs of the two tasks, namely actions for TAD or events for MR, into a common embedding space, and utilizes two novel query-dependent decoders to generate a uniform output of classification score and temporal segments.
arXiv Detail & Related papers (2024-04-07T12:14:42Z) - What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions [55.574102714832456]
spatial-temporal grounding describes the task of localizing events in space and time.
Models for this task are usually trained with human-annotated sentences and bounding box supervision.
We combine local representation learning, which focuses on fine-grained spatial information, with a global representation that captures higher-level representations.
arXiv Detail & Related papers (2023-03-29T19:38:23Z) - MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection [37.25262046781015]
Action detection is an essential and challenging task, especially for densely labelled datasets of untrimmed videos.
We propose a novel ConvTransformer network for action detection that efficiently captures both short-term and long-term temporal information.
Our network outperforms the state-of-the-art methods on all three datasets.
arXiv Detail & Related papers (2021-12-07T18:57:37Z) - Weakly Supervised Temporal Action Localization Through Learning Explicit
Subspaces for Action and Context [151.23835595907596]
Methods learn to localize temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces respectively for actions and their context.
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
arXiv Detail & Related papers (2021-03-30T08:26:53Z) - Language Guided Networks for Cross-modal Moment Retrieval [66.49445903955777]
Cross-modal moment retrieval aims to localize a temporal segment from an untrimmed video described by a natural language query.
Existing methods independently extract the features of videos and sentences.
We present Language Guided Networks (LGN), a new framework that leverages the sentence embedding to guide the whole process of moment retrieval.
arXiv Detail & Related papers (2020-06-18T12:08:40Z) - Local-Global Video-Text Interactions for Temporal Grounding [77.5114709695216]
This paper addresses the problem of text-to-video temporal grounding, which aims to identify the time interval in a video semantically relevant to a text query.
We tackle this problem using a novel regression-based model that learns to extract a collection of mid-level features for semantic phrases in a text query.
The proposed method effectively predicts the target time interval by exploiting contextual information from local to global.
arXiv Detail & Related papers (2020-04-16T08:10:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.