UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection
- URL: http://arxiv.org/abs/2404.04933v2
- Date: Thu, 11 Jul 2024 12:03:35 GMT
- Title: UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection
- Authors: Yingsen Zeng, Yujie Zhong, Chengjian Feng, Lin Ma
- Abstract summary: Temporal Action Detection (TAD) focuses on detecting pre-defined actions, while Moment Retrieval (MR) aims to identify the events described by open-ended natural language within untrimmed videos.
We propose a unified architecture, termed Unified Moment Detection (UniMD), for both TAD and MR.
It transforms the inputs of the two tasks, namely actions for TAD and events for MR, into a common embedding space, and utilizes two novel query-dependent decoders to generate a uniform output of classification scores and temporal segments.
- Score: 19.595956464166548
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal Action Detection (TAD) focuses on detecting pre-defined actions, while Moment Retrieval (MR) aims to identify the events described by open-ended natural language within untrimmed videos. Although the two tasks focus on different events, we observe a significant connection between them: for instance, most descriptions in MR involve multiple actions from TAD. In this paper, we investigate the potential synergy between TAD and MR. First, we propose a unified architecture, termed Unified Moment Detection (UniMD), for both TAD and MR. It transforms the inputs of the two tasks, namely actions for TAD and events for MR, into a common embedding space, and utilizes two novel query-dependent decoders to generate a uniform output of classification scores and temporal segments. Second, we explore the efficacy of two task-fusion learning approaches, pre-training and co-training, to enhance the mutual benefit between TAD and MR. Extensive experiments demonstrate that the proposed task-fusion learning scheme enables the two tasks to help each other and to outperform their separately trained counterparts. Impressively, UniMD achieves state-of-the-art results on three paired datasets: Ego4D, Charades-STA, and ActivityNet. Our code is available at https://github.com/yingsen1/UniMD.
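To make the decoder design concrete, here is a minimal PyTorch sketch of the idea described in the abstract: a text query (a TAD action name or an MR sentence) is projected into a common embedding space shared with the video features, and two query-dependent heads emit a classification score and a candidate temporal segment at every time step. All module and parameter names below are illustrative assumptions, not the actual API of the UniMD repository.

```python
import torch
import torch.nn as nn

class QueryDependentDecoders(nn.Module):
    """Minimal sketch of the UniMD idea: a text query (a TAD action name or
    an MR sentence) is mapped into a common embedding space, and two
    query-dependent heads emit a classification score and a temporal segment
    per time step. Names and shapes are assumptions, not the repo's API."""

    def __init__(self, video_dim=256, text_dim=256, hidden=256):
        super().__init__()
        self.query_proj = nn.Linear(text_dim, hidden)   # common embedding space
        self.video_proj = nn.Linear(video_dim, hidden)
        # classification head: fuses the query with each clip feature
        self.cls_head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                      nn.Linear(hidden, 1))
        # regression head: predicts distances to the segment start/end
        self.reg_head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                      nn.Linear(hidden, 2))

    def forward(self, video_feats, query_emb):
        # video_feats: (T, video_dim) clip features; query_emb: (text_dim,)
        q = self.query_proj(query_emb).expand(video_feats.size(0), -1)
        v = self.video_proj(video_feats)
        fused = torch.cat([v, q], dim=-1)
        scores = self.cls_head(fused).sigmoid().squeeze(-1)  # (T,) per-step score
        offsets = self.reg_head(fused).relu()                # (T, 2) start/end dists
        return scores, offsets

# Usage: the same forward pass serves both tasks; only the query changes.
model = QueryDependentDecoders()
video = torch.randn(128, 256)        # 128 clip features
action_query = torch.randn(256)      # embedding of an action name (TAD)
sentence_query = torch.randn(256)    # embedding of an MR description
for query in (action_query, sentence_query):
    scores, offsets = model(video, query)
    t = scores.argmax()
    start, end = t - offsets[t, 0], t + offsets[t, 1]
```

Because both tasks share this single query-conditioned interface, TAD and MR samples could be mixed in one training batch, which is one way the co-training the paper explores might be realized.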
Related papers
- Grounding-MD: Grounded Video-language Pre-training for Open-World Moment Detection [67.70328796057466]
Grounding-MD is an innovative, grounded video-language pre-training framework tailored for open-world moment detection.
Our framework incorporates an arbitrary number of open-ended natural language queries through a structured prompt mechanism.
Grounding-MD demonstrates exceptional semantic representation learning capabilities, effectively handling diverse and complex query conditions.
arXiv Detail & Related papers (2025-04-20T09:54:25Z)
- Bidirectional Decoding: Improving Action Chunking via Closed-Loop Resampling [51.38330727868982]
Bidirectional Decoding (BID) is a test-time inference algorithm that bridges action chunking with closed-loop operations.
We show that BID boosts the performance of two state-of-the-art generative policies across seven simulation benchmarks and two real-world tasks.
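As a rough illustration of what "bridging action chunking with closed-loop operations" can look like, the sketch below resamples several candidate chunks at each decision step and keeps the one most coherent with the previously committed chunk. This captures only the backward-coherence half of the idea (BID also uses a forward-looking criterion), and all names here are hypothetical.

```python
import torch

def bid_select(policy_sample, prev_chunk, n_samples=8, overlap=4):
    """Simplified backward-coherence resampling: sample several action chunks
    and keep the one whose start best agrees with the yet-unexecuted tail of
    the previously committed chunk. A hedged sketch, not the full BID method.
    policy_sample: callable returning an action chunk of shape (H, action_dim).
    prev_chunk: (H, action_dim) chunk committed at the previous step."""
    candidates = [policy_sample() for _ in range(n_samples)]
    tail = prev_chunk[1:1 + overlap]   # what the old plan intended for this window
    losses = [((c[:overlap] - tail) ** 2).sum() for c in candidates]
    return candidates[int(torch.stack(losses).argmin())]
```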
arXiv Detail & Related papers (2024-08-30T15:39:34Z)
- Narrative Action Evaluation with Prompt-Guided Multimodal Interaction [60.281405999483]
Narrative action evaluation (NAE) aims to generate professional commentary that evaluates the execution of an action.
NAE is a more challenging task because it requires both narrative flexibility and evaluation rigor.
We propose a prompt-guided multimodal interaction framework to facilitate the interaction between different modalities of information.
arXiv Detail & Related papers (2024-04-22T17:55:07Z)
- Task-Driven Exploration: Decoupling and Inter-Task Feedback for Joint Moment Retrieval and Highlight Detection [7.864892339833315]
We propose a novel task-driven top-down framework for joint moment retrieval and highlight detection.
The framework introduces a task-decoupled unit to capture task-specific and common representations.
Comprehensive experiments and in-depth ablation studies on QVHighlights, TVSum, and Charades-STA datasets corroborate the effectiveness and flexibility of the proposed framework.
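For intuition, a task-decoupled unit of this kind can be as simple as a shared branch feeding two task-specific branches. The minimal sketch below is a generic reading of that idea; layer names and sizes are assumed for illustration rather than taken from the paper.

```python
import torch
import torch.nn as nn

class TaskDecoupledUnit(nn.Module):
    """Generic sketch: a shared branch captures common representations while
    per-task branches keep moment-retrieval (MR) and highlight-detection (HD)
    features separate. An assumption-labeled sketch, not the paper's design."""

    def __init__(self, dim=256):
        super().__init__()
        self.shared = nn.Linear(dim, dim)     # common representation
        self.mr_branch = nn.Linear(dim, dim)  # moment-retrieval specific
        self.hd_branch = nn.Linear(dim, dim)  # highlight-detection specific

    def forward(self, x):
        common = torch.relu(self.shared(x))
        # residual connections keep the shared input accessible to both heads
        return self.mr_branch(common) + x, self.hd_branch(common) + x
```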
arXiv Detail & Related papers (2024-04-14T14:06:42Z)
- Unified Demonstration Retriever for In-Context Learning [56.06473069923567]
Unified Demonstration Retriever (UDR) is a single model that retrieves demonstrations for a wide range of tasks.
We propose a multi-task list-wise ranking training framework, with an iterative mining strategy to find high-quality candidates.
Experiments on 30+ tasks across 13 task families and multiple data domains show that UDR significantly outperforms baselines.
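A list-wise ranking objective of the kind mentioned here typically matches the retriever's score distribution over a candidate list to a target distribution. The sketch below shows a generic ListNet/KL-style version; it is an assumption for illustration, not UDR's exact objective.

```python
import torch
import torch.nn.functional as F

def listwise_ranking_loss(scores, teacher_scores):
    """Generic list-wise loss: align the retriever's distribution over a
    candidate list with a target distribution (e.g., LM-feedback scores)."""
    return F.kl_div(F.log_softmax(scores, dim=-1),
                    F.softmax(teacher_scores, dim=-1),
                    reduction="batchmean")

scores = torch.randn(4, 16)    # retriever scores for 16 candidates per query
teacher = torch.randn(4, 16)   # target scores for the same candidate lists
loss = listwise_ranking_loss(scores, teacher)
```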
arXiv Detail & Related papers (2023-05-07T16:07:11Z)
- DropMAE: Learning Representations via Masked Autoencoders with Spatial-Attention Dropout for Temporal Matching Tasks [77.84636815364905]
This paper studies masked autoencoder (MAE) video pre-training for various temporal matching-based downstream tasks.
We propose DropMAE, which adaptively performs spatial-attention dropout in the frame reconstruction to facilitate temporal correspondence learning in videos.
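To make "spatial-attention dropout" concrete: during reconstruction, attention logits between tokens of the same frame can be randomly suppressed so the decoder must lean on cross-frame (temporal) cues. The sketch below is a simplified reading of that mechanism; the tensor layout and mask convention are assumptions.

```python
import torch

def spatial_attention_dropout(attn_logits, same_frame, p=0.1):
    """Drop a random fraction of within-frame (spatial) attention logits so
    reconstruction relies more on temporal cues. Simplified sketch of the idea.
    attn_logits: (B, heads, Q, K) pre-softmax scores over two frames' tokens.
    same_frame: (Q, K) bool, True where query and key are in the same frame."""
    drop = (torch.rand_like(attn_logits) < p) & same_frame
    # cross-frame entries are never dropped, so every softmax row stays finite
    return attn_logits.masked_fill(drop, float("-inf")).softmax(dim=-1)

B, H, Q = 2, 4, 32                        # two frames of 16 tokens each
same_frame = torch.zeros(Q, Q, dtype=torch.bool)
same_frame[:16, :16] = True
same_frame[16:, 16:] = True
attn = spatial_attention_dropout(torch.randn(B, H, Q, Q), same_frame, p=0.1)
```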
arXiv Detail & Related papers (2023-04-02T16:40:42Z)
- HyRSM++: Hybrid Relation Guided Temporal Set Matching for Few-shot Action Recognition [51.2715005161475]
We propose a novel Hybrid Relation guided temporal Set Matching approach for few-shot action recognition.
The core idea of HyRSM++ is to integrate all videos within the task to learn discriminative representations.
We show that our method achieves state-of-the-art performance under various few-shot settings.
arXiv Detail & Related papers (2023-01-09T13:32:50Z)
- A Unified Multi-task Learning Framework for Multi-goal Conversational Recommender Systems [91.70511776167488]
Multi-goal conversational recommender systems (MG-CRS) often involve four tasks: Goal Planning, Topic Prediction, Item Recommendation, and Response Generation.
We propose a novel Unified MultI-goal conversational recommeNDer system, namely UniMIND.
Prompt-based learning strategies are investigated to endow the unified model with the capability of multi-task learning.
arXiv Detail & Related papers (2022-04-14T12:31:27Z)
- MRI-based Multi-task Decoupling Learning for Alzheimer's Disease Detection and MMSE Score Prediction: A Multi-site Validation [9.427540028148963]
Accurately detecting Alzheimer's disease (AD) and predicting mini-mental state examination (MMSE) scores from magnetic resonance imaging (MRI) are important tasks in elderly healthcare.
Most of the previous methods on these two tasks are based on single-task learning and rarely consider the correlation between them.
We propose an MRI-based multi-task decoupled learning method for AD detection and MMSE score prediction.
arXiv Detail & Related papers (2022-04-02T09:19:18Z)
- DARER: Dual-task Temporal Relational Recurrent Reasoning Network for Joint Dialog Sentiment Classification and Act Recognition [39.76268402567324]
The joint task of dialog sentiment classification (DSC) and dialog act recognition (DAR) aims to simultaneously predict the sentiment label and the act label for each utterance in a dialog.
We put forward a new framework that models the explicit dependencies between the two tasks by integrating prediction-level interactions.
To implement our framework, we propose a novel model dubbed DARER, which first generates the context-, speaker- and temporal-sensitive utterance representations.
arXiv Detail & Related papers (2022-03-08T05:19:18Z)
- CTRN: Class-Temporal Relational Network for Action Detection [7.616556723260849]
We introduce an end-to-end network: the Class-Temporal Relational Network (CTRN).
CTRN contains three key components: The Transform Representation Module, the Class-Temporal Module and the G-classifier.
We evaluate CTRN on three densely labelled datasets and achieve state-of-the-art performance.
arXiv Detail & Related papers (2021-10-26T08:15:47Z)
- Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method turns out to achieve state-of-the-art performances on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)
- Learning End-to-End Action Interaction by Paired-Embedding Data Augmentation [10.857323240766428]
We introduce a new Interactive Action Translation (IAT) task, which aims to learn end-to-end action interaction from unlabeled interactive pairs.
We propose a Paired-Embedding (PE) method for effective and reliable data augmentation.
Experimental results on two datasets show the effectiveness and broad application prospects of our method.
arXiv Detail & Related papers (2020-07-16T01:54:16Z)