MH-DETR: Video Moment and Highlight Detection with Cross-modal
Transformer
- URL: http://arxiv.org/abs/2305.00355v1
- Date: Sat, 29 Apr 2023 22:50:53 GMT
- Title: MH-DETR: Video Moment and Highlight Detection with Cross-modal
Transformer
- Authors: Yifang Xu, Yunzhuo Sun, Yang Li, Yilei Shi, Xiaoxiang Zhu, Sidan Du
- Abstract summary: We propose MH-DETR (Moment and Highlight Detection Transformer), tailored for video moment and highlight detection (MHD).
We introduce a simple yet efficient pooling operator within the uni-modal encoder to capture global intra-modal context.
To obtain temporally aligned cross-modal features, we design a plug-and-play cross-modal interaction module between the encoder and decoder.
- Score: 17.29632719667594
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the increasing demand for video understanding, video moment and
highlight detection (MHD) has emerged as a critical research topic. MHD aims to
localize all moments and predict clip-wise saliency scores simultaneously.
Despite progress made by existing DETR-based methods, we observe that these
methods coarsely fuse features from different modalities, which weakens the
temporal intra-modal context and results in insufficient cross-modal
interaction. To address this issue, we propose MH-DETR (Moment and Highlight
Detection Transformer) tailored for MHD. Specifically, we introduce a simple
yet efficient pooling operator within the uni-modal encoder to capture global
intra-modal context. Moreover, to obtain temporally aligned cross-modal
features, we design a plug-and-play cross-modal interaction module between the
encoder and decoder, seamlessly integrating visual and textual features.
Comprehensive experiments on QVHighlights, Charades-STA, Activity-Net, and
TVSum datasets show that MH-DETR outperforms existing state-of-the-art methods,
demonstrating its effectiveness and superiority. Our code is available at
https://github.com/YoucanBaby/MH-DETR.
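To make the pipeline described in the abstract concrete, below is a minimal PyTorch-style sketch of an MH-DETR-like model: uni-modal encoders with a simple pooling operator for global intra-modal context, a plug-and-play cross-modal interaction module between encoder and decoder, and a DETR-style decoder that predicts moment spans plus clip-wise saliency scores. All module names, dimensions, layer counts, and heads here are illustrative assumptions, not the authors' implementation; see https://github.com/YoucanBaby/MH-DETR for the released code.

```python
# Illustrative sketch only; hyperparameters and module designs are assumptions.
import torch
import torch.nn as nn


class UniModalEncoder(nn.Module):
    """Self-attention encoder plus a simple global-pooling operator (assumed design)."""

    def __init__(self, dim: int = 256, heads: int = 8, layers: int = 2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.gate = nn.Linear(dim * 2, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        h = self.encoder(x)
        # Pooling operator: broadcast the mean-pooled global context back to
        # every position and fuse it through a learned projection.
        g = h.mean(dim=1, keepdim=True).expand_as(h)
        return self.gate(torch.cat([h, g], dim=-1))


class CrossModalInteraction(nn.Module):
    """Plug-and-play fusion: video clip features attend to query text tokens."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vid: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        fused, _ = self.attn(query=vid, key=txt, value=txt)
        return self.norm(vid + fused)  # temporally aligned cross-modal features


class MomentHighlightDETR(nn.Module):
    def __init__(self, dim: int = 256, num_queries: int = 10):
        super().__init__()
        self.video_enc = UniModalEncoder(dim)
        self.text_enc = UniModalEncoder(dim)
        self.fusion = CrossModalInteraction(dim)
        dec_layer = nn.TransformerDecoderLayer(dim, 8, dim * 4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, 2)
        self.queries = nn.Embedding(num_queries, dim)      # learnable moment queries
        self.span_head = nn.Linear(dim, 2)                  # (center, width) per query
        self.saliency_head = nn.Linear(dim, 1)              # clip-wise highlight score

    def forward(self, video_feats: torch.Tensor, text_feats: torch.Tensor):
        v = self.video_enc(video_feats)                      # (B, T, D)
        t = self.text_enc(text_feats)                        # (B, L, D)
        memory = self.fusion(v, t)                           # (B, T, D)
        q = self.queries.weight.unsqueeze(0).expand(video_feats.size(0), -1, -1)
        dec = self.decoder(q, memory)                        # (B, num_queries, D)
        spans = self.span_head(dec).sigmoid()                # normalized moment spans
        saliency = self.saliency_head(memory).squeeze(-1)    # (B, T) saliency scores
        return spans, saliency


if __name__ == "__main__":
    model = MomentHighlightDETR()
    video = torch.randn(2, 75, 256)   # e.g. 75 clips of pre-extracted features
    text = torch.randn(2, 12, 256)    # e.g. 12 query tokens
    spans, saliency = model(video, text)
    print(spans.shape, saliency.shape)  # (2, 10, 2) and (2, 75)
```

As in other DETR-based MHD methods, the predicted spans would be matched to ground-truth moments (e.g. via Hungarian matching) and the saliency scores supervised per clip; those training details are omitted from this sketch.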
Related papers
- DMM: Disparity-guided Multispectral Mamba for Oriented Object Detection in Remote Sensing [8.530409994516619]
Multispectral oriented object detection faces challenges due to both inter-modal and intra-modal discrepancies.
We propose Disparity-guided Multispectral Mamba (DMM), a framework comprised of a Disparity-guided Cross-modal Fusion Mamba (DCFM) module, a Multi-scale Target-aware Attention (MTA) module, and a Target-Prior Aware (TPA) auxiliary task.
arXiv Detail & Related papers (2024-07-11T02:09:59Z)
- TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection [9.032057312774564]
Video moment retrieval (MR) and highlight detection (HD) based on natural language queries are two highly related tasks.
Several methods have been devoted to building DETR-based networks to solve both MR and HD jointly.
We propose a task-reciprocal transformer based on DETR (TR-DETR) that focuses on exploring the inherent reciprocity between MR and HD.
arXiv Detail & Related papers (2024-01-04T14:55:57Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- UMMAFormer: A Universal Multimodal-adaptive Transformer Framework for Temporal Forgery Localization [16.963092523737593]
We propose a novel framework for temporal forgery localization (TFL) that predicts forgery segments with multimodal adaptation.
Our approach achieves state-of-the-art performance on benchmark datasets, including Lav-DF, TVIL, and Psynd.
arXiv Detail & Related papers (2023-08-28T08:20:30Z)
- MRTNet: Multi-Resolution Temporal Network for Video Sentence Grounding [70.82093938170051]
We propose a novel multi-resolution temporal video sentence grounding network: MRTNet.
MRTNet consists of a multi-modal feature encoder, a Multi-Resolution Temporal (MRT) module, and a predictor module.
Our MRT module is hot-pluggable, which means it can be seamlessly incorporated into any anchor-free models.
arXiv Detail & Related papers (2022-12-26T13:48:05Z)
- AntPivot: Livestream Highlight Detection via Hierarchical Attention Mechanism [64.70568612993416]
We formulate a new task, Livestream Highlight Detection, discuss and analyze its difficulties, and propose a novel architecture, AntPivot, to solve this problem.
We construct a fully-annotated dataset AntHighlight to instantiate this task and evaluate the performance of our model.
arXiv Detail & Related papers (2022-06-10T05:58:11Z)
- Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding [78.71529237748018]
Grounding temporal video segments described in natural language queries effectively and efficiently is a crucial capability needed in vision-and-language fields.
Most existing approaches adopt elaborately designed cross-modal interaction modules to improve the grounding performance.
We propose a commonsense-aware cross-modal alignment framework, which incorporates commonsense-guided visual and text representations into a complementary common space.
arXiv Detail & Related papers (2022-04-04T13:07:05Z)
- TubeDETR: Spatio-Temporal Video Grounding with Transformers [89.71617065426146]
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query.
To address this task, we propose TubeDETR, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection.
arXiv Detail & Related papers (2022-03-30T16:31:49Z)
- UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection [46.25856560381347]
We present the first unified framework, named Unified Multi-modal Transformers (UMT).
UMT is capable of realizing joint optimization of moment retrieval and highlight detection, while it can also be easily degenerated to solve either problem individually.
As far as we are aware, this is the first scheme to integrate multi-modal (visual-audio) learning for either joint optimization or the individual moment retrieval task.
arXiv Detail & Related papers (2022-03-23T22:11:43Z)
- Cross-Modality Fusion Transformer for Multispectral Object Detection [0.0]
Multispectral image pairs provide combined information, making object detection applications more reliable and robust.
We present a simple yet effective cross-modality feature fusion approach, named Cross-Modality Fusion Transformer (CFT) in this paper.
arXiv Detail & Related papers (2021-10-30T15:34:12Z)
- Deconfounded Video Moment Retrieval with Causal Intervention [80.90604360072831]
We tackle the task of video moment retrieval (VMR), which aims to localize a specific moment in a video according to a textual query.
Existing methods primarily model the matching relationship between query and moment by complex cross-modal interactions.
We propose a causality-inspired VMR framework that builds structural causal model to capture the true effect of query and video content on the prediction.
arXiv Detail & Related papers (2021-06-03T01:33:26Z)