TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and
Highlight Detection
- URL: http://arxiv.org/abs/2401.02309v2
- Date: Fri, 5 Jan 2024 03:11:28 GMT
- Title: TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and
Highlight Detection
- Authors: Hao Sun, Mingyao Zhou, Wenjing Chen, Wei Xie
- Abstract summary: Video moment retrieval (MR) and highlight detection (HD) based on natural language queries are two highly related tasks.
Several recent methods build DETR-based networks to solve MR and HD jointly.
We propose a task-reciprocal transformer based on DETR (TR-DETR) that focuses on exploring the inherent reciprocity between MR and HD.
- Score: 9.032057312774564
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video moment retrieval (MR) and highlight detection (HD) based on natural
language queries are two highly related tasks that aim to obtain relevant
moments within videos and highlight scores for each video clip. Recently,
several methods have built DETR-based networks to solve MR and HD jointly.
These methods simply add two separate task heads after multi-modal feature
extraction and feature interaction, achieving good performance. Nevertheless,
they underutilize the reciprocal relationship between the two tasks. In this
paper, we propose a task-reciprocal
transformer based on DETR (TR-DETR) that focuses on exploring the inherent
reciprocity between MR and HD. Specifically, a local-global multi-modal
alignment module is first built to align features from diverse modalities into
a shared latent space. Subsequently, a visual feature refinement module is designed to
eliminate query-irrelevant information from visual features for modal
interaction. Finally, a task cooperation module is constructed to refine the
retrieval pipeline and the highlight score prediction process by utilizing the
reciprocity between MR and HD. Comprehensive experiments on QVHighlights,
Charades-STA and TVSum datasets demonstrate that TR-DETR outperforms existing
state-of-the-art methods. Code is available at
https://github.com/mingyao1120/TR-DETR.
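As a rough illustration of the pipeline described in the abstract, here is a minimal PyTorch sketch of the three stages: shared-space alignment, query-conditioned visual refinement, and task cooperation in which highlight scores re-weight the features used for moment retrieval. Every internal choice below (the sigmoid gate, the mean-pooled query, the single span head) is an assumption made for illustration, not the paper's actual design; see the linked repository for the real implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TRDETRSketch(nn.Module):
    """Toy sketch of the three TR-DETR stages named in the abstract.

    All module internals are illustrative assumptions; the paper builds
    on a full DETR encoder-decoder, which is omitted here.
    """

    def __init__(self, vis_dim=512, txt_dim=512, dim=256):
        super().__init__()
        # (1) Local-global multi-modal alignment: project both modalities
        # into a shared latent space.
        self.vis_proj = nn.Linear(vis_dim, dim)
        self.txt_proj = nn.Linear(txt_dim, dim)
        # (2) Visual feature refinement: a query-conditioned gate that
        # suppresses query-irrelevant visual content (assumed form).
        self.gate = nn.Linear(2 * dim, dim)
        # (3) Task heads: per-clip highlight scores (HD) and a span
        # regressed from highlight-weighted features (MR).
        self.hd_head = nn.Linear(dim, 1)
        self.mr_head = nn.Linear(dim, 2)  # (center, width) in [0, 1]

    def forward(self, vis, txt):
        # vis: (B, T, vis_dim) clip features; txt: (B, L, txt_dim) tokens.
        v = self.vis_proj(vis)                        # (B, T, dim)
        t = self.txt_proj(txt)                        # (B, L, dim)
        q = t.mean(dim=1, keepdim=True)               # global query feature

        # Refinement: gate each clip by its agreement with the query.
        g = torch.sigmoid(self.gate(torch.cat([v, q.expand_as(v)], -1)))
        v = v * g

        # HD: per-clip highlight scores.
        hd = self.hd_head(v).squeeze(-1)              # (B, T)

        # Task cooperation: highlight scores re-weight clips before MR,
        # so the HD output directly shapes the MR input.
        w = F.softmax(hd, dim=-1).unsqueeze(-1)       # (B, T, 1)
        pooled = (w * v).sum(dim=1)                   # (B, dim)
        span = torch.sigmoid(self.mr_head(pooled))    # (B, 2)
        return span, hd


if __name__ == "__main__":
    model = TRDETRSketch()
    span, hd = model(torch.randn(2, 64, 512), torch.randn(2, 12, 512))
    print(span.shape, hd.shape)  # torch.Size([2, 2]) torch.Size([2, 64])
```

The last stage is the point of the sketch: the HD output feeds the MR branch, which is the task reciprocity the abstract emphasizes.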
Related papers
- Task-Driven Exploration: Decoupling and Inter-Task Feedback for Joint Moment Retrieval and Highlight Detection [7.864892339833315]
We propose a novel task-driven top-down framework for joint moment retrieval and highlight detection.
The framework introduces a task-decoupled unit to capture task-specific and common representations.
Comprehensive experiments and in-depth ablation studies on QVHighlights, TVSum, and Charades-STA datasets corroborate the effectiveness and flexibility of the proposed framework.
arXiv Detail & Related papers (2024-04-14T14:06:42Z)
- Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement [72.7576395034068]
Video Corpus Moment Retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a text query.
We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task.
For video retrieval, we introduce a multi-modal collaborative video retriever, generating different query representations for the two modalities.
For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content (a gating sketch follows this entry).
arXiv Detail & Related papers (2024-02-21T07:16:06Z)
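The VCMR entry above mentions modality-specific gates inside a focus-then-fuse localizer. Below is a minimal sketch of what such gating could look like for two clip-level streams (say, visual frames and subtitle features); the two-stream setup, the class name, and the gate form are all assumptions, not the paper's design.

```python
import torch
import torch.nn as nn


class FocusThenFuseSketch(nn.Module):
    """Assumed form: a query-conditioned gate per modality decides how
    much each stream contributes at every clip before fusion."""

    def __init__(self, dim=256):
        super().__init__()
        self.gate_v = nn.Linear(2 * dim, 1)  # gate for the visual stream
        self.gate_s = nn.Linear(2 * dim, 1)  # gate for the subtitle stream

    def forward(self, vis, sub, query):
        # vis, sub: (B, T, dim) per-clip streams; query: (B, dim).
        q = query.unsqueeze(1).expand_as(vis)
        gv = torch.sigmoid(self.gate_v(torch.cat([vis, q], dim=-1)))  # focus
        gs = torch.sigmoid(self.gate_s(torch.cat([sub, q], dim=-1)))
        return gv * vis + gs * sub                                    # fuse
```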
- MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer [17.29632719667594]
We propose MH-DETR (Moment and Highlight Detection Transformer), tailored for video moment and highlight detection (MHD).
We introduce a simple yet efficient pooling operator within the uni-modal encoder to capture global intra-modal context (sketched after this entry).
To obtain temporally aligned cross-modal features, we design a plug-and-play cross-modal interaction module between the encoder and decoder.
arXiv Detail & Related papers (2023-04-29T22:50:53Z)
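The MH-DETR entry above credits a simple pooling operator in the uni-modal encoder with capturing global intra-modal context. One plausible form, assumed here for illustration, is to mean-pool the token sequence and mix the pooled vector back into every token:

```python
import torch
import torch.nn as nn


class GlobalPoolingMixer(nn.Module):
    """Assumed sketch: broadcast a pooled global summary back to each
    token so every position sees intra-modal context."""

    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (B, T, dim)
        g = x.mean(dim=1, keepdim=True)        # global intra-modal context
        return self.norm(x + self.proj(g))     # broadcast add per token
```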
- HyRSM++: Hybrid Relation Guided Temporal Set Matching for Few-shot Action Recognition [51.2715005161475]
We propose a novel Hybrid Relation guided temporal Set Matching approach for few-shot action recognition.
The core idea of HyRSM++ is to integrate all videos within the task to learn discriminative representations.
We show that our method achieves state-of-the-art performance under various few-shot settings.
arXiv Detail & Related papers (2023-01-09T13:32:50Z)
- PSNet: Parallel Symmetric Network for Video Salient Object Detection [85.94443548452729]
We propose a VSOD network with an up-down parallel symmetric structure, named PSNet.
Two parallel branches with different dominant modalities are set to achieve complete video saliency decoding.
arXiv Detail & Related papers (2022-10-12T04:11:48Z)
- Exploring Motion and Appearance Information for Temporal Sentence Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations (a two-branch sketch follows this entry).
Our proposed MARN outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z)
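The MARN entry above describes separate motion and appearance branches. Here is a toy two-branch sketch over clip-level features; the recurrent branch encoders and the late fusion are assumptions, not MARN's actual relation-reasoning design.

```python
import torch
import torch.nn as nn


class TwoBranchSketch(nn.Module):
    """Assumed design: one branch encodes motion features, the other
    appearance features, and their outputs are fused for grounding."""

    def __init__(self, dim=256):
        super().__init__()
        self.motion_branch = nn.GRU(dim, dim, batch_first=True)
        self.appearance_branch = nn.GRU(dim, dim, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, motion, appearance):
        # motion, appearance: (B, T, dim) per-clip features.
        m, _ = self.motion_branch(motion)
        a, _ = self.appearance_branch(appearance)
        return self.fuse(torch.cat([m, a], dim=-1))  # (B, T, dim)
```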
- MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection [37.25262046781015]
Action detection is an essential and challenging task, especially for densely labelled datasets of untrimmed videos.
We propose a novel ConvTransformer network for action detection that efficiently captures both short-term and long-term temporal information (a multi-scale sketch follows this entry).
Our network outperforms the state-of-the-art methods on all three datasets.
arXiv Detail & Related papers (2021-12-07T18:57:37Z)
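The MS-TCT entry above claims short- and long-term temporal coverage in one network. A common way to get that with temporal convolutions, assumed here as a stand-in for the actual ConvTransformer blocks, is a bank of parallel dilated convolutions:

```python
import torch
import torch.nn as nn


class MultiScaleTemporalConv(nn.Module):
    """Assumed sketch: parallel temporal convolutions with growing
    dilation see progressively longer contexts; a 1x1 conv fuses them."""

    def __init__(self, dim=256, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )
        self.fuse = nn.Conv1d(len(dilations) * dim, dim, kernel_size=1)

    def forward(self, x):            # x: (B, dim, T) clip features
        y = torch.cat([b(x) for b in self.branches], dim=1)
        return self.fuse(y) + x      # residual fusion across scales
```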
- Improving Video Instance Segmentation via Temporal Pyramid Routing [61.10753640148878]
Video Instance Segmentation (VIS) is a new and inherently multi-task problem, which aims to detect, segment and track each instance in a video sequence.
We propose a Temporal Pyramid Routing (TPR) strategy to conditionally align and conduct pixel-level aggregation from a feature pyramid pair of two adjacent frames (a sketch follows this entry).
Our approach is a plug-and-play module and can be easily applied to existing instance segmentation methods.
arXiv Detail & Related papers (2021-07-28T03:57:12Z)
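The TPR entry above describes pixel-level aggregation over a feature-pyramid pair from adjacent frames. A minimal stand-in, assumed for illustration, is a learned gate per pyramid level that blends the two frames' maps (the paper's routing is conditional and more sophisticated):

```python
import torch
import torch.nn as nn


class PyramidPairAggregation(nn.Module):
    """Assumed sketch: per-level gates blend feature maps of the current
    frame with those of the previous frame, pixel by pixel."""

    def __init__(self, channels=(256, 256, 256)):
        super().__init__()
        self.gates = nn.ModuleList(
            nn.Conv2d(2 * c, 1, kernel_size=1) for c in channels
        )

    def forward(self, pyr_t, pyr_prev):
        # pyr_t, pyr_prev: lists of (B, C, H, W) maps, coarse to fine.
        out = []
        for gate, f_t, f_p in zip(self.gates, pyr_t, pyr_prev):
            a = torch.sigmoid(gate(torch.cat([f_t, f_p], dim=1)))
            out.append(a * f_t + (1 - a) * f_p)   # pixel-level aggregation
        return out
```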
- One Network to Solve Them All: A Sequential Multi-Task Joint Learning Network Framework for MR Imaging Pipeline [12.684219884940056]
A sequential multi-task joint learning network model is proposed to train a combined end-to-end pipeline.
The proposed framework is verified on the MRB dataset, where it achieves superior performance over other SOTA methods in terms of both reconstruction and segmentation.
arXiv Detail & Related papers (2021-05-14T05:55:27Z)
- MuCAN: Multi-Correspondence Aggregation Network for Video Super-Resolution [63.02785017714131]
Video super-resolution (VSR) aims to utilize multiple low-resolution frames to generate a high-resolution prediction for each frame.
Inter- and intra-frame correspondences are the key sources of temporal and spatial information.
We build an effective multi-correspondence aggregation network (MuCAN) for VSR.
arXiv Detail & Related papers (2020-07-23T05:41:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.