Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection
- URL: http://arxiv.org/abs/2311.16464v1
- Date: Tue, 28 Nov 2023 03:55:23 GMT
- Title: Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection
- Authors: Yicheng Xiao, Zhuoyan Luo, Yong Liu, Yue Ma, Hengwei Bian, Yatai Ji, Yujiu Yang, Xiu Li
- Abstract summary: Video Moment Retrieval (MR) and Highlight Detection (HD) have attracted significant attention due to the growing demand for video analysis.
Recent approaches treat MR and HD as similar video grounding problems and address them jointly with a transformer-based architecture.
We propose a Unified Video COMprehension framework (UVCOM) to bridge the gap and jointly solve MR and HD effectively.
- Score: 45.82453232979516
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Moment Retrieval (MR) and Highlight Detection (HD) have attracted significant attention due to the growing demand for video analysis. Recent approaches treat MR and HD as similar video grounding problems and address them jointly with a transformer-based architecture. However, we observe that the emphasis of MR and HD differs: one necessitates the perception of local relationships, while the other prioritizes the understanding of global contexts. Consequently, the lack of task-specific design inevitably limits the ability to capture the intrinsic specialties of the two tasks. To tackle this issue, we propose a Unified Video COMprehension framework (UVCOM) to bridge the gap and jointly solve MR and HD effectively. By performing progressive integration on intra- and inter-modality across multiple granularities, UVCOM achieves a comprehensive understanding of a video. Moreover, we present multi-aspect contrastive learning to consolidate local relation modeling and global knowledge accumulation via a well-aligned multi-modal space. Extensive experiments on the QVHighlights, Charades-STA, TACoS, YouTube Highlights and TVSum datasets demonstrate the effectiveness and rationality of UVCOM, which outperforms state-of-the-art methods by a remarkable margin.
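A minimal sketch may help make the joint formulation concrete. The code below is an illustrative assumption, not the authors' released UVCOM implementation: it shows how one shared cross-modal encoder can feed both a span head for MR (local moment boundaries) and a per-clip saliency head for HD (global highlight scores), with a simple video-text InfoNCE loss standing in for one aspect of the multi-aspect contrastive objective. All module names, dimensions, and pooling choices are hypothetical.

```python
# Minimal sketch, assuming PyTorch. NOT the authors' UVCOM code: it only
# illustrates the joint MR/HD formulation described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class UnifiedMRHDSketch(nn.Module):
    def __init__(self, dim: int = 256, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.span_head = nn.Linear(dim, 2)      # per-clip (center, width) for MR
        self.saliency_head = nn.Linear(dim, 1)  # per-clip highlight score for HD

    def forward(self, video_feats, text_feats):
        # video_feats: (B, T, D) clip features; text_feats: (B, L, D) query tokens.
        fused = self.encoder(torch.cat([video_feats, text_feats], dim=1))
        clip_states = fused[:, : video_feats.size(1)]            # video positions only
        spans = self.span_head(clip_states)                      # (B, T, 2): local MR
        saliency = self.saliency_head(clip_states).squeeze(-1)   # (B, T): global HD
        return spans, saliency, clip_states


def video_text_nce(clip_states, text_global, temperature: float = 0.07):
    """InfoNCE between pooled video and text features (one contrastive aspect)."""
    v = F.normalize(clip_states.mean(dim=1), dim=-1)    # (B, D) pooled video
    t = F.normalize(text_global, dim=-1)                # (B, D) pooled query
    logits = v @ t.T / temperature                      # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)  # matched pairs on diagonal
    return F.cross_entropy(logits, targets)


# Toy usage: 4 videos of 75 clips each, 20-token queries, 256-d features.
model = UnifiedMRHDSketch()
video, text = torch.randn(4, 75, 256), torch.randn(4, 20, 256)
spans, saliency, clips = model(video, text)
loss = video_text_nce(clips, text.mean(dim=1))
```

The division of labor mirrors the paper's framing: the span head depends on local inter-clip relationships, while the saliency head and the pooled contrastive term lean on global, query-conditioned context. UVCOM's actual progressive intra-/inter-modality integration is considerably richer than this skeleton.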
Related papers
- Dual-Hybrid Attention Network for Specular Highlight Removal [34.99543751199565]
Specular highlight removal plays a pivotal role in multimedia applications, as it enhances the quality and interpretability of images and videos.
Current state-of-the-art approaches often rely on additional priors or supervision, limiting their practicality and generalization capability.
We propose the Dual-Hybrid Attention Network for Specular Highlight Removal (DHAN-SHR), an end-to-end network that introduces novel hybrid attention mechanisms.
arXiv Detail & Related papers (2024-07-17T01:52:41Z)
- REACT: Recognize Every Action Everywhere All At Once [8.10024991952397]
Group Activity Recognition (GAR) is a fundamental problem in computer vision, with diverse applications in sports analysis, surveillance, and social scene understanding.
We present REACT, an architecture inspired by the transformer encoder-decoder model.
Our method outperforms state-of-the-art GAR approaches in extensive experiments, demonstrating superior accuracy in recognizing and understanding group activities.
arXiv Detail & Related papers (2023-11-27T20:48:54Z)
- Local-Global Associative Frame Assemble in Video Re-ID [57.7470971197962]
Noisy and unrepresentative frames in automatically generated object bounding boxes from video sequences cause challenges in learning discriminative representations for video re-identification (Re-ID).
Most existing methods tackle this problem by assessing the importance of video frames according to either their local part alignments or global appearance correlations separately.
In this work, we explore jointly both local alignments and global correlations with further consideration of their mutual promotion/reinforcement.
arXiv Detail & Related papers (2021-10-22T19:07:39Z)
- Multi-Granularity Network with Modal Attention for Dense Affective Understanding [11.076925361793556]
The recent EEV challenge proposes a dense affective understanding task that requires frame-level affective prediction.
We propose a multi-granularity network with modal attention (MGN-MA), which employs multi-granularity features to better describe the target frame.
The proposed method achieves a correlation score of 0.02292 in the EEV challenge.
arXiv Detail & Related papers (2021-06-18T07:37:06Z)
- An Efficient Recurrent Adversarial Framework for Unsupervised Real-Time Video Enhancement [132.60976158877608]
We propose an efficient adversarial video enhancement framework that learns directly from unpaired video examples.
In particular, our framework introduces new recurrent cells that consist of interleaved local and global modules for implicit integration of spatial and temporal information.
The proposed design allows our recurrent cells to efficiently propagate temporal information across frames and reduces the need for high-complexity networks.
arXiv Detail & Related papers (2020-12-24T00:03:29Z)
- Exploring global diverse attention via pairwise temporal relation for video summarization [84.28263235895798]
We propose an efficient convolutional neural network architecture for video SUMmarization via Global Diverse Attention.
The proposed models can be run in parallel with significantly less computational costs.
arXiv Detail & Related papers (2020-09-23T06:29:09Z)
- Transforming Multi-Concept Attention into Video Summarization [36.85535624026879]
We propose a novel attention-based framework for video summarization with complex video data.
Our model can be applied to both labeled and unlabeled data, making our method preferable for real-world applications.
arXiv Detail & Related papers (2020-06-02T06:23:50Z)
- Multi-Granularity Reference-Aided Attentive Feature Aggregation for Video-based Person Re-identification [98.7585431239291]
Video-based person re-identification aims at matching the same person across video clips.
In this paper, we propose an attentive feature aggregation module, namely the Multi-Granularity Reference-aided Attentive Feature Aggregation module (MG-RAFA).
Our framework achieves state-of-the-art performance on three benchmark datasets.
arXiv Detail & Related papers (2020-03-27T03:49:21Z)
- See More, Know More: Unsupervised Video Object Segmentation with Co-Attention Siamese Networks [184.4379622593225]
We introduce a novel network, called CO-attention Siamese Network (COSNet), to address the unsupervised video object segmentation task.
We emphasize the importance of inherent correlation among video frames and incorporate a global co-attention mechanism.
We propose a unified and end-to-end trainable framework where different co-attention variants can be derived for mining the rich context within videos.
arXiv Detail & Related papers (2020-01-19T11:10:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.