CFSum: A Transformer-Based Multi-Modal Video Summarization Framework With Coarse-Fine Fusion
- URL: http://arxiv.org/abs/2503.00364v1
- Date: Sat, 01 Mar 2025 06:13:13 GMT
- Title: CFSum: A Transformer-Based Multi-Modal Video Summarization Framework With Coarse-Fine Fusion
- Authors: Yaowei Guo, Jiazheng Xing, Xiaojun Hou, Shuo Xin, Juntao Jiang, Demetri Terzopoulos, Chenfanfu Jiang, Yong Liu,
- Abstract summary: CFSum is a transformer-based multi-modal video summarization framework with coarse-fine fusion.<n> CFSum exploits video, text, and audio modal features as input, and incorporates a two-stage transformer-based feature fusion framework.
- Score: 22.58710742780161
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video summarization, by selecting the most informative and/or user-relevant parts of original videos to create concise summary videos, has high research value and consumer demand in today's video proliferation era. Multi-modal video summarization that accomodates user input has become a research hotspot. However, current multi-modal video summarization methods suffer from two limitations. First, existing methods inadequately fuse information from different modalities and cannot effectively utilize modality-unique features. Second, most multi-modal methods focus on video and text modalities, neglecting the audio modality, despite the fact that audio information can be very useful in certain types of videos. In this paper we propose CFSum, a transformer-based multi-modal video summarization framework with coarse-fine fusion. CFSum exploits video, text, and audio modal features as input, and incorporates a two-stage transformer-based feature fusion framework to fully utilize modality-unique information. In the first stage, multi-modal features are fused simultaneously to perform initial coarse-grained feature fusion, then, in the second stage, video and audio features are explicitly attended with the text representation yielding more fine-grained information interaction. The CFSum architecture gives equal importance to each modality, ensuring that each modal feature interacts deeply with the other modalities. Our extensive comparative experiments against prior methods and ablation studies on various datasets confirm the effectiveness and superiority of CFSum.
Related papers
- VideoFusion: A Spatio-Temporal Collaborative Network for Mutli-modal Video Fusion and Restoration [26.59510171451438]
Existing multi-sensor fusion research predominantly integrates complementary from multiple images rather than videos.
VideoFusion exploits cross-modal complementarity and temporal dynamics to generate context-temporally coherent videos.
Extensive experiments reveal that VideoFusion outperforms existing image-oriented fusion paradigms in sequential scenarios.
arXiv Detail & Related papers (2025-03-30T08:27:18Z) - VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z) - Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities [67.89368528234394]
One of the main challenges of multimodal learning is the need to combine heterogeneous modalities.
Video and audio are obtained at much higher rates than text and are roughly aligned in time.
Our approach achieves the state-of-the-art on well established multimodal benchmarks, outperforming much larger models.
arXiv Detail & Related papers (2023-11-09T19:15:12Z) - Causal Video Summarizer for Video Exploration [74.27487067877047]
Causal Video Summarizer (CVS) is proposed to capture the interactive information between the video and query.
Based on the evaluation of the existing multi-modal video summarization dataset, experimental results show that the proposed approach is effective.
arXiv Detail & Related papers (2023-07-04T22:52:16Z) - M2HF: Multi-level Multi-modal Hybrid Fusion for Text-Video Retrieval [34.343617836027725]
We propose a multi-level multi-modal hybrid fusion network to explore comprehensive interactions between text queries and each modality content in videos.
Our framework provides two kinds of training strategies, including an ensemble manner and an end-to-end manner.
arXiv Detail & Related papers (2022-08-16T10:51:37Z) - Multimodal Frame-Scoring Transformer for Video Summarization [4.266320191208304]
Multimodal Frame-Scoring Transformer (MFST) framework exploiting visual, text and audio features and scoring a video with respect to frames.
MFST framework first extracts each modality features (visual-text-audio) using pretrained encoders.
MFST trains the multimodal frame-scoring transformer that uses video-text-audio representations as inputs and predicts frame-level scores.
arXiv Detail & Related papers (2022-07-05T05:14:15Z) - Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval [36.50847375135979]
Multi-modal learning from video data has seen increased attention recently as it allows to train semantically meaningful embeddings without human annotation.
We present a multi-modal, modality fusion transformer approach that learns to exchange information between multiple modalities, such as video, audio, and text, and integrate them into a joined multi-modal representation.
arXiv Detail & Related papers (2021-12-08T18:14:57Z) - Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer based architecture that uses fusion' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
arXiv Detail & Related papers (2021-06-30T22:44:12Z) - AudioVisual Video Summarization [103.47766795086206]
In video summarization, existing approaches just exploit the visual information while neglecting the audio information.
We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
arXiv Detail & Related papers (2021-05-17T08:36:10Z) - GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video
Summarization [18.543372365239673]
The proposed model consists of a contextualized video summary controller, multi-modal attention mechanisms, an interactive attention network, and a video summary generator.
Results show that the proposed model is effective with the increase of +5.88% in accuracy and +4.06% increase of F1-score, compared with the state-of-the-art method.
arXiv Detail & Related papers (2021-04-26T10:50:37Z) - Multi-modal Transformer for Video Retrieval [67.86763073161012]
We present a multi-modal transformer to jointly encode the different modalities in video.
On the natural language side, we investigate the best practices to jointly optimize the language embedding together with the multi-modal transformer.
This novel framework allows us to establish state-of-the-art results for video retrieval on three datasets.
arXiv Detail & Related papers (2020-07-21T07:38:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.