TripleSumm: Adaptive Triple-Modality Fusion for Video Summarization
- URL: http://arxiv.org/abs/2603.01169v1
- Date: Sun, 01 Mar 2026 16:18:59 GMT
- Title: TripleSumm: Adaptive Triple-Modality Fusion for Video Summarization
- Authors: Sumin Kim, Hyemin Jeong, Mingu Kang, Yejin Kim, Yoori Oh, Joonseok Lee
- Abstract summary: TripleSumm is a novel architecture that adaptively weights and fuses the contributions of visual, text, and audio modalities at the frame level. It achieves state-of-the-art performance, outperforming existing methods by a significant margin on four benchmarks, including MoSu.
- Score: 19.94299183056601
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The exponential growth of video content necessitates effective video summarization to efficiently extract key information from long videos. However, current approaches struggle to fully comprehend complex videos, primarily because they employ static or modality-agnostic fusion strategies. These methods fail to account for the dynamic, frame-dependent variations in modality saliency inherent in video data. To overcome these limitations, we propose TripleSumm, a novel architecture that adaptively weights and fuses the contributions of visual, text, and audio modalities at the frame level. Furthermore, a significant bottleneck for research into multimodal video summarization has been the lack of comprehensive benchmarks. Addressing this bottleneck, we introduce MoSu (Most Replayed Multimodal Video Summarization), the first large-scale benchmark that provides all three modalities. Extensive experiments demonstrate that TripleSumm achieves state-of-the-art performance, outperforming existing methods by a significant margin on four benchmarks, including MoSu. Our code and dataset are available at https://github.com/smkim37/TripleSumm.
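The abstract describes frame-level adaptive weighting of three modalities but does not spell out the mechanism. As a rough illustration only, the sketch below scores each modality per frame and normalizes the scores with a softmax, so the weights can differ from frame to frame. The softmax gate, the `gate` vector, and the function names are assumptions made for this example; they are not taken from the TripleSumm code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_frame(visual, text, audio, gate):
    """Fuse one frame's modality features using frame-specific weights.

    `gate` is a hypothetical scoring vector (it would be learned in practice).
    Each modality's score is its dot product with `gate`.
    """
    feats = [visual, text, audio]
    scores = [sum(g * x for g, x in zip(gate, f)) for f in feats]
    weights = softmax(scores)
    dim = len(visual)
    # Weighted sum of the three feature vectors, dimension by dimension.
    fused = [sum(w * f[i] for w, f in zip(weights, feats)) for i in range(dim)]
    return fused, weights


def fuse_video(frames, gate):
    """Apply the per-frame fusion to a list of (visual, text, audio) tuples."""
    return [fuse_frame(v, t, a, gate) for v, t, a in frames]


# Toy input: two frames, 2-dimensional features per modality.
frames = [
    ([1.0, 0.0], [0.0, 1.0], [0.0, 0.0]),  # visual and text carry signal
    ([0.0, 0.0], [0.0, 0.0], [2.0, 2.0]),  # audio dominates
]
gate = [1.0, 1.0]
results = fuse_video(frames, gate)
```

In this toy input the first frame splits most of its weight between visual and text, while the second frame puts most of its weight on audio. A static fusion scheme would instead apply the same weights to every frame.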
Related papers
- UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video Super-Resolution [62.10676832966289]
Cascaded video super-resolution has emerged as a promising technique for generating high-resolution videos using large foundation models. We present UniMMVSR, the first unified generative video super-resolution framework to incorporate hybrid-modal conditions, including text, images, and videos. Our experiments demonstrate that UniMMVSR significantly outperforms existing methods, producing videos with superior detail and a higher degree of conformity to multi-modal conditions.
arXiv Detail & Related papers (2025-10-09T12:25:16Z)
- TriPSS: A Tri-Modal Keyframe Extraction Framework Using Perceptual, Structural, and Semantic Representations [0.31224081969539713]
TriPSS is a tri-modal framework that integrates perceptual features from the CIELAB color space, structural embeddings from ResNet-50, and semantic context from frame-level captions. TriPSS achieves state-of-the-art performance, significantly outperforming both unimodal and prior multimodal approaches.
arXiv Detail & Related papers (2025-06-03T19:44:49Z)
- SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models [93.73583158211115]
Achieving fine-grained temporal understanding in videos remains a major challenge for current Video Large Multimodal Models (Video LMMs). We contribute in three core aspects: dataset, model, and benchmark. First, we introduce SAMA-239K, a large-scale dataset comprising 15K videos specifically to enable joint learning of video understanding, grounding, and multi-turn video chat. Second, we propose the SAMA model, which incorporates a versatile temporal context aggregator and a Segment Model to jointly enhance fine-grained video comprehension and precise grounding capabilities.
arXiv Detail & Related papers (2025-05-24T18:13:16Z)
- CFSum: A Transformer-Based Multi-Modal Video Summarization Framework With Coarse-Fine Fusion [22.58710742780161]
CFSum is a transformer-based multi-modal video summarization framework with coarse-fine fusion. CFSum exploits video, text, and audio modal features as input, and incorporates a two-stage transformer-based feature fusion framework.
arXiv Detail & Related papers (2025-03-01T06:13:13Z)
- Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding [61.89781979702939]
This study quantitatively reveals an "impossible trinity" among data quantity, diversity, and quality in pre-training datasets.
Recent efforts seek to refine large-scale, diverse ASR datasets compromised by low quality through synthetic annotations.
We introduce the Video DataFlywheel framework, which iteratively refines video annotations with improved noise control methods.
arXiv Detail & Related papers (2024-09-29T03:33:35Z)
- UBiSS: A Unified Framework for Bimodal Semantic Summarization of Videos [52.161513027831646]
We focus on a more comprehensive video summarization task named Bimodal Semantic Summarization of Videos (BiSSV).
We propose a unified framework, UBiSS, for the BiSSV task, which models the saliency information in the video and generates a TM-summary and a VM-summary simultaneously.
Experiments show that our unified framework achieves better performance than multi-stage summarization pipelines.
arXiv Detail & Related papers (2024-06-24T03:55:25Z)
- V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning [65.30592368928723]
Video summarization aims to create short, accurate, and cohesive summaries of longer videos. Most existing datasets are created for video-to-video summarization. Recent efforts have been made to expand from unimodal to multimodal video summarization.
arXiv Detail & Related papers (2024-04-18T17:32:46Z)
- Causal Video Summarizer for Video Exploration [74.27487067877047]
Causal Video Summarizer (CVS) is proposed to capture the interactive information between the video and query.
Based on the evaluation of the existing multi-modal video summarization dataset, experimental results show that the proposed approach is effective.
arXiv Detail & Related papers (2023-07-04T22:52:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.