Related papers: TriPSS: A Tri-Modal Keyframe Extraction Framework Using Perceptual, Structural, and Semantic Representations

TriPSS: A Tri-Modal Keyframe Extraction Framework Using Perceptual, Structural, and Semantic Representations

URL: http://arxiv.org/abs/2506.05395v2
Date: Tue, 02 Sep 2025 17:50:58 GMT
Title: TriPSS: A Tri-Modal Keyframe Extraction Framework Using Perceptual, Structural, and Semantic Representations
Authors: Mert Can Cakmak, Nitin Agarwal, Diwash Poudel,
Abstract summary: TriPSS is a tri-modal framework that integrates perceptual features from the CIELAB color space, structural embeddings from ResNet-50, and semantic context from frame-level captions.<n>TriPSS achieves state-of-the-art performance, significantly outperforming both unimodal and prior multimodal approaches.
Score: 0.31224081969539713
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Efficient keyframe extraction is critical for video summarization and retrieval, yet capturing the full semantic and visual richness of video content remains challenging. We introduce TriPSS, a tri-modal framework that integrates perceptual features from the CIELAB color space, structural embeddings from ResNet-50, and semantic context from frame-level captions generated by LLaMA-3.2-11B-Vision-Instruct. These modalities are fused using principal component analysis to form compact multi-modal embeddings, enabling adaptive video segmentation via HDBSCAN clustering. A refinement stage incorporating quality assessment and duplicate filtering ensures the final keyframe set is both concise and semantically diverse. Evaluations on the TVSum20 and SumMe benchmarks show that TriPSS achieves state-of-the-art performance, significantly outperforming both unimodal and prior multimodal approaches. These results highlight TriPSS' ability to capture complementary visual and semantic cues, establishing it as an effective solution for video summarization, retrieval, and large-scale multimedia understanding.

Related papers

Less is More: Label-Guided Summarization of Procedural and Instructional Videos [21.13311741987469]
We propose a three-stage framework, PRISM: Procedural Representation via Integrated Semantic and Multimodal analysis.<n>We analyze adaptive visual sampling, label-driven anchoring, and contextual validation using a large language model (LLM)<n>Our approach generalizes across procedural and domain-specific video tasks, achieving strong performance with both semantic alignment and precision.
arXiv Detail & Related papers (2026-01-18T03:41:48Z)
From Captions to Keyframes: KeyScore for Multimodal Frame Scoring and Video-Language Understanding [1.3856027745141806]
KeyScore is a caption-aware frame scoring method that combines semantic similarity to captions, temporal representativeness, and contextual drop impact.<n>Our method enhances both efficiency and performance, enabling scalable and caption-grounded video understanding.
arXiv Detail & Related papers (2025-10-07T23:02:27Z)
FrameMind: Frame-Interleaved Video Reasoning via Reinforcement Learning [65.42201665046505]
Current video understanding models rely on fixed frame sampling strategies, processing predetermined visual inputs regardless of the specific reasoning requirements of each question.<n>This static approach limits their ability to adaptively gather visual evidence, leading to suboptimal performance on tasks that require broad temporal coverage or fine-grained spatial detail.<n>We introduce FrameMind, an end-to-end framework trained with reinforcement learning that enables models to dynamically request visual information during reasoning through Frame-Interleaved Chain-of-Thought (FiCOT)<n>Unlike traditional approaches, FrameMind operates in multiple turns where the model alternates between textual reasoning and active visual perception, using tools to extract
arXiv Detail & Related papers (2025-09-28T17:59:43Z)
The Devil is in Temporal Token: High Quality Video Reasoning Segmentation [68.33080352141653]
Methods for Video Reasoning rely heavily on a single special token to represent the object in the video.<n>We propose VRS-HQ, an end-to-end video reasoning segmentation approach.<n>Our results highlight the strong temporal reasoning and segmentation capabilities of our method.
arXiv Detail & Related papers (2025-01-15T03:17:24Z)
Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning [56.873534081386]
A new topic HIREST is presented, including video retrieval, moment retrieval, moment segmentation, and step-captioning.<n>We propose a query-centric audio-visual cognition network to construct a reliable multi-modal representation for three tasks.<n>This can cognize user-preferred content and thus attain a query-centric audio-visual representation for three tasks.
arXiv Detail & Related papers (2024-12-18T06:43:06Z)
M3-CVC: Controllable Video Compression with Multimodal Generative Models [17.49397141459785]
M3-CVC is a controllable video compression framework incorporating generative models.<n>We show that M3-CVC significantly outperforms the state-the-art VVC standard in ultralow scenarios.
arXiv Detail & Related papers (2024-11-24T11:56:59Z)
UBiSS: A Unified Framework for Bimodal Semantic Summarization of Videos [52.161513027831646]
We focus on a more comprehensive video summarization task named Bimodal Semantic Summarization of Videos (BiSSV) We propose a Unified framework UBiSS for the BiSSV task, which models the saliency information in the video and generates a TM-summary and VM-summary simultaneously. Experiments show that our unified framework achieves better performance than multi-stage summarization pipelines.
arXiv Detail & Related papers (2024-06-24T03:55:25Z)
Low-Light Video Enhancement via Spatial-Temporal Consistent Decomposition [52.89441679581216]
Low-Light Video Enhancement (LLVE) seeks to restore dynamic or static scenes plagued by severe invisibility and noise.<n>We present an innovative video decomposition strategy that incorporates view-independent and view-dependent components.<n>Our framework consistently outperforms existing methods, establishing a new SOTA performance.
arXiv Detail & Related papers (2024-05-24T15:56:40Z)
DVIS++: Improved Decoupled Framework for Universal Video Segmentation [30.703276476607545]
We present OV-DVIS++, the first open-vocabulary universal video segmentation framework. By integrating CLIP with DVIS++, we present OV-DVIS++, the first open-vocabulary universal video segmentation framework.
arXiv Detail & Related papers (2023-12-20T03:01:33Z)
Joint Depth Prediction and Semantic Segmentation with Multi-View SAM [59.99496827912684]
We propose a Multi-View Stereo (MVS) technique for depth prediction that benefits from rich semantic features of the Segment Anything Model (SAM) This enhanced depth prediction, in turn, serves as a prompt to our Transformer-based semantic segmentation decoder.
arXiv Detail & Related papers (2023-10-31T20:15:40Z)
You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query. Previous respectable works have made decent success, but they only focus on high-level visual features extracted from decoded frames. We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization [61.69587867308656]
We propose a multimodal hierarchical shot-aware convolutional network, denoted as MHSCNet, to enhance the frame-wise representation. Based on the learned shot-aware representations, MHSCNet can predict the frame-level importance score in the local and global view of the video.
arXiv Detail & Related papers (2022-04-18T14:53:33Z)
Condensing a Sequence to One Informative Frame for Video Recognition [113.3056598548736]
This paper studies a two-step alternative that first condenses the video sequence to an informative "frame" A valid question is how to define "useful information" and then distill from a sequence down to one synthetic frame. IFS consistently demonstrates evident improvements on image-based 2D networks and clip-based 3D networks.
arXiv Detail & Related papers (2022-01-11T16:13:43Z)
DeepQAMVS: Query-Aware Hierarchical Pointer Networks for Multi-Video Summarization [127.16984421969529]
We introduce a novel Query-Aware Hierarchical Pointer Network for Multi-Video Summarization, termed DeepQAMVS. DeepQAMVS is trained with reinforcement learning, incorporating rewards that capture representativeness, diversity, query-adaptability and temporal coherence. We achieve state-of-the-art results on the MVS1K dataset, with inference time scaling linearly with the number of input video frames.
arXiv Detail & Related papers (2021-05-13T17:33:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.