Related papers: Less is More: Label-Guided Summarization of Procedural and Instructional Videos

Less is More: Label-Guided Summarization of Procedural and Instructional Videos

URL: http://arxiv.org/abs/2601.12243v1
Date: Sun, 18 Jan 2026 03:41:48 GMT
Title: Less is More: Label-Guided Summarization of Procedural and Instructional Videos
Authors: Shreya Rajpal, Michal Golovanesky, Carsten Eickhoff,
Abstract summary: We propose a three-stage framework, PRISM: Procedural Representation via Integrated Semantic and Multimodal analysis.<n>We analyze adaptive visual sampling, label-driven anchoring, and contextual validation using a large language model (LLM)<n>Our approach generalizes across procedural and domain-specific video tasks, achieving strong performance with both semantic alignment and precision.
Score: 21.13311741987469
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video summarization helps turn long videos into clear, concise representations that are easier to review, document, and analyze, especially in high-stakes domains like surgical training. Prior work has progressed from using basic visual features like color, motion, and structural changes to using pre-trained vision-language models that can better understand what's happening in the video (semantics) and capture temporal flow, resulting in more context-aware video summarization. We propose a three-stage framework, PRISM: Procedural Representation via Integrated Semantic and Multimodal analysis, that produces semantically grounded video summaries. PRISM combines adaptive visual sampling, label-driven keyframe anchoring, and contextual validation using a large language model (LLM). Our method ensures that selected frames reflect meaningful and procedural transitions while filtering out generic or hallucinated content, resulting in contextually coherent summaries across both domain-specific and instructional videos. We evaluate our method on instructional and activity datasets, using reference summaries for instructional videos. Despite sampling fewer than 5% of the original frames, our summaries retain 84% semantic content while improving over baselines by as much as 33%. Our approach generalizes across procedural and domain-specific video tasks, achieving strong performance with both semantic alignment and precision.

Related papers

HierSum: A Global and Local Attention Mechanism for Video Summarization [14.88934924520362]
We focus on summarizing instructional videos and propose a method for breaking down a video into meaningful segments.<n>HierSum integrates fine-grained local cues from subtitles with global contextual information provided by video-level instructions.<n>We show that HierSum consistently outperforms existing methods in key metrics such as F1-score and rank correlation.
arXiv Detail & Related papers (2025-04-25T20:30:30Z)
Video Summarization with Large Language Models [41.51242348081083]
We propose a new video summarization framework that leverages the capabilities of recent Large Language Models (LLMs)<n>Our method, dubbed LLM-based Video Summarization (LLMVS), translates video frames into a sequence of captions using a Muti-modal Large Language Model (MLLM)<n>Our experimental results demonstrate the superiority of the proposed method over existing ones in standard benchmarks.
arXiv Detail & Related papers (2025-04-15T13:56:14Z)
Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment [53.12952107996463]
This work proposes a novel training framework for learning to localize temporal boundaries of procedure steps in training videos. Motivated by the strong capabilities of Large Language Models (LLMs) in procedure understanding and text summarization, we first apply an LLM to filter out task-irrelevant information and summarize task-related procedure steps from narrations. To further generate reliable pseudo-matching between the LLM-steps and the video for training, we propose the Multi-Pathway Text-Video Alignment (MPTVA) strategy.
arXiv Detail & Related papers (2024-09-22T18:40:55Z)
Language-Guided Self-Supervised Video Summarization Using Text Semantic Matching Considering the Diversity of the Video [22.60291297308379]
We investigate the feasibility in transforming the video summarization task into a Natural Language Processing (NLP) task. Our method achieves state-of-the-art performance on the SumMe dataset in rank correlation coefficients.
arXiv Detail & Related papers (2024-05-14T18:07:04Z)
Multi-Sentence Grounding for Long-term Instructional Video [63.27905419718045]
We aim to establish an automatic, scalable pipeline for denoising a large-scale instructional dataset. We construct a high-quality video-text dataset with multiple descriptive steps supervision, named HowToStep.
arXiv Detail & Related papers (2023-12-21T17:28:09Z)
Conditional Modeling Based Automatic Video Summarization [70.96973928590958]
The aim of video summarization is to shorten videos automatically while retaining the key information necessary to convey the overall story. Video summarization methods rely on visual factors, such as visual consecutiveness and diversity, which may not be sufficient to fully understand the content of the video. A new approach to video summarization is proposed based on insights gained from how humans create ground truth video summaries.
arXiv Detail & Related papers (2023-11-20T20:24:45Z)
Learning to Ground Instructional Articles in Videos through Narrations [50.3463147014498]
We present an approach for localizing steps of procedural activities in narrated how-to videos. We source the step descriptions from a language knowledge base (wikiHow) containing instructional articles. Our model learns to temporally ground the steps of procedural articles in how-to videos by matching three modalities.
arXiv Detail & Related papers (2023-06-06T15:45:53Z)
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations [54.62547989034184]
We propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations. Specifically, we use the Expectation-Maximization algorithm to find a compact set of bases for the latent space. Experiments on three benchmark text-video retrieval datasets prove that our EMCL can learn more discriminative video-and-language representations.
arXiv Detail & Related papers (2022-11-21T13:12:44Z)
TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency [133.75876535332003]
We focus on summarizing instructional videos, an under-explored area of video summarization. Existing video summarization datasets rely on manual frame-level annotations. We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer.
arXiv Detail & Related papers (2022-08-14T04:07:40Z)
Video Summarization Based on Video-text Modelling [0.0]
We propose a multimodal self-supervised learning framework to obtain semantic representations of videos. We also introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries. An objective evaluation framework is proposed to measure the quality of video summaries based on video classification.
arXiv Detail & Related papers (2022-01-07T15:21:46Z)
CLIP-It! Language-Guided Video Summarization [96.69415453447166]
This work introduces CLIP-It, a single framework for addressing both generic and query-focused video summarization. We propose a language-guided multimodal transformer that learns to score frames in a video based on their importance relative to one another. Our model can be extended to the unsupervised setting by training without ground-truth supervision.
arXiv Detail & Related papers (2021-07-01T17:59:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.