Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization
- URL: http://arxiv.org/abs/2510.17501v3
- Date: Wed, 22 Oct 2025 17:54:43 GMT
- Title: Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization
- Authors: Yuanli Wu, Long Zhang, Yue Du, Bin Li
- Abstract summary: We propose a rubric-guided, pseudo-labeled, and prompt-driven zero-shot video summarization framework. A small subset of human annotations is converted into high-confidence pseudo labels. During inference, boundary scenes are scored independently based on their own descriptions.
- Score: 6.057968525653529
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a rubric-guided, pseudo-labeled, and prompt-driven zero-shot video summarization framework that bridges large language models with structured semantic reasoning. A small subset of human annotations is converted into high-confidence pseudo labels and organized into dataset-adaptive rubrics defining clear evaluation dimensions such as thematic relevance, action detail, and narrative progression. During inference, boundary scenes, including the opening and closing segments, are scored independently based on their own descriptions, while intermediate scenes incorporate concise summaries of adjacent segments to assess narrative continuity and redundancy. This design enables the language model to balance local salience with global coherence without any parameter tuning. Across three benchmarks, the proposed method achieves stable and competitive results, with F1 scores of 57.58 on SumMe, 63.05 on TVSum, and 53.79 on QFVS, surpassing zero-shot baselines by +0.85, +0.84, and +0.37, respectively. These outcomes demonstrate that rubric-guided pseudo labeling combined with contextual prompting effectively stabilizes LLM-based scoring and establishes a general, interpretable, and training-free paradigm for both generic and query-focused video summarization.
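The context-aware prompting scheme described above can be sketched in code. This is a minimal illustration, not the authors' implementation: the `Scene` type, the `build_prompt` helper, and the rubric wording are all hypothetical, and the actual LLM call is omitted. It shows only the core branching logic: boundary scenes (opening and closing) are scored from their own descriptions, while intermediate scenes additionally receive concise summaries of their neighbors so the model can assess narrative continuity and redundancy.

```python
# Illustrative sketch of context-aware scene prompting (hypothetical names;
# the rubric text paraphrases the paper's stated evaluation dimensions).
from dataclasses import dataclass

RUBRIC = (
    "Rate this scene from 0 to 10 on: thematic relevance, "
    "action detail, and narrative progression."
)

@dataclass
class Scene:
    description: str  # full caption of the scene
    summary: str      # concise summary, used as context for adjacent scenes

def build_prompt(scenes: list[Scene], i: int) -> str:
    """Build the scoring prompt for scene i.

    Boundary scenes (first and last) are scored independently from their
    own descriptions; intermediate scenes also see summaries of the
    previous and next scenes to judge continuity and redundancy.
    """
    parts = [RUBRIC, f"Scene: {scenes[i].description}"]
    if 0 < i < len(scenes) - 1:  # intermediate scene: add neighbor context
        parts.append(f"Previous scene summary: {scenes[i - 1].summary}")
        parts.append(f"Next scene summary: {scenes[i + 1].summary}")
    return "\n".join(parts)
```

Because the prompt construction is training-free, the only per-dataset adaptation is the rubric text itself; the scoring model's parameters are never updated.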
Related papers
- Less is More: Label-Guided Summarization of Procedural and Instructional Videos [21.13311741987469]
We propose a three-stage framework, PRISM: Procedural Representation via Integrated Semantic and Multimodal analysis. We analyze adaptive visual sampling, label-driven anchoring, and contextual validation using a large language model (LLM). Our approach generalizes across procedural and domain-specific video tasks, achieving strong performance with both semantic alignment and precision.
arXiv Detail & Related papers (2026-01-18T03:41:48Z) - SPOT: An Annotated French Corpus and Benchmark for Detecting Critical Interventions in Online Conversations [10.409447852574907]
SPOT is the first annotated corpus translating the sociological concept of stopping point into a reproducible NLP task. The corpus contains 43,305 manually annotated French Facebook comments linked to URLs flagged as false information by social media users. We benchmark fine-tuned encoder models (CamemBERT) and instruction-tuned LLMs under various prompting strategies.
arXiv Detail & Related papers (2025-11-10T18:54:40Z) - HierSum: A Global and Local Attention Mechanism for Video Summarization [14.88934924520362]
We focus on summarizing instructional videos and propose a method for breaking down a video into meaningful segments. HierSum integrates fine-grained local cues from subtitles with global contextual information provided by video-level instructions. We show that HierSum consistently outperforms existing methods in key metrics such as F1-score and rank correlation.
arXiv Detail & Related papers (2025-04-25T20:30:30Z) - GUM-SAGE: A Novel Dataset and Approach for Graded Entity Salience Prediction [12.172254885579706]
Graded entity salience assigns entities scores that reflect their relative importance in a text. We introduce a novel approach for graded entity salience that combines the strengths of both approaches. Our approach shows stronger correlation with scores based on human summaries and alignments, and outperforms existing techniques.
arXiv Detail & Related papers (2025-04-15T01:26:14Z) - Collaborative Temporal Consistency Learning for Point-supervised Natural Language Video Localization [129.43937834515688]
We propose a new COllaborative Temporal consistEncy Learning (COTEL) framework to strengthen the video-language alignment. Specifically, we first design a frame- and a segment-level Temporal Consistency Learning (TCL) module that models semantic alignment across frame saliencies and sentence-moment pairs.
arXiv Detail & Related papers (2025-03-22T05:04:12Z) - Towards Enhancing Coherence in Extractive Summarization: Dataset and Experiments with LLMs [70.15262704746378]
We propose a systematically created human-annotated dataset consisting of coherent summaries for five publicly available datasets and natural language user feedback.
Preliminary experiments with Falcon-40B and Llama-2-13B show significant performance improvements (10% Rouge-L) in terms of producing coherent summaries.
arXiv Detail & Related papers (2024-07-05T20:25:04Z) - Semi-Supervised Dialogue Abstractive Summarization via High-Quality Pseudolabel Selection [27.531083525683243]
Semi-supervised dialogue summarization (SSDS) leverages model-generated summaries to reduce reliance on human-labeled data.
We propose a novel scoring approach, SiCF, which encapsulates three primary dimensions of summarization model quality.
arXiv Detail & Related papers (2024-03-06T22:06:23Z) - SWING: Balancing Coverage and Faithfulness for Dialogue Summarization [67.76393867114923]
We propose to utilize natural language inference (NLI) models to improve coverage while avoiding factual inconsistencies.
We use NLI to compute fine-grained training signals to encourage the model to generate content in the reference summaries that have not been covered.
Experiments on the DialogSum and SAMSum datasets confirm the effectiveness of the proposed approach.
arXiv Detail & Related papers (2023-01-25T09:33:11Z) - Evaluating the Factual Consistency of Large Language Models Through News Summarization [97.04685401448499]
We propose a new benchmark called FIB(Factual Inconsistency Benchmark) that focuses on the task of summarization.
For factually consistent summaries, we use human-written reference summaries that we manually verify as factually consistent.
For factually inconsistent summaries, we generate summaries from a suite of summarization models that we have manually annotated as factually inconsistent.
arXiv Detail & Related papers (2022-11-15T18:50:34Z) - COLO: A Contrastive Learning based Re-ranking Framework for One-Stage Summarization [84.70895015194188]
We propose a Contrastive Learning based re-ranking framework for one-stage summarization called COLO.
COLO boosts the extractive and abstractive results of one-stage systems on CNN/DailyMail benchmark to 44.58 and 46.33 ROUGE-1 score.
arXiv Detail & Related papers (2022-09-29T06:11:21Z) - CLIP-It! Language-Guided Video Summarization [96.69415453447166]
This work introduces CLIP-It, a single framework for addressing both generic and query-focused video summarization.
We propose a language-guided multimodal transformer that learns to score frames in a video based on their importance relative to one another.
Our model can be extended to the unsupervised setting by training without ground-truth supervision.
arXiv Detail & Related papers (2021-07-01T17:59:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.