Fugu-MT Paper Translation (Abstract): Uncertainty-Aware and Decoder-Aligned Learning for Video Summarization

Paper overview: Uncertainty-Aware and Decoder-Aligned Learning for Video Summarization

arxiv url: http://arxiv.org/abs/2605.09507v1
Date: Sun, 10 May 2026 12:31:02 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:50.286223
Title: Uncertainty-Aware and Decoder-Aligned Learning for Video Summarization
Authors: Omer Tariq, Syed Muhammad Raza, Jeongbae Son
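Abstract summary: The goal of video summarization is to create a compact representation of a long video by selecting a subset of temporally important segments. The task is inherently difficult because of strong annotation subjectivity and its reliance on discrete decoding procedures. This paper proposes VASTSum, an uncertainty-aware and decoder-aligned learning framework for video summarization.
Reference score (independently computed attention measure): 0.0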
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video summarization aims to produce a compact representation of a long video by selecting a subset of temporally important segments that best reflect human preferences. This task is inherently difficult due to strong annotation subjectivity and the reliance on discrete decoding procedures, such as temporal segmentation and knapsack-based selection, during evaluation. Most existing approaches either learn deterministic importance scores that overlook these characteristics or adopt complex generative models that increase training and inference cost. In this paper, we propose VASTSum, an uncertainty-aware and decoder-aligned learning framework for video summarization that addresses both challenges within a single-pass model. The proposed method predicts probabilistic frame-level importance scores using a variational formulation, enabling explicit modeling of uncertainty arising from multi-annotator supervision. To account for subjectivity, particularly under binary annotations, we employ a supervision strategy that encourages alignment with plausible human annotation modes rather than enforcing a single consensus target. Furthermore, we introduce a decoder-aligned regularization that promotes stability of knapsack-based summary selection, reducing sensitivity to small perturbations in predicted scores. We evaluate the proposed framework on the SumMe and TVSum benchmarks using standard rank-based metrics. Experimental results show consistent and competitive Kendall and Spearman correlations across multiple data splits, demonstrating improved robustness under annotation disagreement while maintaining efficient single-forward inference. These results indicate that explicitly modeling uncertainty and aligning learning objectives with the decoding stage provide a principled alternative to both deterministic and diffusion-based video summarization methods.
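The knapsack-based selection step mentioned in the abstract can be illustrated with a minimal sketch. This is not the VASTSum decoder, whose details are not given here; it is a textbook 0/1 knapsack over segments, and the function name `knapsack_select`, the segment scores, the segment lengths, and the budget are all hypothetical values chosen for illustration. The second call shows how a small change in one predicted score can switch the selected summary, which is the kind of sensitivity a decoder-aligned regularizer would aim to reduce.

```python
from typing import List, Sequence


def knapsack_select(
    scores: Sequence[float], lengths: Sequence[int], budget: int
) -> List[int]:
    """Pick segment indices maximizing total score with total length <= budget.

    Illustrative 0/1 knapsack by dynamic programming (assumed setup, not the
    paper's implementation). Lengths and budget are integers, e.g. frames.
    """
    n = len(scores)
    # best[i][c] = best total score using the first i segments within capacity c
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        length, score = lengths[i - 1], scores[i - 1]
        for c in range(budget + 1):
            best[i][c] = best[i - 1][c]  # option 1: skip segment i-1
            if length <= c:
                cand = best[i - 1][c - length] + score  # option 2: take it
                if cand > best[i][c]:
                    best[i][c] = cand
    # Backtrack: a changed value means segment i-1 was taken at that step.
    chosen: List[int] = []
    c = budget
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= lengths[i - 1]
    return sorted(chosen)


if __name__ == "__main__":
    lengths = [40, 30, 50, 20]  # hypothetical segment lengths in frames
    budget = 90                 # hypothetical summary length limit
    print(knapsack_select([0.9, 0.2, 0.7, 0.40], lengths, budget))  # [0, 2]
    # Raising one segment score by 0.15 flips the discrete selection.
    print(knapsack_select([0.9, 0.2, 0.7, 0.55], lengths, budget))  # [0, 1, 3]
```

Because the selection is discrete, the output changes abruptly rather than smoothly as scores move, so two score vectors with similar rank correlation can still decode to different summaries.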