Fugu-MT Paper Translation (Abstract): AdaRD-key: Adaptive Relevance-Diversity Keyframe Sampling for Long-form Video understanding

Paper summary: AdaRD-key: Adaptive Relevance-Diversity Keyframe Sampling for Long-form Video understanding

arxiv url: http://arxiv.org/abs/2510.02778v1
Date: Fri, 03 Oct 2025 07:19:34 GMT
Status: Translation complete
System update date: 2025-10-06 16:35:52.296957
Title: AdaRD-key: Adaptive Relevance-Diversity Keyframe Sampling for Long-form Video understanding
Title (reference translation): AdaRD-Key: Adaptive Relevance-Diversity Keyframe Sampling for Long-Form Video Understanding
Authors: Xian Zhang, Zexi Wu, Zinuo Li, Hongming Xu, Luqi Gong, Farid Boussaid, Naoufel Werghi, Mohammed Bennamoun
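Abstract summary: AdaRD-Key is a training-free sampling module for query-driven long-form video understanding. To handle broad queries that align only weakly with the video, AdaRD-Key employs a lightweight relevance-aware gating mechanism. The pipeline is training-free, computationally efficient (running in real time on a single GPU), and compatible with existing vision-language models.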
Reference score (independently computed attention score): 31.685368980481968
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Understanding long-form videos remains a significant challenge for vision-language models (VLMs) due to their extensive temporal length and high information density. Most current multimodal large language models (MLLMs) rely on uniform sampling, which often overlooks critical moments, leading to incorrect responses to queries. In parallel, many keyframe selection approaches impose rigid temporal spacing: once a frame is chosen, an exclusion window suppresses adjacent timestamps to reduce redundancy. While effective at limiting overlap, this strategy frequently misses short, fine-grained cues near important events. Other methods instead emphasize visual diversity but neglect query relevance. We propose AdaRD-Key, a training-free keyframe sampling module for query-driven long-form video understanding. AdaRD-Key maximizes a unified Relevance-Diversity Max-Volume (RD-MV) objective, combining a query-conditioned relevance score with a log-determinant diversity component to yield informative yet non-redundant frames. To handle broad queries with weak alignment to the video, AdaRD-Key employs a lightweight relevance-aware gating mechanism; when the relevance distribution indicates weak alignment, the method seamlessly shifts into a diversity-only mode, enhancing coverage without additional supervision. Our pipeline is training-free, computationally efficient (running in real time on a single GPU), and compatible with existing VLMs in a plug-and-play manner. Extensive experiments on LongVideoBench and Video-MME demonstrate state-of-the-art performance, particularly on long-form videos. Code available at https://github.com/Xian867/AdaRD-Key.
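The abstract names the ingredients of the RD-MV objective (a query-conditioned relevance score, a log-determinant diversity term, and a relevance-aware gate) but gives no formula. The sketch below is one possible reading, not the authors' implementation: it assumes cosine similarity for relevance, a greedy search over the score `sum(relevance) + lam * logdet(Gram)`, and a gate that compares the peak relevance with the mean. The function name `select_keyframes` and the parameters `lam`, `gate_threshold`, and `eps` are hypothetical and do not come from the paper or its repository.

```python
import numpy as np


def select_keyframes(frame_feats, query_feat, k, lam=1.0,
                     gate_threshold=0.1, eps=1e-3):
    """Greedy relevance + log-det diversity selection (illustrative only).

    frame_feats: (n, d) array of frame embeddings.
    query_feat:  (d,) array holding the query embedding.
    Returns (sorted frame indices, whether the relevance term was used).
    """
    # L2-normalise so that dot products are cosine similarities.
    feats = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    query = query_feat / np.linalg.norm(query_feat)
    relevance = feats @ query

    # Hypothetical gate: a nearly flat relevance profile is treated as weak
    # query-video alignment, and the relevance term is then dropped.
    use_relevance = bool(relevance.max() - relevance.mean() >= gate_threshold)

    # Gram matrix of the frames; eps keeps every sub-matrix positive definite.
    gram = feats @ feats.T + eps * np.eye(len(feats))

    selected = []
    for _ in range(min(k, len(feats))):
        best_idx, best_score = None, -np.inf
        for i in range(len(feats)):
            if i in selected:
                continue
            idx = selected + [i]
            # log-determinant of the Gram sub-matrix = log squared volume
            # spanned by the candidate set (larger means less redundant).
            _, logdet = np.linalg.slogdet(gram[np.ix_(idx, idx)])
            score = lam * logdet
            if use_relevance:
                score += relevance[idx].sum()
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    return sorted(selected), use_relevance
```

With frame embeddings that contain exact duplicates, the log-determinant term heavily penalises picking a second copy of an already selected frame, so the greedy step moves on to a visually different frame even when the duplicate has higher relevance. When every frame is equally similar to the query, the gate in this sketch switches the relevance term off and the selection is driven by diversity alone.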
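Abstract (reference translation): Understanding long videos is a significant challenge for vision-language models (VLMs) because of their temporal length and high information density. Most current multimodal large language models (MLLMs) rely on uniform sampling, which often overlooks critical moments and leads to incorrect responses to queries. In parallel, in many keyframe selection approaches, once a frame is chosen an exclusion window suppresses adjacent timestamps to reduce redundancy. Although this is effective at limiting overlap, the strategy frequently misses short, fine-grained cues near important events. Other methods emphasize visual diversity but neglect query relevance. AdaRD-Key is a training-free keyframe sampling module for query-driven long-form video understanding. AdaRD-Key maximizes a unified Relevance-Diversity Max-Volume (RD-MV) objective that combines a query-conditioned relevance score with a log-determinant diversity component. To handle broad queries with weak alignment to the video, AdaRD-Key employs a lightweight relevance-aware gating mechanism. The pipeline is training-free, computationally efficient (running in real time on a single GPU), and compatible with existing VLMs in a plug-and-play manner. Extensive experiments on LongVideoBench and Video-MME demonstrate state-of-the-art performance, particularly on long-form videos. Code is available at https://github.com/Xian867/AdaRD-Key.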

Related papers