Fugu-MT Paper Translation (Abstract): A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs

Paper overview: A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs

arxiv url: http://arxiv.org/abs/2606.04596v1
Date: Wed, 03 Jun 2026 08:34:31 GMT
Status: Translation complete
System update date: 2026-06-04 20:44:18.632714
Title: A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs
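Title (reference translation): A Systematic Evaluation of Positional Bias in Multi-Video Summarization Using MLLMs
Authors: Huangchen Xu, Yuan Wu, Yi Chang
Abstract summary: We study positional bias in multi-video summarization, where the quality of a per-video summary can change with the video's input slot. We evaluate nine open-source and proprietary MLLMs and measure position effects with three complementary metrics.
Reference score (attention score computed independently by this site): 16.995082216096787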
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Large Language Models (MLLMs) are increasingly used for video understanding, yet their reliability under multi-video inputs remains poorly understood. We study positional bias in multi-video summarization, where the quality of a per-video summary can change with the video's input slot even when the underlying content is unchanged. We construct a benchmark from ActivityNet and News videos, covering Cooking, Domestic, Leisure, and News settings with two- and four-video inputs. We evaluate nine open-source and proprietary MLLMs and measure position effects with three complementary metrics: Coverage, Directional Positional Bias (DPB), and Middle-Edge Gap (MEG). Our results show that positional effects are domain- and model-dependent: signed directional bias can be small even when middle positions underperform, and increasing visual or generation budget does not uniformly remove the imbalance. We further analyze prompt-level mitigation methods. Together, the results show that multi-video summarization remains sensitive to input protocol and position, motivating more robust order-invariant multimodal systems.
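Abstract (reference translation): Multimodal Large Language Models (MLLMs) are increasingly used for video understanding, but their reliability under multi-video inputs is still poorly understood. We study positional bias in multi-video summarization, where the quality of a per-video summary can change with the video's input slot even when the underlying content is unchanged. We construct a benchmark from ActivityNet and News videos that covers Cooking, Domestic, Leisure, and News settings with two- and four-video inputs. We evaluate nine open-source and proprietary MLLMs and measure position effects with three complementary metrics: Coverage, Directional Positional Bias (DPB), and Middle-Edge Gap (MEG). The results show that positional effects depend on both domain and model: signed directional bias can be small even when middle positions underperform, and increasing the visual or generation budget does not uniformly remove the imbalance. We further analyze prompt-level mitigation methods. Overall, multi-video summarization remains sensitive to input protocol and position, which motivates more robust order-invariant multimodal systems.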
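The abstract names DPB and MEG but does not define them on this page, so the following Python sketch is only an illustration of the general idea, not the authors' formulas. It assumes that each video is scored once per input slot (here via cyclic rotations of the input order) and that per-slot quality scores are already available. The functions `cyclic_orders`, `mean_score_by_slot`, `first_last_gap`, and `middle_edge_gap` are hypothetical names introduced for this example, and the score values are made up.

```python
from statistics import mean


def cyclic_orders(videos):
    """Yield len(videos) rotations so every video occupies every slot once."""
    n = len(videos)
    for shift in range(n):
        yield [videos[(i + shift) % n] for i in range(n)]


def mean_score_by_slot(scores):
    """Average quality per input slot.

    scores[r][s] is the (assumed, externally computed) quality of the
    summary produced for the video placed in slot s during run r.
    """
    n_slots = len(scores[0])
    return [mean(run[s] for run in scores) for s in range(n_slots)]


def first_last_gap(slot_means):
    """Signed first-minus-last gap: a hypothetical stand-in for a
    directional bias measure, not the paper's DPB definition."""
    return slot_means[0] - slot_means[-1]


def middle_edge_gap(slot_means):
    """Edge mean minus middle mean: a hypothetical stand-in for a
    middle-versus-edge measure, not the paper's MEG definition."""
    if len(slot_means) < 3:
        raise ValueError("need at least three slots to have a middle")
    edges = [slot_means[0], slot_means[-1]]
    middle = slot_means[1:-1]
    return mean(edges) - mean(middle)


# Illustrative, made-up scores for two runs with four-video inputs.
example_scores = [
    [0.8, 0.6, 0.6, 0.8],
    [0.9, 0.5, 0.7, 0.9],
]
slot_means = mean_score_by_slot(example_scores)
print("first-last gap:", first_last_gap(slot_means))
print("middle-edge gap:", middle_edge_gap(slot_means))
```

With these made-up scores the first and last slots have the same mean while the two middle slots score lower, which mirrors the abstract's point that a signed directional measure can be small even when middle positions underperform.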
