Fugu-MT Paper Translation (Abstract): FCMBench-Video: Benchmarking Document Video Intelligence

Paper overview: FCMBench-Video: Benchmarking Document Video Intelligence

arxiv url: http://arxiv.org/abs/2604.25186v2
Date: Thu, 30 Apr 2026 03:30:43 GMT
Status: translation complete
Last updated in system: 2026-05-01 14:06:12.655114
Title: FCMBench-Video: Benchmarking Document Video Intelligence
Title (reference translation): FCMBench-Video: A Benchmark for Document Video Intelligence
Authors: Runze Cui, Fangxin Shang, Yehui Yang, Qing Yang, Yanwu Xu, Tao Chen,
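Abstract summary: FCMBench-Video is a benchmark for document-video intelligence. It evaluates document perception, temporal grounding, and evidence-grounded reasoning. It is built from 495 atomic videos composed into 1,200 long-form videos.
Reference score (independently computed attention measure): 8.515144837095095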
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Document understanding is a critical capability in financial credit review, onboarding, and remote verification, where both decision accuracy and evidence traceability matter. Compared with static document images, document videos present a temporally redundant and sequentially unfolding evidence stream, require evidence integration across frames, and preserve acquisition-process cues relevant to authenticity-sensitive and anti-fraud review. We introduce FCMBench-Video, a benchmark for document-video intelligence that evaluates document perception, temporal grounding, and evidence-grounded reasoning under realistic capture conditions. For privacy-compliant yet realistic data at scale, we organize construction as an atomic-acquisition and composition workflow that records reusable single-document clips, applies controlled degradations, and assembles long-form multi-document videos with prescribed temporal spans. FCMBench-Video is built from 495 atomic videos composed into 1,200 long-form videos paired with 11,322 expert-annotated question-answer instances, covering 28 document types over 20s to 60s duration tiers and 5,960 Chinese / 5,362 English instances. Evaluations on nine recent Video-MLLMs show that FCMBench-Video provides meaningful separation across systems and capabilities: counting is the most duration-sensitive task, Cross-Document Validation and Evidence-Grounded Selection probe higher-level evidence integration, and Visual Prompt Injection provides a complementary robustness dimension. The overall score distribution is broad and approximately bell-shaped, indicating a benchmark that is neither saturated nor dominated by trivial cases. Together, these results position FCMBench-Video as a reproducible benchmark for tracking Video-MLLM progress on document-video understanding and probing capability boundaries in authenticity-sensitive credit-domain applications.
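Abstract (reference translation): Document understanding is an important capability in financial credit review, onboarding, and remote verification, where both decision accuracy and evidence traceability matter. Compared with static document images, a document video presents a temporally redundant evidence stream that unfolds in sequence, requires evidence to be integrated across frames, and retains cues from the capture process that are relevant to authenticity-sensitive and anti-fraud review. FCMBench-Video is a benchmark for document-video intelligence that evaluates document perception, temporal grounding, and evidence-grounded reasoning under realistic capture conditions. To obtain data that is privacy-compliant yet realistic at scale, construction is organized as an atomic-acquisition and composition workflow: reusable single-document clips are recorded, controlled degradations are applied, and long-form multi-document videos are assembled with prescribed temporal spans. FCMBench-Video is built from 495 atomic videos composed into 1,200 long-form videos, paired with 11,322 expert-annotated question-answer instances; it covers 28 document types across duration tiers from 20 to 60 seconds, with 5,960 Chinese and 5,362 English instances. Evaluations of nine recent Video-MLLMs show that FCMBench-Video separates systems and capabilities in a meaningful way: counting is the most duration-sensitive task, Cross-Document Validation and Evidence-Grounded Selection probe higher-level evidence integration, and Visual Prompt Injection adds a complementary robustness dimension. The overall score distribution is broad and approximately bell-shaped, which indicates a benchmark that is neither saturated nor dominated by trivial cases. Taken together, these results position FCMBench-Video as a reproducible benchmark for tracking Video-MLLM progress on document-video understanding and for probing capability boundaries in authenticity-sensitive credit-domain applications.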
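
The composition step described in the abstract (assembling long-form multi-document videos from atomic clips with prescribed temporal spans) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the names `AtomicClip`, `Segment`, and `compose`, the document types, the clip durations, and the greedy shuffle-and-trim strategy are all assumptions made for illustration, and the controlled-degradation step is not modeled. The sketch only shows how known per-clip spans could serve as ground truth for temporal-grounding and counting questions.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class AtomicClip:
    """Hypothetical record for one reusable single-document clip."""
    clip_id: str
    doc_type: str
    duration_s: float


@dataclass(frozen=True)
class Segment:
    """Where one atomic clip lands on the long-form timeline."""
    clip_id: str
    doc_type: str
    start_s: float
    end_s: float


def compose(clips, target_s, rng):
    """Lay shuffled clips end to end until the target duration is reached.

    The last clip is trimmed so the timeline never exceeds target_s.
    This greedy strategy is an assumption, not the paper's method.
    """
    order = list(clips)
    rng.shuffle(order)
    timeline = []
    t = 0.0
    for clip in order:
        if t >= target_s:
            break
        end = min(t + clip.duration_s, target_s)
        timeline.append(Segment(clip.clip_id, clip.doc_type, t, end))
        t = end
    return timeline


# Hypothetical pool of atomic clips (durations in seconds).
pool = [
    AtomicClip("a", "id_card", 8.0),
    AtomicClip("b", "bank_statement", 12.0),
    AtomicClip("c", "invoice", 15.0),
    AtomicClip("d", "contract", 10.0),
]

# Compose one video for the 20-second duration tier.
timeline = compose(pool, target_s=20.0, rng=random.Random(0))
for seg in timeline:
    print(f"{seg.doc_type}: {seg.start_s:.1f}s to {seg.end_s:.1f}s")
```

Because every segment's start and end are fixed at composition time, the number of documents shown and the span of each one are known without annotating frames, which is one plausible reason to build long videos from atomic clips.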
