Fugu-MT Paper Translation (Abstract): SF20K Competition 2025: Summary and findings

Paper overview: SF20K Competition 2025: Summary and findings

arxiv url: http://arxiv.org/abs/2605.01496v1
Date: Sat, 02 May 2026 15:35:54 GMT
Status: Translation complete
Last updated in system: 2026-05-05 20:33:49.80466
Title: SF20K Competition 2025: Summary and findings
Title (reference translation): SF20K Competition 2025: Overview and Results
Authors: Ridouane Ghermi, Xi Wang, Vicky Kalogeiton, Ivan Laptev
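Abstract summary: The Short-Films 20K (SF20K) Competition is designed to advance story-level video understanding beyond short-clip action recognition. Models must rely on multimodal understanding rather than memorization of popular movies. The winning team achieved 65.7% accuracy on the Main Track and 48.7% on the Special Track, against a human performance ceiling of 91.7%.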
Reference score (independently computed attention score): 34.86183179717155
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This report presents the results and findings of the first edition of the Short-Films 20K (SF20K) Competition, held in conjunction with the SLoMO Workshop at ICCV 2025. The competition is designed to advance story-level video understanding beyond short-clip action recognition, introducing an open-ended video question-answering task built on a corpus of amateur short films. This setup ensures that models must rely on multimodal understanding rather than memorization of popular movies. Evaluation is conducted using the SF20K-Test benchmark (95 movies, 979 question-answer pairs) and scored via LLM-QA-Eval, an automated judge based on GPT-4.1-nano. The competition attracted 22 teams and 286 submissions across two tracks: a Main Track with unrestricted model size and a Special Track limited to models under 8 billion parameters. The winning team achieved 65.7% accuracy on the Main Track and 48.7% on the Special Track, against a human performance ceiling of 91.7%. Our analysis reveals several key findings: narrative-aware, shot-level processing consistently outperforms uniform frame sampling; well-designed multi-stage pipelines using smaller models can match or exceed end-to-end inference with models over 30x larger; and subtitle quality is a dominant factor in performance. These results highlight that the primary bottleneck in long-form video QA lies in information selection and reasoning structure rather than raw model capacity, and that a substantial gap remains between current methods and human-level narrative comprehension.
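Abstract (reference translation): This report presents the results and findings of the first edition of the SF20K Competition, held in conjunction with the SLoMO Workshop at ICCV 2025. The competition is designed to advance story-level video understanding beyond short-clip action recognition, and introduces an open-ended video question-answering task built on a corpus of amateur short films. With this setup, models must rely on multimodal understanding rather than memorization of popular movies. Evaluation uses the SF20K-Test benchmark (95 movies, 979 question-answer pairs) and is scored by LLM-QA-Eval, an automated judge based on GPT-4.1-nano. The competition drew 22 teams and 286 submissions across two tracks: a Main Track with unrestricted model size and a Special Track limited to models under 8 billion parameters. The winning team achieved 65.7% on the Main Track and 48.7% on the Special Track, against a human performance ceiling of 91.7%. The analysis shows that narrative-aware, shot-level processing consistently outperforms uniform frame sampling; that well-designed multi-stage pipelines using smaller models can match or exceed end-to-end inference with models over 30x larger; and that subtitle quality is a dominant factor in performance. These results suggest that the main bottleneck in long-form video QA lies in information selection and reasoning structure rather than raw model capacity, and that a large gap remains between current methods and human-level narrative comprehension.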

Related papers