Fugu-MT Paper Translation (Abstract): DeepSport: A Multimodal Large Language Model for Comprehensive Sports Video Reasoning via Agentic Reinforcement Learning

Paper overview: DeepSport: A Multimodal Large Language Model for Comprehensive Sports Video Reasoning via Agentic Reinforcement Learning

arxiv url: http://arxiv.org/abs/2511.12908v1
Date: Mon, 17 Nov 2025 02:57:15 GMT
Status: Translation complete
Last updated in system: 2025-11-18 14:36:24.623795
Title: DeepSport: A Multimodal Large Language Model for Comprehensive Sports Video Reasoning via Agentic Reinforcement Learning
Authors: Junbo Zou, Haotian Xia, Zhen Ye, Shengjie Zhang, Christopher Lai, Vicente Ordonez, Weining Shen, Hanjie Chen
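Abstract summary: DeepSport is the first end-to-end trained MLLM framework designed for multi-task, multi-sport video understanding. The work establishes a new foundation for domain-specific video reasoning to address the complexities of diverse sports.
Reference score (attention score computed by this site): 25.001089287899998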
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Sports video understanding presents unique challenges, requiring models to perceive high-speed dynamics, comprehend complex rules, and reason over long temporal contexts. While Multimodal Large Language Models (MLLMs) have shown promise in general domains, research in sports remains narrowly focused: existing approaches are either single-sport centric, limited to specific tasks, or reliant on training-free paradigms that lack a robust, learned reasoning process. To address this gap, we introduce DeepSport, the first end-to-end trained MLLM framework designed for multi-task, multi-sport video understanding. DeepSport shifts the paradigm from passive frame processing to active, iterative reasoning, empowering the model to "think with videos" by dynamically interrogating content via a specialized frame-extraction tool. To enable this, we propose a data distillation pipeline that synthesizes high-quality Chain-of-Thought (CoT) trajectories from 10 diverse data sources, creating a unified resource of 78k training samples. We then employ a two-stage training strategy, Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) with a novel gated tool-use reward, to optimize the model's reasoning process. Extensive experiments on a testing benchmark of 6.7k questions demonstrate that DeepSport achieves state-of-the-art performance, significantly outperforming both proprietary and open-source baseline models. Our work establishes a new foundation for domain-specific video reasoning to address the complexities of diverse sports.
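
The abstract names a "gated tool-use reward" but does not give its form. As a purely illustrative sketch, one plausible reading is that the bonus for calling the frame-extraction tool is paid only when the final answer is also correct. The `Rollout` structure, the function name, and the weights below are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One sampled reasoning trajectory (hypothetical structure)."""
    answer_correct: bool   # did the final answer match the reference?
    well_formatted: bool   # did the output follow the expected template?
    tool_calls: int        # number of frame-extraction calls made


def gated_tool_reward(r: Rollout,
                      w_format: float = 0.1,
                      w_tool: float = 0.2) -> float:
    """Illustrative reward in which the tool bonus is gated on correctness.

    The weights are arbitrary placeholders, not values from the paper.
    """
    reward = 1.0 if r.answer_correct else 0.0
    if r.well_formatted:
        reward += w_format
    # Gate: tool use earns a bonus only when the answer is also correct,
    # so the policy cannot collect reward by calling the tool aimlessly.
    if r.answer_correct and r.tool_calls > 0:
        reward += w_tool
    return reward
```

Under this assumed design, a rollout that calls the tool many times but answers incorrectly receives no tool bonus, which is the behavior the word "gated" suggests.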

