DeepSport: A Multimodal Large Language Model for Comprehensive Sports Video Reasoning via Agentic Reinforcement Learning
- URL: http://arxiv.org/abs/2511.12908v1
- Date: Mon, 17 Nov 2025 02:57:15 GMT
- Title: DeepSport: A Multimodal Large Language Model for Comprehensive Sports Video Reasoning via Agentic Reinforcement Learning
- Authors: Junbo Zou, Haotian Xia, Zhen Ye, Shengjie Zhang, Christopher Lai, Vicente Ordonez, Weining Shen, Hanjie Chen
- Abstract summary: DeepSport is the first end-to-end trained MLLM framework designed for multi-task, multi-sport video understanding. Our work establishes a new foundation for domain-specific video reasoning to address the complexities of diverse sports.
- Score: 25.001089287899998
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Sports video understanding presents unique challenges, requiring models to perceive high-speed dynamics, comprehend complex rules, and reason over long temporal contexts. While Multimodal Large Language Models (MLLMs) have shown promise in general domains, the current state of research in sports remains narrowly focused: existing approaches are either single-sport centric, limited to specific tasks, or rely on training-free paradigms that lack a robust, learned reasoning process. To address this gap, we introduce DeepSport, the first end-to-end trained MLLM framework designed for multi-task, multi-sport video understanding. DeepSport shifts the paradigm from passive frame processing to active, iterative reasoning, empowering the model to "think with videos" by dynamically interrogating content via a specialized frame-extraction tool. To enable this, we propose a data distillation pipeline that synthesizes high-quality Chain-of-Thought (CoT) trajectories from 10 diverse data sources, creating a unified resource of 78k training examples. We then employ a two-stage training strategy, Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) with a novel gated tool-use reward, to optimize the model's reasoning process. Extensive experiments on a test benchmark of 6.7k questions demonstrate that DeepSport achieves state-of-the-art performance, significantly outperforming both proprietary and open-source baselines. Our work establishes a new foundation for domain-specific video reasoning to address the complexities of diverse sports.
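Two mechanisms in the abstract are concrete enough to sketch: the agentic loop in which the model "thinks with videos" by calling a frame-extraction tool, and the gated tool-use reward used during RL. The paper's actual interfaces and reward function are not reproduced here; the Python below is a minimal, hypothetical rendering in which every name, weight, and the exact gating condition (tool credit only for valid calls in episodes that end in a correct answer) is an assumption, not DeepSport's implementation.

```python
# A minimal sketch of the "think with videos" loop, assuming the policy can
# either emit a tool call (requesting frames at chosen timestamps) or a final
# answer. `model_step` and `extract_frames` are hypothetical stand-ins.
def agentic_answer(video, question, model_step, extract_frames, max_turns=4):
    context = [{"role": "user", "text": f"Question: {question}"}]
    for _ in range(max_turns):
        action = model_step(context)  # {"type": "tool_call" | "answer", ...}
        if action["type"] == "tool_call":
            # The model interrogates the video: fetch the frames it asked for.
            frames = extract_frames(video, action["timestamps"])
            context.append({"role": "tool", "frames": frames})
        else:
            return action["answer"]
    # Turn budget exhausted: force a final answer.
    return model_step(context + [{"role": "user", "text": "Answer now."}])["answer"]
```

A gated tool-use reward for training such a loop might then look like the following, where the gate releases the tool bonus only when the episode also succeeds:

```python
# Hypothetical gated tool-use reward. Weights and the gating rule are
# illustrative assumptions, not the paper's published formulation.
from dataclasses import dataclass


@dataclass
class Trajectory:
    answer: str                # model's final answer after reasoning
    gold: str                  # reference answer from the benchmark
    tool_calls: int            # number of frame-extraction calls issued
    valid_tool_calls: int      # calls that parsed and returned frames


def gated_tool_use_reward(traj: Trajectory,
                          accuracy_weight: float = 1.0,
                          tool_bonus: float = 0.2,
                          max_calls: int = 4) -> float:
    """Reward = answer accuracy, plus a tool bonus gated on correctness."""
    correct = float(traj.answer.strip().lower() == traj.gold.strip().lower())
    reward = accuracy_weight * correct

    # Gate: bonus only if the answer is correct and every call was valid.
    if correct and traj.tool_calls > 0 and traj.tool_calls == traj.valid_tool_calls:
        # Diminishing credit discourages spamming tool calls.
        reward += tool_bonus * min(traj.tool_calls, max_calls) / max_calls
    return reward


# Example: a correct answer reached with two valid frame extractions.
print(gated_tool_use_reward(Trajectory("offside", "offside", 2, 2)))  # 1.1
```

Gating the tool bonus on final-answer correctness is a standard guard against reward hacking: an ungated bonus would pay the policy for issuing tool calls whether or not they help.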
Related papers
- ERNIE 5.0 Technical Report [244.36480708815316]
ERNIE 5.0 is a unified autoregressive foundation model for multimodal understanding and generation across text, image, video, and audio. To address practical challenges in large-scale deployment under diverse resource constraints, ERNIE 5.0 adopts a novel elastic training paradigm. We show that ERNIE 5.0 achieves strong and balanced performance across multiple modalities.
arXiv Detail & Related papers (2026-02-04T16:18:15Z)
- SoccerMaster: A Vision Foundation Model for Soccer Understanding [50.88251190999469]
Soccer understanding has recently garnered growing research interest due to its domain-specific complexity and unique challenges. This work proposes a unified model to handle diverse soccer visual understanding tasks, ranging from fine-grained perception to semantic reasoning. We present SoccerMaster, the first soccer-specific vision foundation model that unifies diverse understanding tasks within a single framework.
arXiv Detail & Related papers (2025-12-11T18:03:30Z)
- OneThinker: All-in-one Reasoning Model for Image and Video [45.8205286430071]
We propose OneThinker, an all-in-one reasoning model that unifies image and video understanding across diverse visual tasks. Experiments show that OneThinker delivers strong performance on 31 benchmarks spanning 10 fundamental visual understanding tasks.
arXiv Detail & Related papers (2025-12-02T18:59:52Z)
- Learning Skill-Attributes for Transferable Assessment in Video [56.813876909367856]
Skill assessment from video entails rating the quality of a person's physical performance and explaining what could be done better. Our CrossTrainer approach discovers skill-attributes, such as balance, control, and hand positioning. By abstracting out the shared behaviors indicative of human skill, the proposed video representation generalizes substantially better than an array of existing techniques.
arXiv Detail & Related papers (2025-11-17T23:53:06Z)
- SportR: A Benchmark for Multimodal Large Language Model Reasoning in Sports [21.410115837645318]
SportR is the first large-scale, multi-sport benchmark designed to train and evaluate MLLMs on the fundamental reasoning required for sports intelligence. Our benchmark provides a dataset of 5,017 images and 2,101 videos. For the most advanced tasks requiring multi-step reasoning, such as determining penalties or explaining tactics, we provide 7,118 high-quality, human-authored Chain of Thought annotations.
arXiv Detail & Related papers (2025-11-09T18:55:20Z)
- Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models [78.32948112203228]
Video understanding represents the most challenging frontier in computer vision. The recent emergence of Video Large Multimodal Models (Video-LMMs) has demonstrated remarkable capabilities in video understanding tasks. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities.
arXiv Detail & Related papers (2025-10-06T17:10:44Z)
- Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency [56.475612147721264]
We propose a dual-reward formulation that supervises both semantic and temporal reasoning through discrete and continuous reward signals. We evaluate our approach across eight representative video understanding tasks, including VideoQA, Temporal Video Grounding, and Grounded VideoQA. Results underscore the importance of reward design and data selection in advancing reasoning-centric video understanding with MLLMs.
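The dual-reward formulation named in this abstract, a discrete semantic signal combined with a continuous temporal one, maps naturally onto exact-match accuracy plus temporal IoU. The sketch below is one plausible reading under that assumption; the weights, the matching rule, and the IoU definition are not taken from the paper.

```python
# Hypothetical dual reward: discrete semantic accuracy plus continuous
# temporal IoU over (start, end) spans in seconds. All constants assumed.
def temporal_iou(pred: tuple[float, float], gold: tuple[float, float]) -> float:
    """Continuous reward in [0, 1]: overlap ratio of two time spans."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def dual_reward(pred_answer: str, gold_answer: str,
                pred_span: tuple[float, float], gold_span: tuple[float, float],
                w_sem: float = 0.5, w_temp: float = 0.5) -> float:
    # Discrete signal: exact-match answer accuracy (0 or 1).
    semantic = float(pred_answer.strip().lower() == gold_answer.strip().lower())
    return w_sem * semantic + w_temp * temporal_iou(pred_span, gold_span)


# Example: correct answer, predicted span [12s, 20s] vs. gold [10s, 18s].
print(dual_reward("a goal", "a goal", (12.0, 20.0), (10.0, 18.0)))  # 0.8
```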
arXiv Detail & Related papers (2025-06-02T17:28:26Z)
- Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning [58.86928947970342]
Embodied-R is a framework combining large-scale Vision-Language Models for perception and small-scale Language Models for reasoning. After training on only 5k embodied video samples, Embodied-R with a 3B LM matches state-of-the-art multimodal reasoning models. Embodied-R also exhibits emergent thinking patterns such as systematic analysis and contextual integration.
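The collaborative split this abstract describes, a large VLM for perception feeding a small LM for reasoning, can be sketched as a two-stage pipeline. Everything below (function names, the caption format, the prompt wording) is a hypothetical illustration of that division of labor, not Embodied-R's actual interface.

```python
# Hypothetical perception/reasoning split: a large VLM turns sampled frames
# into textual observations, and a small LM reasons over text alone.
from typing import Callable


def collaborative_answer(frames: list,
                         question: str,
                         vlm_describe: Callable[[object], str],
                         small_lm_reason: Callable[[str], str]) -> str:
    # Perception stage: the large VLM captions each sampled frame.
    observations = [f"t={i}: {vlm_describe(f)}" for i, f in enumerate(frames)]

    # Reasoning stage: the small LM answers from the captions.
    prompt = ("Observations:\n" + "\n".join(observations) +
              f"\nQuestion: {question}\nThink step by step, then answer.")
    return small_lm_reason(prompt)


# Toy usage with stand-in models:
ans = collaborative_answer(
    frames=["frame0", "frame1"],
    question="What happened?",
    vlm_describe=lambda f: f"a player near the ball ({f})",
    small_lm_reason=lambda p: "The player kicked the ball.",
)
```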
arXiv Detail & Related papers (2025-04-17T06:16:11Z)