SoccerMaster: A Vision Foundation Model for Soccer Understanding
- URL: http://arxiv.org/abs/2512.11016v1
- Date: Thu, 11 Dec 2025 18:03:30 GMT
- Title: SoccerMaster: A Vision Foundation Model for Soccer Understanding
- Authors: Haolin Yang, Jiayuan Rao, Haoning Wu, Weidi Xie
- Abstract summary: Soccer understanding has recently garnered growing research interest due to its domain-specific complexity and unique challenges. This work aims to propose a unified model to handle diverse soccer visual understanding tasks, ranging from fine-grained perception to semantic reasoning. We present SoccerMaster, the first soccer-specific vision foundation model that unifies diverse understanding tasks within a single framework.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Soccer understanding has recently garnered growing research interest due to its domain-specific complexity and unique challenges. Unlike prior works that typically rely on isolated, task-specific expert models, this work aims to propose a unified model to handle diverse soccer visual understanding tasks, ranging from fine-grained perception (e.g., athlete detection) to semantic reasoning (e.g., event classification). Specifically, our contributions are threefold: (i) we present SoccerMaster, the first soccer-specific vision foundation model that unifies diverse understanding tasks within a single framework via supervised multi-task pretraining; (ii) we develop an automated data curation pipeline to generate scalable spatial annotations, and integrate them with various existing soccer video datasets to construct SoccerFactory, a comprehensive pretraining data resource; and (iii) we conduct extensive evaluations demonstrating that SoccerMaster consistently outperforms task-specific expert models across diverse downstream tasks, highlighting its breadth and superiority. The data, code, and model will be publicly available.
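As a rough illustration of the supervised multi-task pretraining described in the abstract, the sketch below pairs a shared vision backbone with separate heads for athlete detection (fine-grained perception) and event classification (semantic reasoning), trained with a combined loss. The backbone, head shapes, loss weighting, and all hyperparameters are illustrative assumptions, not SoccerMaster's actual architecture.

```python
import torch
import torch.nn as nn

class MultiTaskSoccerModel(nn.Module):
    """Toy sketch: one shared visual backbone feeding task-specific heads.
    All components and sizes are assumptions for illustration only."""
    def __init__(self, feat_dim=256, num_events=24, max_boxes=22):
        super().__init__()
        # Shared backbone (stand-in for a real image/video encoder).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Fine-grained perception head: box coordinates per athlete slot.
        self.detect_head = nn.Linear(feat_dim, max_boxes * 4)
        # Semantic reasoning head: event class for the clip/frame.
        self.event_head = nn.Linear(feat_dim, num_events)

    def forward(self, frames):
        feat = self.backbone(frames)
        return self.detect_head(feat), self.event_head(feat)

model = MultiTaskSoccerModel()
frames = torch.randn(2, 3, 224, 224)      # dummy batch of frames
gt_boxes = torch.rand(2, 22 * 4)          # dummy normalized box targets
gt_event = torch.randint(0, 24, (2,))     # dummy event labels

pred_boxes, pred_event = model(frames)
# Supervised multi-task objective: sum of per-task losses
# (equal weighting here is an assumption).
loss = nn.functional.l1_loss(pred_boxes, gt_boxes) \
     + nn.functional.cross_entropy(pred_event, gt_event)
loss.backward()
```

In the paper, such heads would presumably sit on a much stronger encoder and be trained jointly over the SoccerFactory data; the sketch only shows how a single framework can serve both perception and reasoning targets.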
Related papers
- Multi-Modal Soccer Scene Analysis with Masked Pre-Training [16.853768247588743]
We propose a multi-modal architecture for analyzing soccer scenes from tactical camera footage. Our solution integrates three distinct input modalities into a unified framework. We show the effectiveness of our approach on a large-scale dataset.
arXiv Detail & Related papers (2025-12-22T16:18:45Z)
- DeepSport: A Multimodal Large Language Model for Comprehensive Sports Video Reasoning via Agentic Reinforcement Learning [25.001089287899998]
DeepSport is the first end-to-end trained MLLM framework designed for multi-task, multi-sport video understanding. Our work establishes a new foundation for domain-specific video reasoning to address the complexities of diverse sports.
arXiv Detail & Related papers (2025-11-17T02:57:15Z)
- SportR: A Benchmark for Multimodal Large Language Model Reasoning in Sports [21.410115837645318]
SportR is the first multi-sports large-scale benchmark designed to train and evaluate MLLMs on the fundamental reasoning required for sports intelligence. Our benchmark provides a dataset of 5,017 images and 2,101 videos. For the most advanced tasks requiring multi-step reasoning, such as determining penalties or explaining tactics, we provide 7,118 high-quality, human-authored Chain of Thought annotations.
arXiv Detail & Related papers (2025-11-09T18:55:20Z)
- Multi-Agent System for Comprehensive Soccer Understanding [74.84770843553845]
We construct SoccerWiki, the first large-scale multimodal soccer knowledge base, integrating rich domain knowledge about players, teams, referees, and venues. We present SoccerBench, the largest and most comprehensive soccer-specific benchmark, featuring around 10K multimodal (text, image, video) multi-choice QA pairs across 13 distinct tasks. We introduce SoccerAgent, a novel multi-agent system that decomposes complex soccer questions via collaborative reasoning.
arXiv Detail & Related papers (2025-05-06T17:59:31Z)
- Towards Universal Soccer Video Understanding [58.889409980618396]
This paper aims to build a comprehensive multi-modal framework for soccer understanding. We introduce SoccerReplay-1988, the largest multi-modal soccer dataset to date, featuring videos and detailed annotations from 1,988 complete matches. We present an advanced soccer-specific visual encoder, MatchVision, which leverages spatiotemporal information across soccer videos and excels in various downstream tasks.
arXiv Detail & Related papers (2024-12-02T18:58:04Z)
- Specialized Foundation Models Struggle to Beat Supervised Baselines [60.23386520331143]
We look at three modalities -- genomics, satellite imaging, and time series -- with multiple recent FMs and compare them to a standard supervised learning workflow. We find that it is consistently possible to train simple supervised models that match or even outperform the latest foundation models.
arXiv Detail & Related papers (2024-11-05T04:10:59Z)
- MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks [59.09343552273045]
We propose a decoder-only model for multimodal tasks, which is surprisingly effective at jointly learning these disparate vision-language tasks.
We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks.
Our model achieves the state of the art on image-text and text-image retrieval, video question answering, and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundation models (see the toy sketch of this joint objective after this list).
arXiv Detail & Related papers (2023-03-29T16:42:30Z)
- UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes [91.24112204588353]
We introduce UViM, a unified approach capable of modeling a wide range of computer vision tasks.
In contrast to previous models, UViM has the same functional form for all tasks.
We demonstrate the effectiveness of UViM on three diverse and challenging vision tasks.
arXiv Detail & Related papers (2022-05-20T17:47:59Z)
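The MaMMUT entry above describes joint learning of contrastive and generative vision-language objectives with maximal weight sharing. The minimal sketch below shows the general idea of combining both losses on shared modules; the toy encoder, recurrent text decoder, and pooling scheme are assumptions for illustration, not MaMMUT's actual decoder-only design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyJointVLM(nn.Module):
    """Toy joint contrastive + generative vision-language model.
    All components are illustrative stand-ins (assumptions)."""
    def __init__(self, vocab=1000, dim=128):
        super().__init__()
        self.img_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))
        self.tok_emb = nn.Embedding(vocab, dim)
        self.txt_dec = nn.GRU(dim, dim, batch_first=True)  # stand-in text decoder
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, images, tokens):
        img = F.normalize(self.img_enc(images), dim=-1)    # image embedding
        hidden, _ = self.txt_dec(self.tok_emb(tokens))     # decoder states
        txt = F.normalize(hidden[:, -1], dim=-1)           # pooled text embedding
        # Contrastive objective: match images with their own captions in the batch.
        logits = img @ txt.t() / 0.07
        targets = torch.arange(images.size(0))
        contrastive = (F.cross_entropy(logits, targets)
                       + F.cross_entropy(logits.t(), targets)) / 2
        # Generative objective: next-token prediction from decoder states
        # (a real model would also cross-attend to image features).
        lm_logits = self.lm_head(hidden[:, :-1])
        generative = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)),
                                     tokens[:, 1:].reshape(-1))
        return contrastive + generative   # shared weights serve both objectives

model = ToyJointVLM()
loss = model(torch.randn(4, 3, 32, 32), torch.randint(0, 1000, (4, 12)))
loss.backward()
```

The point of the sketch is only that both objectives reuse the same text and image parameters, which is the weight-sharing property the MaMMUT summary highlights.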