SportR: A Benchmark for Multimodal Large Language Model Reasoning in Sports
- URL: http://arxiv.org/abs/2511.06499v2
- Date: Mon, 17 Nov 2025 03:11:19 GMT
- Title: SportR: A Benchmark for Multimodal Large Language Model Reasoning in Sports
- Authors: Haotian Xia, Haonan Ge, Junbo Zou, Hyun Woo Choi, Xuebin Zhang, Danny Suradja, Botao Rui, Ethan Tran, Wendy Jin, Zhen Ye, Xiyang Lin, Christopher Lai, Shengjie Zhang, Junwen Miao, Shichao Chen, Rhys Tracy, Vicente Ordonez, Weining Shen, Hanjie Chen
- Abstract summary: SportR is the first multi-sport large-scale benchmark designed to train and evaluate MLLMs on the fundamental reasoning required for sports intelligence. The benchmark provides a dataset of 5,017 images and 2,101 videos. For the most advanced tasks requiring multi-step reasoning, such as determining penalties or explaining tactics, it provides 7,118 high-quality, human-authored Chain of Thought annotations.
- Score: 21.410115837645318
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Deeply understanding sports requires an intricate blend of fine-grained visual perception and rule-based reasoning - a challenge that pushes the limits of current multimodal models. To succeed, models must master three critical capabilities: perceiving nuanced visual details, applying abstract sport rule knowledge, and grounding that knowledge in specific visual evidence. Current sports benchmarks either cover single sports or lack the detailed reasoning chains and precise visual grounding needed to robustly evaluate these core capabilities in a multi-sport context. To address this gap, we introduce SportR, the first large-scale multi-sport benchmark designed to train and evaluate MLLMs on the fundamental reasoning required for sports intelligence. Our benchmark provides a dataset of 5,017 images and 2,101 videos. To enable granular evaluation, we structure our benchmark around a progressive hierarchy of question-answer (QA) pairs designed to probe reasoning at increasing depths - from simple infraction identification to complex penalty prediction. For the most advanced tasks requiring multi-step reasoning, such as determining penalties or explaining tactics, we provide 7,118 high-quality, human-authored Chain of Thought (CoT) annotations. In addition, our benchmark incorporates both image and video modalities and provides manual bounding box annotations to directly test visual grounding on the image subset. Extensive experiments demonstrate the profound difficulty of our benchmark. State-of-the-art baseline models perform poorly on our most challenging tasks. While training on our data via Supervised Fine-Tuning and Reinforcement Learning improves these scores, they remain relatively low, highlighting a significant gap in current model capabilities. SportR presents a new challenge for the community, providing a critical resource to drive future research in multimodal sports reasoning.
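The abstract names the benchmark's building blocks - QA pairs at increasing reasoning depths, human-authored CoT rationales, and manual bounding boxes for image grounding. A minimal sketch of what a single annotation record could look like is below; the field names and the `validate` checks are assumptions for illustration, not the benchmark's actual published schema.

```python
# Hypothetical sketch of a single SportR-style annotation record, based only
# on the components named in the abstract (leveled QA pairs, optional CoT
# rationale, bounding boxes for image grounding). Field names are assumed.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BBox:
    # Pixel coordinates of a grounded region: (x_min, y_min, x_max, y_max).
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def is_valid(self) -> bool:
        # A box must have positive width and height.
        return self.x_min < self.x_max and self.y_min < self.y_max


@dataclass
class QARecord:
    media_path: str            # path to an image or video clip
    sport: str                 # e.g. "soccer", "basketball"
    level: int                 # 1 = infraction ID, higher = penalty prediction
    question: str
    answer: str
    cot: Optional[str] = None  # human-authored chain-of-thought, if provided
    boxes: List[BBox] = field(default_factory=list)  # image-only grounding


def validate(rec: QARecord) -> bool:
    # Minimal sanity checks a dataset loader might run over each record.
    return bool(rec.question and rec.answer) and all(b.is_valid() for b in rec.boxes)


rec = QARecord(
    media_path="images/foul_001.jpg",
    sport="soccer",
    level=2,
    question="What infraction, if any, occurred in the highlighted region?",
    answer="Handball by the defender.",
    cot="The defender's arm is extended away from the body and contacts the ball...",
    boxes=[BBox(120, 80, 260, 300)],
)
print(validate(rec))  # True
```

A loader like this makes the paper's "progressive hierarchy" concrete: filtering records by `level` yields per-depth evaluation splits, and the optional `cot` field separates the 7,118 CoT-annotated items from the rest.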
Related papers
- Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images [53.373427633330515]
We propose DRIM, a model that enables deep but reliable multi-turn reasoning when thinking with images in its multimodal CoT. Based on a high-resolution image dataset, we construct high-difficulty and verifiable visual question-answer pairs. In the SFT stage, we collect tool trajectories as cold-start data, guiding a multi-turn reasoning pattern. In the RL stage, we introduce redundancy-penalized policy optimization, which incentivizes the model to develop a self-reflective reasoning pattern.
arXiv Detail & Related papers (2025-12-19T07:44:43Z) - SoccerMaster: A Vision Foundation Model for Soccer Understanding [50.88251190999469]
Soccer understanding has recently garnered growing research interest due to its domain-specific complexity and unique challenges. This work aims to propose a unified model to handle diverse soccer visual understanding tasks, ranging from fine-grained perception to semantic reasoning. We present SoccerMaster, the first soccer-specific vision foundation model that unifies diverse understanding tasks within a single framework.
arXiv Detail & Related papers (2025-12-11T18:03:30Z) - Learning Skill-Attributes for Transferable Assessment in Video [56.813876909367856]
Skill assessment from video entails rating the quality of a person's physical performance and explaining what could be done better. Our CrossTrainer approach discovers skill-attributes, such as balance, control, and hand positioning. By abstracting out the shared behaviors indicative of human skill, the proposed video representation generalizes substantially better than an array of existing techniques.
arXiv Detail & Related papers (2025-11-17T23:53:06Z) - DeepSport: A Multimodal Large Language Model for Comprehensive Sports Video Reasoning via Agentic Reinforcement Learning [25.001089287899998]
DeepSport is the first end-to-end trained MLLM framework designed for multi-task, multi-sport video understanding. Our work establishes a new foundation for domain-specific video reasoning to address the complexities of diverse sports.
arXiv Detail & Related papers (2025-11-17T02:57:15Z) - FineQuest: Adaptive Knowledge-Assisted Sports Video Understanding via Agent-of-Thoughts Reasoning [10.942503187642851]
FineQuest is the first training-free framework that leverages dual-mode reasoning inspired by cognitive science. FineQuest incorporates SSGraph, a multimodal sports knowledge scene graph spanning nine sports. We introduce two new sports VideoQA benchmarks, Gym-QA and Diving-QA, derived from the FineGym and FineDiving datasets.
arXiv Detail & Related papers (2025-09-15T11:27:23Z) - SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models [93.73583158211115]
Achieving fine-grained spatio-temporal understanding in videos remains a major challenge for current Video Large Multimodal Models (Video LMMs). We contribute in three core aspects: dataset, model, and benchmark. First, we introduce SAMA-239K, a large-scale dataset comprising 15K videos, built specifically to enable joint learning of video understanding, grounding, and multi-turn video chat. Second, we propose the SAMA model, which incorporates a versatile spatio-temporal context aggregator and a Segment Anything Model to jointly enhance fine-grained video comprehension and precise grounding capabilities.
arXiv Detail & Related papers (2025-05-24T18:13:16Z) - STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training [87.58996020705258]
Video Large Language Models (Video-LLMs) have recently shown strong performance in basic video understanding tasks. However, Video-LLMs struggle with compositional reasoning that requires multi-step explicit spatio-temporal inference across object relations, interactions, and events. We propose STEP, a novel graph-guided self-training method that enables Video-LLMs to generate reasoning-rich fine-tuning data from any raw videos to improve themselves.
arXiv Detail & Related papers (2024-11-29T11:54:55Z) - SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models [15.062299319625701]
SPORTU is a benchmark designed to assess Multimodal Large Language Models (MLLMs) across multi-level sports reasoning tasks. SPORTU comprises two key components: SPORTU-text, featuring 900 multiple-choice questions with human-annotated explanations for rule comprehension and strategy understanding, and SPORTU-video, consisting of 1,701 slow-motion video clips across 7 different sports and 12,048 QA pairs designed to assess multi-level reasoning.
arXiv Detail & Related papers (2024-10-11T02:58:38Z) - Sports Intelligence: Assessing the Sports Understanding Capabilities of Language Models through Question Answering from Text to Video [5.885902974241053]
Reasoning over complex sports scenarios has posed significant challenges to current NLP technologies.
Our evaluation spans from simple queries on basic rules and historical facts to complex, context-specific reasoning.
We propose a new benchmark based on a comprehensive overview of existing sports datasets and provide extensive error analysis.
arXiv Detail & Related papers (2024-06-21T05:57:50Z) - Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional Sports [104.40202007324633]
We introduce the first dataset, named Sports-QA, specifically designed for the sports VideoQA task. The Sports-QA dataset includes various types of questions, such as descriptions, chronologies, causalities, and counterfactual conditions. We propose a new Auto-Focus Transformer (AFT) capable of automatically focusing on particular scales of temporal information for question answering.
arXiv Detail & Related papers (2024-01-03T02:22:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.