TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies?
- URL: http://arxiv.org/abs/2509.15602v2
- Date: Mon, 22 Sep 2025 14:29:06 GMT
- Title: TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies?
- Authors: Zhongyuan Bao, Lejun Zhang,
- Abstract summary: TennisTV is the first and most comprehensive benchmark for tennis video understanding. It covers 9 tasks from the stroke level to the rally level and includes 2943 human-verified questions.
- Score: 0.41251872802992456
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal large language models (MLLMs) excel at general video understanding but struggle with fast, high-frequency sports like tennis, where rally clips are short yet information-dense. To systematically evaluate MLLMs in this challenging domain, we present TennisTV, the first and most comprehensive benchmark for tennis video understanding. TennisTV models each rally as a temporal-ordered sequence of consecutive stroke events, using automated pipelines for filtering and question generation. It covers 9 tasks from the stroke level to the rally level and includes 2943 human-verified questions. Evaluating 17 representative MLLMs, we provide the first systematic assessment of tennis video understanding. Results reveal substantial shortcomings and yield two key insights: (i) frame-sampling density should be tailored and balanced across tasks, and (ii) improving temporal grounding is essential for stronger reasoning.
Related papers
- SportR: A Benchmark for Multimodal Large Language Model Reasoning in Sports [21.410115837645318]
SportR is the first multi-sports large-scale benchmark designed to train and evaluate MLLMs on the fundamental reasoning required for sports intelligence. Our benchmark provides a dataset of 5,017 images and 2,101 videos. For the most advanced tasks requiring multi-step reasoning, such as determining penalties or explaining tactics, we provide 7,118 high-quality, human-authored Chain of Thought annotations.
arXiv Detail & Related papers (2025-11-09T18:55:20Z)
- A Survey on Video Temporal Grounding with Multimodal Large Language Model [107.24431595873808]
Recent advancement in video temporal grounding (VTG) has significantly enhanced fine-grained video understanding. With superior multimodal comprehension and reasoning abilities, VTG approaches based on MLLMs (VTG-MLLMs) are gradually surpassing traditional fine-tuned methods. Despite extensive surveys on general video-language understanding, comprehensive reviews specifically addressing VTG-MLLMs remain scarce.
arXiv Detail & Related papers (2025-08-07T08:52:11Z)
- Enhancing Sports Strategy with Video Analytics and Data Mining: Assessing the effectiveness of Multimodal LLMs in tennis video analysis [0.0]
We aim to assess the effectiveness of MLLMs in analysing sports videos, focusing mainly on tennis videos. Despite research done on tennis analysis, there remains a gap in models that are able to understand and identify the sequence of events in a tennis rally. We will mainly assess the MLLMs on their ability to fill this gap - to classify tennis actions, as well as their ability to identify these actions in a sequence of tennis actions in a rally.
arXiv Detail & Related papers (2025-06-24T06:08:35Z)
- SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models [80.3895950009792]
Achieving fine-grained spatio-temporal understanding in videos remains a major challenge for current Video Large Multimodal Models (Video LMMs). We contribute in three core aspects: dataset, model, and benchmark. First, we introduce SAMA-239K, a large-scale dataset comprising 15K videos specifically curated to enable joint learning of video understanding, grounding, and multi-turn video chat. Second, we propose the SAMA model, which incorporates a versatile spatio-temporal context aggregator and a Segment Anything Model to jointly enhance fine-grained video comprehension and precise grounding capabilities.
arXiv Detail & Related papers (2025-05-24T18:13:16Z)
- Towards Universal Soccer Video Understanding [58.889409980618396]
This paper aims to develop a comprehensive multi-modal framework for soccer understanding. We introduce SoccerReplay-1988, the largest multi-modal soccer dataset to date, featuring videos and detailed annotations from 1,988 complete matches. We present an advanced soccer-specific visual encoder, MatchVision, which leverages spatiotemporal information across soccer videos and excels in various downstream tasks.
arXiv Detail & Related papers (2024-12-02T18:58:04Z)
- Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level [63.18855743293851]
Motion-Grounded Video Reasoning is a new motion understanding task that requires visual answers (video segmentation masks) according to the input question. This task extends existing grounding work on explicit action/motion grounding to a more general format by enabling implicit reasoning via questions. We introduce a novel baseline model named Motion-Grounded Video Reasoning Assistant (MORA).
arXiv Detail & Related papers (2024-11-15T03:45:09Z)
- SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models [15.062299319625701]
SPORTU is a benchmark designed to assess Multimodal Large Language Models (MLLMs) across multi-level sports reasoning tasks. SPORTU comprises two key components: SPORTU-text, featuring 900 multiple-choice questions with human-annotated explanations for rule comprehension and strategy understanding. SPORTU-video consists of 1,701 slow-motion video clips across 7 different sports and 12,048 QA pairs, designed to assess multi-level reasoning.
arXiv Detail & Related papers (2024-10-11T02:58:38Z)
- Classification of Tennis Actions Using Deep Learning [0.0]
We investigate the potential and the challenges of using deep learning to classify tennis actions.
Three models of different size were trained and evaluated on the academic tennis dataset THETIS.
The best models achieve a generalization accuracy of 74 %, demonstrating a good performance for tennis action classification.
arXiv Detail & Related papers (2024-02-04T15:48:20Z)
- MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [63.14000659130736]
We introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench.
We first introduce a novel static-to-dynamic method to define these temporal-related tasks.
Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task.
arXiv Detail & Related papers (2023-11-28T17:59:04Z)
- P2ANet: A Dataset and Benchmark for Dense Action Detection from Table Tennis Match Broadcasting Videos [64.57435509822416]
This work consists of 2,721 video clips collected from the broadcasting videos of professional table tennis matches in World Table Tennis Championships and Olympiads.
We formulate two sets of action detection problems -- action localization and action recognition.
The results confirm that P2ANet is still a challenging task and can be used as a special benchmark for dense action detection from videos.
arXiv Detail & Related papers (2022-07-26T08:34:17Z)
- Temporally-Aware Feature Pooling for Action Spotting in Soccer Broadcasts [86.56462654572813]
We focus our analysis on action spotting in soccer broadcast, which consists in temporally localizing the main actions in a soccer game.
We propose a novel feature pooling method based on NetVLAD, dubbed NetVLAD++, that embeds temporally-aware knowledge.
We train and evaluate our methodology on the recent large-scale dataset SoccerNet-v2, reaching 53.4% Average-mAP for action spotting.
arXiv Detail & Related papers (2021-04-14T11:09:03Z)
- TTNet: Real-time temporal and spatial video analysis of table tennis [5.156484100374058]
We present a neural network aimed at real-time processing of high-resolution table tennis videos.
This approach gives core information for reasoning score updates by an auto-referee system.
We publish a multi-task dataset OpenTTGames with videos of table tennis games in 120 fps labeled with events.
arXiv Detail & Related papers (2020-04-21T11:57:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.