SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models
- URL: http://arxiv.org/abs/2410.08474v2
- Date: Sat, 19 Oct 2024 08:17:17 GMT
- Title: SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models
- Authors: Haotian Xia, Zhengbang Yang, Junbo Zou, Rhys Tracy, Yuqing Wang, Chi Lu, Christopher Lai, Yanjun He, Xun Shao, Zhuoqing Xie, Yuan-fang Wang, Weining Shen, Hanjie Chen
- Abstract summary: SPORTU is a benchmark designed to assess Multimodal Large Language Models (MLLMs) across multi-level sports reasoning tasks.
SPORTU comprises two key components: SPORTU-text, featuring 900 multiple-choice questions with human-annotated explanations for rule comprehension and strategy understanding,
and SPORTU-video, consisting of 1,701 slow-motion video clips across 7 different sports and 12,048 QA pairs, designed to assess multi-level reasoning.
- Score: 15.062299319625701
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) are advancing the ability to reason about complex sports scenarios by integrating textual and visual information. To comprehensively evaluate their capabilities, we introduce SPORTU, a benchmark designed to assess MLLMs across multi-level sports reasoning tasks. SPORTU comprises two key components: SPORTU-text, featuring 900 multiple-choice questions with human-annotated explanations for rule comprehension and strategy understanding, which tests models' ability to reason about sports solely through question answering (QA), without visual inputs; and SPORTU-video, consisting of 1,701 slow-motion video clips across 7 different sports and 12,048 QA pairs, designed to assess multi-level reasoning, from simple sports recognition to complex tasks like foul detection and rule application. On SPORTU-text, we evaluate four prevalent LLMs using few-shot learning supplemented by chain-of-thought (CoT) prompting. GPT-4o achieves the highest accuracy of 71% but still falls short of human-level performance, highlighting room for improvement in rule comprehension and reasoning. The evaluation of SPORTU-video covers 7 proprietary and 6 open-source MLLMs. Experiments show that models fall short on hard tasks that require deep reasoning and rule-based understanding. Claude-3.5-Sonnet performs best with only 52.6% accuracy on the hard task, leaving substantial room for improvement. We hope that SPORTU will serve as a critical step toward evaluating models' capabilities in sports understanding and reasoning.
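The SPORTU-text protocol described in the abstract is a standard few-shot multiple-choice setup with CoT prompting, which can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the demo question, the soccer item, and the call_model stub are invented for this sketch and are not drawn from SPORTU or the authors' evaluation code.

```python
# Minimal sketch of a few-shot chain-of-thought (CoT) evaluation loop for
# multiple-choice sports QA, in the style SPORTU-text describes. All prompts
# and data below are illustrative assumptions, not the paper's actual harness.
import re

FEW_SHOT_DEMO = """\
Q: In volleyball, how many touches may one team use before the ball must cross the net?
Options: (A) 2  (B) 3  (C) 4  (D) 5
Reasoning: A team may contact the ball up to three times (a block does not count) before returning it.
Answer: B
"""

def build_prompt(question, options):
    """Few-shot CoT prompt: one worked demo, then the target question."""
    opts = "  ".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
    return f"{FEW_SHOT_DEMO}\nQ: {question}\nOptions: {opts}\nReasoning:"

def call_model(prompt):
    """Hypothetical stand-in for an LLM API call; replace with a real client."""
    return "A deliberate handball by a defender in their own penalty area concedes a penalty. Answer: C"

def extract_choice(completion):
    """Pull the final letter choice out of the model's CoT completion."""
    match = re.search(r"Answer:\s*([A-D])", completion)
    return match.group(1) if match else None

def evaluate(items):
    """Fraction of items where the extracted choice matches the gold label."""
    correct = sum(
        extract_choice(call_model(build_prompt(it["question"], it["options"]))) == it["gold"]
        for it in items
    )
    return correct / len(items)

sample = [{
    "question": "In soccer, which restart is awarded for a deliberate handball "
                "by a defender inside their own penalty area?",
    "options": ["Corner kick", "Indirect free kick", "Penalty kick", "Throw-in"],
    "gold": "C",
}]
print(f"accuracy: {evaluate(sample):.0%}")  # the stubbed model answers C, so 100%
```

A real run would swap call_model for an actual API client and report accuracy per task level, as the paper does for its four evaluated LLMs.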
Related papers
- Deep learning for action spotting in association football videos [64.10841325879996]
The SoccerNet initiative organizes yearly challenges in which participants from around the world compete to achieve state-of-the-art performance.
This paper traces the history of action spotting in sports, from the creation of the task back in 2018, to the role it plays today in research and the sports industry.
arXiv Detail & Related papers (2024-10-02T07:56:15Z) - OnlySportsLM: Optimizing Sports-Domain Language Models with SOTA Performance under a Billion Parameters [3.2586293270380717]
This paper explores the potential of a small, domain-specific language model trained exclusively on sports-related data.
OnlySportsLM achieves a 37.62%/34.08% accuracy improvement over previous 135M/360M state-of-the-art models.
arXiv Detail & Related papers (2024-08-30T22:39:35Z) - Sports Intelligence: Assessing the Sports Understanding Capabilities of Language Models through Question Answering from Text to Video [5.885902974241053]
Reasoning over complex sports scenarios has posed significant challenges to current NLP technologies.
Our evaluation spans from simple queries on basic rules and historical facts to complex, context-specific reasoning.
We propose a new benchmark based on a comprehensive overview of existing sports datasets and provide extensive error analysis.
arXiv Detail & Related papers (2024-06-21T05:57:50Z) - Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis [118.08008540513596]
Video-MME is the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis.
We extensively evaluate various state-of-the-art MLLMs, including the GPT-4 series and Gemini 1.5 Pro, as well as open-source image and video models.
Our experiments reveal that Gemini 1.5 Pro is the best-performing commercial model, significantly outperforming the open-source models.
arXiv Detail & Related papers (2024-05-31T17:59:47Z) - SportQA: A Benchmark for Sports Understanding in Large Language Models [12.860652075730876]
SportQA is a benchmark specifically designed for evaluating Large Language Models (LLMs) in the context of sports understanding.
SportQA encompasses over 70,000 multiple-choice questions across three distinct difficulty levels.
Our results reveal that while LLMs exhibit competent performance in basic sports knowledge, they struggle with more complex, scenario-based sports reasoning.
arXiv Detail & Related papers (2024-02-24T17:12:10Z) - Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex
and Professional Sports [90.79212954022218]
We introduce Sports-QA, the first dataset specifically designed for the sports VideoQA task.
The Sports-QA dataset includes various types of questions, such as descriptions, chronologies, causalities, and counterfactual conditions.
We propose a new Auto-Focus Transformer (AFT) capable of automatically focusing on particular scales of temporal information for question answering.
arXiv Detail & Related papers (2024-01-03T02:22:34Z) - Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models [122.19845578690466]
Step-Back Prompting enables LLMs to perform abstraction, deriving high-level concepts and first principles from instances containing specific details.
Using these concepts and principles to guide reasoning, LLMs significantly improve their ability to follow a correct reasoning path toward the solution (a minimal sketch of this two-stage pattern appears after this list).
arXiv Detail & Related papers (2023-10-09T19:48:55Z) - TouchStone: Evaluating Vision-Language Models by Language Models [91.69776377214814]
We propose an evaluation method that uses strong large language models as judges to comprehensively evaluate the various abilities of LVLMs.
We construct a comprehensive visual dialogue dataset TouchStone, consisting of open-world images and questions, covering five major categories of abilities and 27 subtasks.
We demonstrate that powerful LVLMs, such as GPT-4, can effectively score dialogue quality by leveraging their textual capabilities alone.
arXiv Detail & Related papers (2023-08-31T17:52:04Z) - DeepSportradar-v1: Computer Vision Dataset for Sports Understanding with High Quality Annotations [3.000319651350124]
This paper introduces DeepSportradar-v1, a suite of computer vision tasks, datasets and benchmarks for automated sport understanding.
The main purpose of this framework is to close the gap between academic research and real-world settings.
arXiv Detail & Related papers (2022-08-17T09:55:02Z) - A Survey on Video Action Recognition in Sports: Datasets, Methods and Applications [60.3327085463545]
We present a survey on video action recognition for sports analytics.
We introduce more than ten types of sports, including team sports such as football, basketball, volleyball, and hockey, and individual sports such as figure skating, gymnastics, table tennis, diving, and badminton.
We develop a toolbox using PaddlePaddle, which supports football, basketball, table tennis and figure skating action recognition.
arXiv Detail & Related papers (2022-06-02T13:19:36Z)
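As flagged in the Step-Back Prompting entry above, here is a minimal sketch of that two-stage pattern: first elicit the governing principle behind a question ("step back"), then answer the original question conditioned on that principle. The prompt wording and the call_model stub are assumptions made for illustration; they are not the paper's released prompts.

```python
# Two-stage sketch of Step-Back Prompting. call_model() is a hypothetical
# stand-in for any LLM completion API; prompt phrasing is illustrative only.

def call_model(prompt):
    """Stub: replace with a real LLM call."""
    return "(model output)"

def step_back_answer(question):
    # Stage 1: abstraction. Ask for the underlying concept or rule.
    abstraction_prompt = (
        f"Question: {question}\n"
        "What general principle or rule is needed to answer this question? "
        "State it concisely."
    )
    principle = call_model(abstraction_prompt)

    # Stage 2: grounded reasoning. Answer the original question using the principle.
    reasoning_prompt = (
        f"Principle: {principle}\n"
        f"Question: {question}\n"
        "Using the principle above, reason step by step and give the final answer."
    )
    return call_model(reasoning_prompt)

print(step_back_answer("Is a goal valid if the ball deflects in off the referee?"))
```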
This list is automatically generated from the titles and abstracts of the papers on this site.