SportQA: A Benchmark for Sports Understanding in Large Language Models
- URL: http://arxiv.org/abs/2402.15862v2
- Date: Tue, 18 Jun 2024 03:29:51 GMT
- Title: SportQA: A Benchmark for Sports Understanding in Large Language Models
- Authors: Haotian Xia, Zhengbang Yang, Yuqing Wang, Rhys Tracy, Yun Zhao, Dongdong Huang, Zezhi Chen, Yan Zhu, Yuan-fang Wang, Weining Shen
- Abstract summary: SportQA is a benchmark specifically designed for evaluating Large Language Models (LLMs) in the context of sports understanding.
SportQA encompasses over 70,000 multiple-choice questions across three distinct difficulty levels.
Our results reveal that while LLMs exhibit competent performance in basic sports knowledge, they struggle with more complex, scenario-based sports reasoning.
- Score: 12.860652075730876
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: A deep understanding of sports, a field rich in strategic and dynamic content, is crucial for advancing Natural Language Processing (NLP). This holds particular significance in the context of evaluating and advancing Large Language Models (LLMs), given the existing gap in specialized benchmarks. To bridge this gap, we introduce SportQA, a novel benchmark specifically designed for evaluating LLMs in the context of sports understanding. SportQA encompasses over 70,000 multiple-choice questions across three distinct difficulty levels, each targeting different aspects of sports knowledge from basic historical facts to intricate, scenario-based reasoning tasks. We conducted a thorough evaluation of prevalent LLMs, mainly utilizing few-shot learning paradigms supplemented by chain-of-thought (CoT) prompting. Our results reveal that while LLMs exhibit competent performance in basic sports knowledge, they struggle with more complex, scenario-based sports reasoning, lagging behind human expertise. The introduction of SportQA marks a significant step forward in NLP, offering a tool for assessing and enhancing sports understanding in LLMs.
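The evaluation protocol described in the abstract (few-shot prompting supplemented by chain-of-thought on multiple-choice questions) can be pictured with a short sketch. The snippet below is illustrative only and is not the authors' released evaluation code; the example questions, the `Answer:` parsing convention, and the `query_llm` callable are assumptions standing in for whatever model API is under test.

```python
# Minimal sketch of few-shot chain-of-thought (CoT) evaluation on a
# multiple-choice sports question. Not the SportQA authors' code;
# `query_llm` is a placeholder for the model API being evaluated.
import re
from typing import Callable

FEW_SHOT_COT = """\
Q: In basketball, how many points is a successful free throw worth?
Options: (A) 1 (B) 2 (C) 3 (D) 4
Reasoning: A free throw is an unguarded shot awarded after a foul, and each made free throw counts for one point.
Answer: A

Q: Which country hosted the 2016 Summer Olympics?
Options: (A) China (B) Brazil (C) Japan (D) United Kingdom
Reasoning: The 2016 Summer Olympics took place in Rio de Janeiro, which is in Brazil.
Answer: B
"""

def build_prompt(question: str, options: list[str]) -> str:
    """Append the test question to the few-shot CoT block and ask for reasoning first."""
    opts = " ".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
    return f"{FEW_SHOT_COT}\nQ: {question}\nOptions: {opts}\nReasoning:"

def is_correct(question: str, options: list[str], gold: str,
               query_llm: Callable[[str], str]) -> bool:
    """Mark the item correct if the completion's final 'Answer: X' matches the gold letter."""
    completion = query_llm(build_prompt(question, options))
    match = re.search(r"Answer:\s*([A-D])", completion)
    return bool(match) and match.group(1).upper() == gold.upper()
```

Accuracy on a difficulty level would then simply be the mean of `is_correct` over that level's questions.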
Related papers
- LLM The Genius Paradox: A Linguistic and Math Expert's Struggle with Simple Word-based Counting Problems [28.72485319617863]
LLMs struggle with some basic tasks that humans find trivial to handle, e.g., counting the number of occurrences of the character 'r' in the word 'strawberry'.
We measure transferability of advanced mathematical and coding reasoning capabilities from specialized LLMs to simple counting tasks.
Compared with strategies such as finetuning and in-context learning, we show that engaging reasoning is the most robust and efficient way to help LLMs better perceive tasks, as illustrated in the sketch after this entry.
arXiv Detail & Related papers (2024-10-18T04:17:16Z)
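As a rough illustration of the contrast this entry draws between direct answering and engaged reasoning, the sketch below builds two prompt variants for the letter-counting example. It is not code from the paper; `query_llm` and the exact prompt wording are hypothetical.

```python
# Hypothetical sketch contrasting a direct prompt with a reasoning-engaging
# prompt for the "count the r's in strawberry" task. Not from the paper;
# `query_llm` stands in for the model under test.
from typing import Callable

def ground_truth(word: str, letter: str) -> int:
    # Trivial to compute programmatically; the question is whether the LLM gets it right.
    return word.count(letter)

def direct_prompt(word: str, letter: str) -> str:
    return (f"How many times does the letter '{letter}' appear in the word "
            f"'{word}'? Reply with a number only.")

def reasoning_prompt(word: str, letter: str) -> str:
    # Spelling the word out letter by letter engages explicit reasoning
    # before the model commits to a count.
    return (f"Spell the word '{word}' one letter at a time, marking every "
            f"'{letter}'. Then count the marks and finish with 'Count: <number>'.")

def compare(word: str, letter: str, query_llm: Callable[[str], str]) -> dict[str, str]:
    """Collect both answers so they can be checked against ground_truth()."""
    return {"direct": query_llm(direct_prompt(word, letter)),
            "reasoning": query_llm(reasoning_prompt(word, letter))}

# e.g., ground_truth("strawberry", "r") == 3
```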
- SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models [15.062299319625701]
SPORTU is a benchmark designed to assess Multimodal Large Language Models (MLLMs) across multi-level sports reasoning tasks.
SPORTU comprises two key components. SPORTU-text features 900 multiple-choice questions with human-annotated explanations for rule comprehension and strategy understanding.
SPORTU-video consists of 1,701 slow-motion video clips across 7 different sports and 12,048 QA pairs, designed to assess multi-level reasoning.
arXiv Detail & Related papers (2024-10-11T02:58:38Z)
- Sports Intelligence: Assessing the Sports Understanding Capabilities of Language Models through Question Answering from Text to Video [5.885902974241053]
Reasoning over complex sports scenarios has posed significant challenges to current NLP technologies.
Our evaluation spans from simple queries on basic rules and historical facts to complex, context-specific reasoning.
We propose a new benchmark based on a comprehensive overview of existing sports datasets and provide extensive error analysis.
arXiv Detail & Related papers (2024-06-21T05:57:50Z)
- Automate Knowledge Concept Tagging on Math Questions with LLMs [48.5585921817745]
Knowledge concept tagging for questions plays a crucial role in contemporary intelligent educational applications.
Traditionally, these annotations have been conducted manually with help from pedagogical experts.
In this paper, we explore automating the tagging task using Large Language Models (LLMs).
arXiv Detail & Related papers (2024-03-26T00:09:38Z)
- FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z)
- GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations [87.99872683336395]
Large Language Models (LLMs) are integrated into critical real-world applications.
This paper evaluates LLMs' reasoning abilities in competitive environments.
We first propose GTBench, a language-driven environment comprising 10 widely recognized tasks.
arXiv Detail & Related papers (2024-02-19T18:23:36Z)
- Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional Sports [90.79212954022218]
We introduce the first dataset, named Sports-QA, specifically designed for the sports VideoQA task.
The Sports-QA dataset includes various types of questions, such as descriptions, chronologies, causalities, and counterfactual conditions.
We propose a new Auto-Focus Transformer (AFT) capable of automatically focusing on particular scales of temporal information for question answering.
arXiv Detail & Related papers (2024-01-03T02:22:34Z)
- Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations more comprehensively reveal how proficiently language models understand questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
- Through the Lens of Core Competency: Survey on Evaluation of Large Language Models [27.271533306818732]
Large language models (LLMs) have excellent performance and wide practical uses.
Existing evaluation tasks struggle to keep up with the wide range of applications in real-world scenarios.
We summarize 4 core competencies of LLMs: reasoning, knowledge, reliability, and safety.
Under this competency architecture, similar tasks are combined to reflect the corresponding ability, and new tasks can easily be added to the system.
arXiv Detail & Related papers (2023-08-15T17:40:34Z)
- Sentiment Analysis in the Era of Large Language Models: A Reality Check [69.97942065617664]
This paper investigates the capabilities of large language models (LLMs) in performing various sentiment analysis tasks.
We evaluate performance across 13 tasks on 26 datasets and compare the results against small language models (SLMs) trained on domain-specific datasets.
arXiv Detail & Related papers (2023-05-24T10:45:25Z)