When Reasoning Meets Information Aggregation: A Case Study with Sports Narratives
- URL: http://arxiv.org/abs/2406.12084v2
- Date: Fri, 04 Oct 2024 04:25:07 GMT
- Title: When Reasoning Meets Information Aggregation: A Case Study with Sports Narratives
- Authors: Yebowen Hu, Kaiqiang Song, Sangwoo Cho, Xiaoyang Wang, Wenlin Yao, Hassan Foroosh, Dong Yu, Fei Liu
- Abstract summary: We examine the critical role of information aggregation in reasoning by requiring the LLM to analyze sports narratives.
We conduct comprehensive experiments with real NBA basketball data and present SportsGen, a new method to synthesize game narratives.
Our findings show that most models, including GPT-4o, often fail to accurately aggregate basketball scores due to frequent scoring patterns.
- Score: 46.04238534224658
- Abstract: Reasoning is most powerful when an LLM accurately aggregates relevant information. We examine the critical role of information aggregation in reasoning by requiring the LLM to analyze sports narratives. To succeed at this task, an LLM must infer points from actions, identify related entities, attribute points accurately to players and teams, and compile key statistics to draw conclusions. We conduct comprehensive experiments with real NBA basketball data and present SportsGen, a new method to synthesize game narratives. By synthesizing data, we can rigorously evaluate LLMs' reasoning capabilities under complex scenarios with varying narrative lengths and density of information. Our findings show that most models, including GPT-4o, often fail to accurately aggregate basketball scores due to frequent scoring patterns. Open-source models like Llama-3 further suffer from significant score hallucinations. Finally, the effectiveness of reasoning is influenced by narrative complexity, information density, and domain-specific terms, highlighting the challenges in analytical reasoning tasks.
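To make the aggregation task concrete, here is a minimal Python sketch (a hypothetical illustration; the event schema, team names, and point values are assumptions, not the paper's actual data or code) of the per-team score compilation that models are asked to perform over narrative text:

```python
from collections import defaultdict

# Hypothetical play-by-play events: (team, action, points scored).
events = [
    ("Lakers", "three-pointer", 3),
    ("Celtics", "layup", 2),
    ("Lakers", "free throw", 1),
    ("Celtics", "three-pointer", 3),
]

def aggregate_scores(events):
    """Attribute each scoring action to its team and compile totals."""
    totals = defaultdict(int)
    for team, _action, points in events:
        totals[team] += points
    return dict(totals)

predicted = aggregate_scores(events)        # ideally, the model's tally
ground_truth = {"Lakers": 4, "Celtics": 5}  # assumed box score for this example

# Evaluation: does the aggregated score match the ground truth?
print(predicted == ground_truth)  # True
```

An LLM reading a game narrative must perform this tally implicitly, which is where the paper finds that frequent scoring patterns lead to aggregation errors.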
Related papers
- Beyond Outcomes: Transparent Assessment of LLM Reasoning in Games [54.49589494014147]
GAMEBoT is a gaming arena designed for rigorous assessment of Large Language Models.
We benchmark 17 prominent LLMs across eight games, encompassing various strategic abilities and game characteristics.
Our results suggest that GAMEBoT presents a significant challenge, even when LLMs are provided with detailed CoT prompts.
arXiv Detail & Related papers (2024-12-18T08:32:53Z)
- GameArena: Evaluating LLM Reasoning through Live Computer Games [25.415321902887598]
We introduce GameArena, a benchmark for evaluating the reasoning capabilities of large language models (LLMs) through interactive gameplay with humans.
GameArena consists of three games to test specific reasoning capabilities (e.g., deductive and inductive reasoning) while keeping participants entertained and engaged.
We collect over 2000 game sessions and provide detailed assessments of various reasoning capabilities for five state-of-the-art LLMs.
arXiv Detail & Related papers (2024-12-09T11:22:59Z)
- Narrative Analysis of True Crime Podcasts With Knowledge Graph-Augmented Large Language Models [8.78598447041169]
Large language models (LLMs) still struggle with complex narrative arcs as well as narratives containing conflicting information.
Recent work indicates LLMs augmented with external knowledge bases can improve the accuracy and interpretability of the resulting models.
In this work, we analyze the effectiveness of applying knowledge graphs (KGs) in understanding true-crime podcast data.
arXiv Detail & Related papers (2024-11-01T21:49:00Z)
- Failure Modes of LLMs for Causal Reasoning on Narratives [51.19592551510628]
We investigate the causal reasoning abilities of large language models (LLMs) through the representative problem of inferring causal relationships from narratives.
We find that even state-of-the-art language models rely on unreliable shortcuts, both in terms of the narrative presentation and their parametric knowledge.
arXiv Detail & Related papers (2024-10-31T12:48:58Z)
- Can Large Language Models do Analytical Reasoning? [45.69642663863077]
This paper explores the analytical reasoning abilities of cutting-edge large language models on sports data.
We find that GPT-4 stands out in effectiveness, followed by Claude-2.1, with GPT-3.5, Gemini-Pro, and Llama-2-70b lagging behind.
To our surprise, we observe that most models, including GPT-4, struggle to accurately count the total scores for NBA quarters despite showing strong performance in counting NFL quarter scores.
arXiv Detail & Related papers (2024-03-06T20:22:08Z)
- GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations [87.99872683336395]
Large Language Models (LLMs) are integrated into critical real-world applications.
This paper evaluates LLMs' reasoning abilities in competitive environments.
We first propose GTBench, a language-driven environment comprising 10 widely recognized tasks.
arXiv Detail & Related papers (2024-02-19T18:23:36Z)
- SportsMetrics: Blending Text and Numerical Data to Understand Information Fusion in LLMs [43.514367330413144]
We introduce four novel tasks centered around sports data analytics to evaluate the numerical reasoning and information fusion capabilities of LLMs.
These tasks involve providing LLMs with detailed, play-by-play sports game descriptions, then challenging them with adversarial scenarios.
We conduct extensive experiments on NBA and NFL games to assess the performance of LLMs on these tasks.
arXiv Detail & Related papers (2024-02-15T20:26:07Z)
- Exploring the Factual Consistency in Dialogue Comprehension of Large Language Models [51.75805497456226]
This work examines the factual consistency of LLMs in dialogue comprehension through the dialogue summarization task.
Our evaluation shows that, on average, 26.8% of the summaries generated by LLMs contain factual inconsistency.
To stimulate and enhance the dialogue comprehension ability of LLMs, we propose a fine-tuning paradigm with auto-constructed multi-task data.
arXiv Detail & Related papers (2023-11-13T09:32:12Z)
- Concise and Organized Perception Facilitates Reasoning in Large Language Models [32.71672086718057]
We show that large language models (LLMs) exhibit failure patterns akin to human-like cognitive biases when dealing with disordered and irrelevant content in reasoning tasks.
We propose a novel reasoning approach named Concise and Organized Perception (COP).
COP carefully analyzes the given statements to identify the most pertinent information while efficiently eliminating redundancy; a rough sketch of this filtering idea appears after this entry.
arXiv Detail & Related papers (2023-10-05T04:47:49Z)
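As a rough, hypothetical illustration of the COP idea above, a pre-filtering step might rank statements by relevance and drop duplicates before prompting the model. The term-overlap scoring here is a naive stand-in assumption, not the paper's actual method:

```python
def cop_filter(statements, question_terms, max_keep=2):
    """Rank statements by term overlap with the question, dedup, truncate."""
    seen = set()
    scored = []
    for s in statements:
        key = s.lower().strip()
        if key in seen:  # eliminate verbatim redundancy
            continue
        seen.add(key)
        overlap = sum(term in key for term in question_terms)
        scored.append((overlap, s))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _score, s in scored[:max_keep]]

statements = [
    "The suspect was seen at 9 pm.",
    "The suspect was seen at 9 pm.",    # redundant, dropped
    "The weather was mild that week.",  # irrelevant, ranked last
    "A witness placed the suspect downtown.",
]
# Keeps the two most question-relevant, non-redundant statements.
print(cop_filter(statements, question_terms=["suspect", "witness"]))
```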