Can Large Language Models do Analytical Reasoning?
- URL: http://arxiv.org/abs/2403.04031v1
- Date: Wed, 6 Mar 2024 20:22:08 GMT
- Title: Can Large Language Models do Analytical Reasoning?
- Authors: Yebowen Hu, Kaiqiang Song, Sangwoo Cho, Xiaoyang Wang, Hassan Foroosh,
Dong Yu, Fei Liu
- Abstract summary: This paper explores the analytical reasoning capabilities of cutting-edge Large Language Models on sports data.
We find that GPT-4 stands out in effectiveness, followed by Claude-2.1, with GPT-3.5, Gemini-Pro, and Llama-2-70b lagging behind.
To our surprise, we observe that most models, including GPT-4, struggle to accurately count the total scores for NBA quarters despite showing strong performance in counting NFL quarter scores.
- Score: 45.69642663863077
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper explores the analytical reasoning capabilities of
cutting-edge Large Language Models on sports data. Our analytical reasoning
task asks large language models to count how many points each team scores in
each quarter of NBA and NFL games. Our major findings are twofold. First,
among all the models we evaluated, GPT-4 stands out in effectiveness, followed
by Claude-2.1, with GPT-3.5, Gemini-Pro, and Llama-2-70b lagging behind.
Specifically, we compare three different prompting techniques and a
divide-and-conquer approach, and find that the latter is the most effective.
Our divide-and-conquer approach breaks play-by-play data down into smaller,
more manageable segments, solves each segment individually, and then
aggregates the results. Besides the divide-and-conquer approach, we also
explore the
Chain of Thought (CoT) strategy, which markedly improves outcomes for certain
models, notably GPT-4 and Claude-2.1, with their accuracy rates increasing
significantly. However, the CoT strategy has negligible or even detrimental
effects on the performance of other models like GPT-3.5 and Gemini-Pro.
Secondly, to our surprise, we observe that most models, including GPT-4,
struggle to accurately count the total scores for NBA quarters despite showing
strong performance in counting NFL quarter scores. This leads us to further
investigate the factors that impact the complexity of analytical reasoning
tasks with extensive experiments, through which we conclude that task
complexity depends on the length of context, the information density, and the
presence of related information. Our research provides valuable insights into
the complexity of analytical reasoning tasks and potential directions for
developing future large language models.
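
To make the divide-and-conquer approach described in the abstract concrete, the sketch below shows one way such a pipeline could be structured. This is a minimal illustration, not the authors' released code: the `query_llm` helper (a stand-in for any chat-completion call that returns the model's text), the prompt wording, and the default segment size are all assumptions.

```python
# Hypothetical sketch of a divide-and-conquer scoring pipeline: split the
# play-by-play log into small segments, query the model per segment, and sum
# the per-segment points outside the model. `query_llm` is an assumed helper,
# not an API from the paper.
import re
from typing import Callable, Dict, List


def chunk(plays: List[str], size: int) -> List[List[str]]:
    """Split a play-by-play log into fixed-size segments."""
    return [plays[i:i + size] for i in range(0, len(plays), size)]


def score_segment(segment: List[str], team: str,
                  query_llm: Callable[[str], str]) -> int:
    """Ask the model how many points `team` scored within one segment."""
    prompt = (
        "Below is a fragment of a game's play-by-play log:\n"
        + "\n".join(segment)
        + f"\n\nHow many points did {team} score in this fragment? "
        "Answer with a single integer."
    )
    reply = query_llm(prompt)
    # Take the first integer in the reply as the segment score.
    match = re.search(r"-?\d+", reply)
    return int(match.group()) if match else 0


def quarter_score(plays: List[str], teams: List[str],
                  query_llm: Callable[[str], str],
                  segment_size: int = 20) -> Dict[str, int]:
    """Aggregate per-segment answers into per-team quarter totals."""
    totals = {team: 0 for team in teams}
    for segment in chunk(plays, segment_size):
        for team in teams:
            totals[team] += score_segment(segment, team, query_llm)
    return totals
```

The point of this structure is that the model only ever reasons over a short, information-dense segment, while the final aggregation is performed deterministically in code rather than by the model.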
Related papers
- Evaluating the Performance of Large Language Models for SDG Mapping (Technical Report) [6.789534723913505]
Large language models (LLMs) enable users to protect data privacy by eliminating the need to provide data to third parties.
We compare the performance of various language models on the Sustainable Development Goal mapping task.
According to the results of this study, LLaMA 2 and Gemma still have significant room for improvement.
arXiv Detail & Related papers (2024-08-05T03:05:02Z)
- When Reasoning Meets Information Aggregation: A Case Study with Sports Narratives [46.04238534224658]
We examine the critical role of information aggregation in reasoning by requiring the LLM to analyze sports narratives.
We conduct comprehensive experiments with real NBA basketball data and present SportsGen, a new method to synthesize game narratives.
Our findings show that most models, including GPT-4o, often fail to accurately aggregate basketball scores due to frequent scoring patterns.
arXiv Detail & Related papers (2024-06-17T20:49:35Z)
- GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents [4.209869303518743]
We introduce GameBench, a cross-domain benchmark for evaluating strategic reasoning abilities of large language models.
Our evaluations use GPT-3 and GPT-4 in their base form along with two scaffolding frameworks designed to enhance strategic reasoning ability: Chain-of-Thought (CoT) prompting and Reasoning Via Planning (RAP).
Our results show that none of the tested models match human performance, and at worst GPT-4 performs worse than random action.
arXiv Detail & Related papers (2024-06-07T00:28:43Z)
- MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models [70.92847554971065]
We introduce MT-Eval, a comprehensive benchmark designed to evaluate multi-turn conversational abilities.
By analyzing human-LLM conversations, we categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up.
Our evaluation of 11 well-known LLMs shows that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo in specific tasks.
arXiv Detail & Related papers (2024-01-30T04:50:28Z)
- Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases [98.35348038111508]
This paper presents an in-depth comparative study of two pioneering models: Google's Gemini and OpenAI's GPT-4V(ision).
The core of our analysis delves into the distinct visual comprehension abilities of each model.
Our findings illuminate the unique strengths and niches of both models.
arXiv Detail & Related papers (2023-12-22T18:59:58Z)
- Enhancing Large Language Models in Coding Through Multi-Perspective Self-Consistency [127.97467912117652]
Large language models (LLMs) have exhibited remarkable ability in code generation.
However, generating a correct solution in a single attempt remains a challenge.
We propose the Multi-Perspective Self-Consistency (MPSC) framework incorporating both inter- and intra-consistency.
arXiv Detail & Related papers (2023-09-29T14:23:26Z)
- GPT-3.5, GPT-4, or BARD? Evaluating LLMs Reasoning Ability in Zero-Shot Setting and Performance Boosting Through Prompts [0.0]
Large Language Models (LLMs) have exhibited remarkable performance on various Natural Language Processing (NLP) tasks.
In this paper, we examine the performance of GPT-3.5, GPT-4, and BARD models, by performing a thorough technical evaluation on different reasoning tasks.
arXiv Detail & Related papers (2023-05-21T14:45:17Z)
- Exploring the Trade-Offs: Unified Large Language Models vs Local Fine-Tuned Models for Highly-Specific Radiology NLI Task [49.50140712943701]
We evaluate the performance of ChatGPT/GPT-4 on a radiology NLI task and compare it to other models fine-tuned specifically on task-related data samples.
We also conduct a comprehensive investigation on ChatGPT/GPT-4's reasoning ability by introducing varying levels of inference difficulty.
arXiv Detail & Related papers (2023-04-18T17:21:48Z)