Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?
- URL: http://arxiv.org/abs/2406.12809v1
- Date: Tue, 18 Jun 2024 17:25:47 GMT
- Title: Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?
- Authors: Zhe Yang, Yichang Zhang, Tianyu Liu, Jian Yang, Junyang Lin, Chang Zhou, Zhifang Sui
- Abstract summary: Large language models (LLMs) have demonstrated impressive capabilities, but still suffer from inconsistency issues.
We develop the ConsisEval benchmark, where each entry comprises a pair of questions with a strict order of difficulty.
We analyze the potential for improvement in consistency via the relative consistency score.
- Score: 65.43882564649721
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have demonstrated impressive capabilities, but still suffer from inconsistency issues (e.g., LLMs can react differently to disturbances like rephrasing or inconsequential order changes). In addition to these inconsistencies, we also observe that LLMs, while capable of solving hard problems, can paradoxically fail at easier ones. To evaluate this hard-to-easy inconsistency, we develop the ConsisEval benchmark, where each entry comprises a pair of questions with a strict order of difficulty. Furthermore, we introduce the concept of consistency score to quantitatively measure this inconsistency and analyze the potential for improvement in consistency via the relative consistency score. Based on comprehensive experiments across a variety of existing models, we find: (1) GPT-4 achieves the highest consistency score of 92.2% but is still inconsistent on specific questions due to distraction by redundant information, misinterpretation of questions, etc.; (2) models with stronger capabilities typically exhibit higher consistency, but exceptions also exist; (3) hard data enhances consistency for both fine-tuning and in-context learning. Our data and code will be publicly available on GitHub.
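To make the metric concrete, below is a minimal sketch (not the paper's released code) of how a hard-to-easy consistency score could be computed from per-pair correctness records. It assumes the consistency score is the empirical probability of solving the easy question given that the paired hard question was solved, and the relative variant shown here is a hypothetical normalization against overall easy-question accuracy; the paper's exact definitions may differ.

```python
# Sketch only: hard-to-easy consistency from (easy_correct, hard_correct) pairs,
# under the assumptions stated above. Not the ConsisEval reference implementation.

def consistency_score(results):
    """results: list of (easy_correct: bool, hard_correct: bool) per question pair.

    Returns the empirical probability of answering the easy question correctly,
    conditioned on the paired hard question being answered correctly.
    """
    easy_given_hard = [easy for easy, hard in results if hard]
    if not easy_given_hard:
        return 0.0
    return sum(easy_given_hard) / len(easy_given_hard)

def relative_consistency_score(results):
    """Hypothetical relative variant: consistency normalized by easy-question accuracy."""
    easy_acc = sum(easy for easy, _ in results) / len(results)
    if easy_acc == 0:
        return 0.0
    return consistency_score(results) / easy_acc

# Example: four pairs; the model solves the hard question three times
# but misses the paired easy question once.
pairs = [(True, True), (False, True), (True, True), (True, False)]
print(consistency_score(pairs))           # 2/3 ≈ 0.667
print(relative_consistency_score(pairs))  # (2/3) / (3/4) ≈ 0.889
```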
Related papers
- Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization [126.27645170941268]
We present Easy2Hard-Bench, a collection of 6 benchmark datasets spanning various domains.
Each problem within these datasets is annotated with numerical difficulty scores.
We provide a comprehensive analysis of LLMs' performance and generalization capabilities across varying levels of difficulty.
arXiv Detail & Related papers (2024-09-27T03:49:56Z)
- CHAMP: A Competition-level Dataset for Fine-Grained Analyses of LLMs' Mathematical Reasoning Capabilities [25.857946070979576]
Concept and Hint-Annotated Math Problems (CHAMP) consists of high school math competition problems annotated with concepts.
This benchmark is difficult, with the best model only scoring 58.1% in standard settings.
We find that models often arrive at the correct final answer through wrong reasoning steps.
arXiv Detail & Related papers (2024-01-13T03:18:16Z)
- The Unreasonable Effectiveness of Easy Training Data for Hard Tasks [84.30018805150607]
We present the surprising conclusion that current pretrained language models often generalize relatively well from easy to hard data.
We demonstrate this kind of easy-to-hard generalization using simple methods like in-context learning, linear heads, and QLoRA.
We conclude that easy-to-hard generalization in LMs is surprisingly strong for the tasks studied.
arXiv Detail & Related papers (2024-01-12T18:36:29Z)
- WorldSense: A Synthetic Benchmark for Grounded Reasoning in Large Language Models [35.088946378980914]
We run our benchmark on three state-of-the-art chat-LLMs (GPT3.5, GPT4 and Llama2-chat).
We show that these models make errors even with as few as three objects.
Errors persist even with chain-of-thought prompting and in-context learning.
arXiv Detail & Related papers (2023-11-27T15:38:17Z)
- Enhancing Large Language Models in Coding Through Multi-Perspective Self-Consistency [127.97467912117652]
Large language models (LLMs) have exhibited remarkable ability in code generation.
However, generating the correct solution in a single attempt remains a challenge.
We propose the Multi-Perspective Self-Consistency (MPSC) framework incorporating both inter- and intra-consistency.
arXiv Detail & Related papers (2023-09-29T14:23:26Z)
- Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs [78.31625291513589]
We argue that self-consistency is an important criterion for valid multi-step reasoning in tasks where the solution is composed of the answers to multiple sub-steps.
We propose two types of self-consistency that are particularly important for multi-step reasoning -- hypothetical consistency and compositional consistency.
We demonstrate that multiple variants of the GPT-3/-4 models exhibit poor consistency rates across both types of consistency on a variety of tasks.
arXiv Detail & Related papers (2023-05-23T17:25:59Z)
- Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models [80.23791222509644]
Inconsistent AI models are considered brittle and untrustworthy by human users.
We find that state-of-the-art vision-language models suffer from a surprisingly high degree of inconsistent behavior across tasks.
We propose a rank correlation-based auxiliary training objective, computed over large automatically created cross-task contrast sets.
arXiv Detail & Related papers (2023-03-28T16:57:12Z)