Are Your LLMs Capable of Stable Reasoning?
- URL: http://arxiv.org/abs/2412.13147v4
- Date: Sun, 01 Jun 2025 06:40:28 GMT
- Title: Are Your LLMs Capable of Stable Reasoning?
- Authors: Junnan Liu, Hongwei Liu, Linchen Xiao, Ziyi Wang, Kuikun Liu, Songyang Gao, Wenwei Zhang, Songyang Zhang, Kai Chen
- Abstract summary: We introduce G-Pass@$k$, a novel evaluation metric that continuously assesses model performance across multiple sampling attempts. We employ G-Pass@$k$ in conjunction with state-of-the-art large language models to provide comprehensive insights into their potential capabilities and operational consistency.
- Score: 38.03049704515947
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid advancement of large language models (LLMs) has shown remarkable progress in complex reasoning tasks. However, a significant disparity exists between benchmark performances and real-world applications. We attribute this gap primarily to current evaluation protocols and metrics, which inadequately capture the full spectrum of LLM capabilities, especially in complex reasoning tasks where both accuracy and consistency are essential. In this paper, we introduce G-Pass@$k$, a novel evaluation metric that continuously assesses model performance across multiple sampling attempts, quantifying both the model's performance potential and its stability. Through extensive experiments on various public and newly constructed benchmarks, we employ G-Pass@$k$ in conjunction with state-of-the-art large language models to provide comprehensive insights into their potential capabilities and operational consistency. Our findings reveal a significant opportunity to enhance the realistic reasoning abilities of LLMs, underscoring the necessity for more robust evaluation metrics.
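The abstract does not state the formula for G-Pass@$k$. As a minimal sketch, assume the metric follows the same sampling-without-replacement idea as pass@k: given n generations for a problem, of which c are correct, score the probability that at least ⌈τ·k⌉ of k drawn samples are correct. The function name `g_pass_at_k`, the threshold parameter `tau`, and the hypergeometric form itself are assumptions made for illustration; they are not taken from the text above.

```python
from math import ceil, comb


def g_pass_at_k(n: int, c: int, k: int, tau: float) -> float:
    """Hypothetical per-problem G-Pass@k score (assumed hypergeometric form).

    n   -- number of generations sampled for the problem
    c   -- number of those generations that are correct
    k   -- number of samples drawn without replacement from the n
    tau -- fraction of the k draws that must be correct, in (0, 1]
    """
    if not (0 <= c <= n) or not (1 <= k <= n) or not (0.0 < tau <= 1.0):
        raise ValueError("require 0 <= c <= n, 1 <= k <= n, 0 < tau <= 1")

    # Minimum number of correct draws; the small epsilon guards against
    # floating-point noise such as 0.7 * 10 == 7.000000000000001.
    need = max(1, ceil(tau * k - 1e-9))

    # Hypergeometric tail: P(at least `need` correct among k draws).
    favourable = sum(
        comb(c, j) * comb(n - c, k - j)
        for j in range(need, min(c, k) + 1)
    )
    return favourable / comb(n, k)


if __name__ == "__main__":
    # One problem, 16 generations, 12 of them correct, 8 draws.
    for tau in (0.25, 0.5, 0.75, 1.0):
        print(f"tau={tau:.2f}  score={g_pass_at_k(16, 12, 8, tau):.4f}")
```

Under this assumed form, a small `tau` reduces to the familiar pass@k (at least one correct draw), while `tau = 1.0` requires every draw to be correct. Sweeping `tau` between those extremes is one way to see potential and stability in a single curve; a benchmark score would average the per-problem values.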
Related papers
- UniToMBench: Integrating Perspective-Taking to Improve Theory of Mind in LLMs [1.4304078520604593]
Theory of Mind (ToM) remains a challenging area for large language models (LLMs). In this paper, we introduce UniToMBench, a unified benchmark that integrates the strengths of SimToM and TOMBENCH.
arXiv Detail & Related papers (2025-06-11T06:55:40Z)
- EIFBENCH: Extremely Complex Instruction Following Benchmark for Large Language Models [65.48902212293903]
We present the Extremely Complex Instruction Following Benchmark (EIFBENCH) for evaluating large language models (LLMs). EIFBENCH includes multi-task scenarios that enable comprehensive assessment across diverse task types concurrently. We also propose the Segment Policy Optimization (SegPO) algorithm to enhance the LLM's ability to accurately fulfill multi-task workflow.
arXiv Detail & Related papers (2025-06-10T02:39:55Z)
- IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios. Agent performance is judged by comparing its final numerical output to the human-derived baseline. Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z)
- Revisiting LLM Evaluation through Mechanism Interpretability: a New Metric and Model Utility Law [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications.
We propose a novel metric, the Model Utilization Index (MUI), which introduces mechanism interpretability techniques to complement traditional performance metrics.
arXiv Detail & Related papers (2025-04-10T04:09:47Z)
- Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions [8.069858557211132]
Large Language Models (LLMs) have shown remarkable capabilities across various tasks. Their deployment in high-stakes domains requires consistent and coherent behavior across multiple rounds of user interaction. This paper introduces a comprehensive framework for evaluating and improving LLM response consistency.
arXiv Detail & Related papers (2025-03-28T11:49:56Z)
- Line Goes Up? Inherent Limitations of Benchmarks for Evaluating Large Language Models [0.0]
I argue that inherent limitations with the benchmarking paradigm render benchmark performance highly unsuitable as a metric for generalisable competence over cognitive tasks. I conclude that benchmark performance should not be used as a reliable indicator of general LLM cognitive capabilities.
arXiv Detail & Related papers (2025-02-20T07:13:29Z)
- Evaluating and Advancing Multimodal Large Language Models in Perception Ability Lens [30.083110119139793]
We introduce AbilityLens, a unified benchmark designed to evaluate MLLMs in six key perception abilities. We identify the strengths and weaknesses of current mainstream MLLMs, highlighting stability patterns and revealing a notable performance gap between state-of-the-art open-source and closed-source models.
arXiv Detail & Related papers (2024-11-22T04:41:20Z)
- FVEval: Understanding Language Model Capabilities in Formal Verification of Digital Hardware [4.480157114854711]
We present FVEval, the first comprehensive benchmark for characterizing large language model (LLM) performance in tasks pertaining to formal verification (FV).
The benchmark consists of three sub-tasks that measure LLM capabilities at different levels.
We present both collections of expert-written verification collateral and methodologies to scalably generate synthetic examples aligned with FV.
arXiv Detail & Related papers (2024-10-15T21:48:57Z)
- HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly [34.205934899868346]
We present HELMET, a comprehensive benchmark encompassing seven diverse, application-centric categories.
We find that synthetic tasks like NIAH are not good predictors of downstream performance.
While most LCLMs achieve perfect NIAH scores, open-source models significantly lag behind closed ones when the task requires full-context reasoning.
arXiv Detail & Related papers (2024-10-03T17:20:11Z)
- DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.
Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions [10.28688988951815]
UBENCH is a benchmark for evaluating large language models.
It includes 3,978 multiple-choice questions covering knowledge, language, understanding, and reasoning abilities.
We also evaluate the reliability of 15 popular LLMs, finding GLM4 to be the most outstanding.
arXiv Detail & Related papers (2024-06-18T16:50:38Z)
- On the Worst Prompt Performance of Large Language Models [93.13542053835542]
Performance of large language models (LLMs) is acutely sensitive to the phrasing of prompts.
We introduce RobustAlpacaEval, a new benchmark that consists of semantically equivalent case-level queries.
Experiments on RobustAlpacaEval with ChatGPT and six open-source LLMs from the Llama, Mistral, and Gemma families uncover substantial variability in model performance.
arXiv Detail & Related papers (2024-06-08T13:40:38Z)
- GroundCocoa: A Benchmark for Evaluating Compositional & Conditional Reasoning in Language Models [14.108788704400643]
GroundCocoa is a lexically diverse benchmark connecting these reasoning skills to the real-world problem of flight booking.
Our task involves aligning detailed user preferences with available flight options presented in a multiple-choice format.
Results indicate a significant disparity in performance among current state-of-the-art LLMs with even the best performing model, GPT-4 Turbo, not exceeding 67% accuracy despite advanced prompting techniques.
arXiv Detail & Related papers (2024-04-05T17:36:26Z)
- Dynamic Evaluation of Large Language Models by Meta Probing Agents [44.20074234421295]
We propose meta probing agents (MPA) to evaluate large language models (LLMs).
MPA is the key component of DyVal 2, which naturally extends the previous DyVal.
MPA designs the probing and judging agents to automatically transform an original evaluation problem into a new one following psychometric theory.
arXiv Detail & Related papers (2024-02-21T06:46:34Z)
- NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes [32.154637177467684]
NPHardEval is designed to evaluate the reasoning abilities of Large Language Models (LLMs) across a broad spectrum of 900 questions.
The questions are meticulously chosen to represent a wide range of complexity classes below the NP-hard complexity class.
It is designed with a dynamic update mechanism, where the datapoints are refreshed on a monthly basis.
arXiv Detail & Related papers (2023-12-22T18:07:44Z)
- TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
- Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
- Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models [61.28463542324576]
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can generate human-like outputs.
We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency.
We propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs.
arXiv Detail & Related papers (2023-09-08T17:49:44Z)
- Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.