Benchmarking Large Language Models via Random Variables
- URL: http://arxiv.org/abs/2501.11790v3
- Date: Sat, 15 Mar 2025 09:20:49 GMT
- Title: Benchmarking Large Language Models via Random Variables
- Authors: Zijin Hong, Hao Wu, Su Dong, Junnan Dong, Yilin Xiao, Yujing Zhang, Zhu Wang, Feiran Huang, Linyi Li, Hongxia Yang, Xiao Huang
- Abstract summary: Recent studies have raised concerns about the reliability of current mathematical benchmarks. We propose RV-Bench, a framework for benchmarking large language models via Random Variables in mathematical reasoning. Our findings suggest that LLMs exhibit an imbalance in proficiency between encountered and "unseen" data domains.
- Score: 40.65711363554025
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies have raised concerns about the reliability of current mathematical benchmarks, highlighting issues such as simplistic design and potential data contamination. Therefore, creating a reliable benchmark that effectively evaluates the genuine capabilities of large language models (LLMs) in mathematical reasoning remains a significant challenge. To address this, we propose RV-Bench, a framework for Benchmarking LLMs via Random Variables in mathematical reasoning. Specifically, the background content of a random variable question (RV question) mirrors the original problem in existing benchmarks, but the variable combinations are randomized, making it "unseen" by the LLMs. Models must completely understand the question pattern of the original problem to correctly answer RV questions with various variable values. As a result, the LLM's genuine capability in mathematical reasoning is reflected by its accuracy and robustness on RV-Bench. We conducted extensive experiments on over 30 representative LLMs across more than 1000 RV questions. Our findings suggest that LLMs exhibit an imbalance in proficiency between encountered and "unseen" data domains. Proficiency generalization across similar mathematical reasoning tasks is verified to be limited by accuracy and robustness, but it can still be enhanced through test-time scaling.
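The abstract's core mechanism, keeping a benchmark problem's background fixed while randomizing its variable combination and recomputing the ground-truth answer, can be illustrated with a small sketch. Everything below (the QuestionTemplate class, the example template, and the naive answer matching in evaluate_rv_accuracy) is a hypothetical illustration of the idea, not the authors' released pipeline.

```python
# A minimal sketch of the RV-question idea: keep the background text of a
# benchmark problem fixed, randomize the variable combination, and recompute
# the ground-truth answer so the exact instance is "unseen". All names and the
# example template here are hypothetical, not the authors' released code.
import random
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

@dataclass
class QuestionTemplate:
    text: str                                # background with {named} slots
    sample: Callable[[], Dict[str, int]]     # draws a random variable combination
    solve: Callable[[Dict[str, int]], int]   # ground-truth answer for that combination

    def instantiate(self) -> Tuple[str, int]:
        values = self.sample()
        return self.text.format(**values), self.solve(values)

# Example template mirroring a typical benchmark word problem.
template = QuestionTemplate(
    text="A store sells pencils at {price} cents each. How much do {count} pencils cost, in cents?",
    sample=lambda: {"price": random.randint(2, 9), "count": random.randint(3, 20)},
    solve=lambda v: v["price"] * v["count"],
)

def evaluate_rv_accuracy(ask_llm: Callable[[str], str], n_variants: int = 10) -> float:
    """Accuracy of a model over n randomized variants of the same question pattern."""
    correct = 0
    for _ in range(n_variants):
        question, answer = template.instantiate()
        correct += str(answer) in ask_llm(question)   # naive answer matching, for illustration only
    return correct / n_variants
```

Accuracy aggregated over many such randomized variants, rather than over a single fixed instance, is what the abstract describes as the model's accuracy and robustness on "unseen" variable combinations.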
Related papers
- CounterBench: A Benchmark for Counterfactuals Reasoning in Large Language Models [5.409370027524351]
We evaluate the performance of large language models (LLMs) in counterfactual reasoning.
We introduce a new benchmark dataset, CounterBench, comprising 1K counterfactual reasoning questions.
arXiv Detail & Related papers (2025-02-16T06:19:37Z) - Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis [10.133537818749291]
Large language models (LLMs) have demonstrated significant utilities in real-world applications.
Benchmark evaluations are crucial for assessing the capabilities of LLMs.
arXiv Detail & Related papers (2025-02-13T03:43:33Z) - Are Your LLMs Capable of Stable Reasoning? [38.03049704515947]
Large Language Models (LLMs) have demonstrated remarkable progress in complex reasoning tasks.
However, a significant discrepancy persists between benchmark performances and real-world applications.
We introduce G-Pass@k, a novel evaluation metric that provides a continuous assessment of model performance (a sketch of one such metric appears after this list).
We present LiveMathBench, a dynamic benchmark comprising challenging, contemporary mathematical problems.
arXiv Detail & Related papers (2024-12-17T18:12:47Z) - LLM The Genius Paradox: A Linguistic and Math Expert's Struggle with Simple Word-based Counting Problems [28.72485319617863]
LLMs struggle with some basic tasks that humans find trivial, e.g., counting the number of 'r' characters in the word strawberry.
We measure transferability of advanced mathematical and coding reasoning capabilities from specialized LLMs to simple counting tasks.
Compared with strategies such as finetuning and in-context learning, we show that engaging reasoning is the most robust and efficient way to help LLMs better perceive tasks.
arXiv Detail & Related papers (2024-10-18T04:17:16Z) - Quantifying Prediction Consistency Under Model Multiplicity in Tabular LLMs [10.494477811252034]
Fine-tuning large language models can lead to fine-tuning multiplicity, where equally well-performing models make conflicting predictions on the same inputs.
This raises critical concerns about the robustness and reliability of Tabular LLMs.
This work proposes a novel metric to quantify the robustness of individual predictions without expensive model retraining.
arXiv Detail & Related papers (2024-07-04T22:22:09Z) - LiveBench: A Challenging, Contamination-Limited LLM Benchmark [93.57775429120488]
We release LiveBench, the first benchmark that contains frequently-updated questions from recent information sources.
We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 405B in size.
Questions are added and updated on a monthly basis, and we release new tasks and harder versions of tasks over time.
arXiv Detail & Related papers (2024-06-27T16:47:42Z) - DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks.
One essential and frequently observed piece of evidence is that when math questions are slightly changed, LLMs can behave incorrectly.
This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z) - LLMs May Perform MCQA by Selecting the Least Incorrect Option [29.202758753639078]
Large Language Models (LLMs) have markedly enhanced performance across a variety of tasks. The adoption of Multiple Choice Question Answering (MCQA) as a benchmark for assessing LLMs has gained considerable traction. However, concerns regarding the robustness of this evaluative method persist.
arXiv Detail & Related papers (2024-02-02T12:07:00Z) - NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes [32.154637177467684]
NPHardEval is designed to evaluate the reasoning abilities of Large Language Models (LLMs) across a broad spectrum of 900 questions.
The questions are meticulously chosen to represent a wide range of complexity classes below the NP-hard class.
It is designed with a dynamic update mechanism, where the datapoints are refreshed on a monthly basis.
arXiv Detail & Related papers (2023-12-22T18:07:44Z) - Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z) - Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity [61.54815512469125]
This survey addresses the crucial issue of factuality in Large Language Models (LLMs).
As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital.
arXiv Detail & Related papers (2023-10-11T14:18:03Z) - Fill in the Blank: Exploring and Enhancing LLM Capabilities for Backward Reasoning in Math Word Problems [17.80128896525717]
Backward reasoning is relatively unexplored.
Backward reasoning can be seen as the "inverse" of forward reasoning.
We propose variations of three different forward reasoning strategies to improve performance.
arXiv Detail & Related papers (2023-10-03T12:03:06Z)
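One entry above ("Are Your LLMs Capable of Stable Reasoning?") describes G-Pass@k, a continuous assessment of performance across repeated generations. A minimal sketch of a G-Pass@k-style estimator follows; the hypergeometric formulation and the example numbers are assumptions for illustration and may not match that paper's exact definition.

```python
# A minimal sketch of a G-Pass@k-style stability metric: given n sampled
# generations for a question, c of which are correct, estimate the probability
# that at least ceil(tau * k) of k samples drawn without replacement are
# correct. The hypergeometric formulation is an assumption for illustration.
from math import ceil, comb

def g_pass_at_k(n: int, c: int, k: int, tau: float) -> float:
    """P(at least ceil(tau*k) correct among k of n generations, c correct overall)."""
    threshold = ceil(tau * k)
    hits = sum(comb(c, j) * comb(n - c, k - j) for j in range(threshold, min(c, k) + 1))
    return hits / comb(n, k)

# Example: 16 generations, 10 correct; require at least 75% of 8 drawn samples to be correct.
print(round(g_pass_at_k(n=16, c=10, k=8, tau=0.75), 3))   # ~0.304
```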