Related papers: Can LLM Reasoning Be Trusted? A Comparative Study: Using Human Benchmarking on Statistical Tasks

Can LLM Reasoning Be Trusted? A Comparative Study: Using Human Benchmarking on Statistical Tasks

URL: http://arxiv.org/abs/2601.14479v1
Date: Tue, 20 Jan 2026 21:01:08 GMT
Title: Can LLM Reasoning Be Trusted? A Comparative Study: Using Human Benchmarking on Statistical Tasks
Authors: Crish Nagarkar, Leonid Bogachev, Serge Sharoff,
Abstract summary: Large language models (LLMs) solve statistical tasks, as well as their capacity to assess the quality of reasoning.<n>We have fine-tuned selected open-source LLMs on a specially developed dataset to enhance their statistical reasoning capabilities.<n>Our results show that the fine-tuned models achieve better performance on advanced statistical tasks on the level comparable to a statistics student.
Score: 1.5020330976600735
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper investigates the ability of large language models (LLMs) to solve statistical tasks, as well as their capacity to assess the quality of reasoning. While state-of-the-art LLMs have demonstrated remarkable performance in a range of NLP tasks, their competence in addressing even moderately complex statistical challenges is not well understood. We have fine-tuned selected open-source LLMs on a specially developed dataset to enhance their statistical reasoning capabilities, and compared their performance with the human scores used as a benchmark. Our results show that the fine-tuned models achieve better performance on advanced statistical tasks on the level comparable to a statistics student. Fine-tuning demonstrates architecture-dependent improvements, with some models showing significant performance gains, indicating clear potential for deployment in educational technology and statistical analysis assistance systems. We also show that LLMs themselves can be far better judges of the answers quality (including explanation and reasoning assessment) in comparison to traditional metrics, such as BLEU or BertScore. This self-evaluation capability enables scalable automated assessment for statistical education platforms and quality assurance in automated analysis tools. Potential applications also include validation tools for research methodology in academic and industry settings, and quality control mechanisms for data analysis workflows.

Related papers

Automated Analysis of Sustainability Reports: Using Large Language Models for the Extraction and Prediction of EU Taxonomy-Compliant KPIs [21.656551146954587]
Large Language Models (LLMs) offer a path to automation.<n>We introduce a novel, structured dataset from 190 corporate reports.<n>Our results reveal a clear performance gap between qualitative and quantitative tasks.
arXiv Detail & Related papers (2025-12-30T15:28:03Z)
Importance-Aware Data Selection for Efficient LLM Instruction Tuning [12.894727887191621]
We propose the Model Instruction Weakness Value (MIWV) as a novel metric to quantify the importance of instruction data in enhancing model's capabilities.<n>Our experimental results demonstrate that selecting only the top 1% of data based on MIWV can outperform training on the full dataset.
arXiv Detail & Related papers (2025-11-10T13:06:30Z)
BEATS: Bias Evaluation and Assessment Test Suite for Large Language Models [0.0]
We introduce BEATS, a novel framework for evaluating Bias, Ethics, Fairness, and Factuality in Large Language Models (LLMs)<n>We present a bias benchmark for LLMs that measure performance across 29 distinct metrics.<n>These metrics span a broad range of characteristics, including demographic, cognitive, and social biases, as well as measures of ethical reasoning, group fairness, and factuality related misinformation risk.
arXiv Detail & Related papers (2025-03-31T16:56:52Z)
An Overview of Large Language Models for Statisticians [109.38601458831545]
Large Language Models (LLMs) have emerged as transformative tools in artificial intelligence (AI)<n>This paper explores potential areas where statisticians can make important contributions to the development of LLMs.<n>We focus on issues such as uncertainty quantification, interpretability, fairness, privacy, watermarking and model adaptation.
arXiv Detail & Related papers (2025-02-25T03:40:36Z)
Performance Evaluation of Large Language Models in Statistical Programming [9.333703895770913]
Large language models (LLMs) have revolutionized automatic code generation and opened new avenues for automatic statistical analysis.<n>We assess the performance of LLMs, including two versions of ChatGPT and one version of Llama, in the domain of SAS programming for statistical analysis.<n>We conduct a comprehensive assessment of the quality of SAS code generated by LLMs through human expert evaluation based on correctness, effectiveness, readability, executability, and the accuracy of output results.
arXiv Detail & Related papers (2025-02-18T18:37:15Z)
Meta-Statistical Learning: Supervised Learning of Statistical Inference [59.463430294611626]
This work demonstrates that the tools and principles driving the success of large language models (LLMs) can be repurposed to tackle distribution-level tasks.<n>We propose meta-statistical learning, a framework inspired by multi-instance learning that reformulates statistical inference tasks as supervised learning problems.
arXiv Detail & Related papers (2025-02-17T18:04:39Z)
Self-Evolving Critique Abilities in Large Language Models [59.861013614500024]
This paper explores enhancing critique abilities of Large Language Models (LLMs)<n>We introduce SCRIT, a framework that trains LLMs with self-generated data to evolve their critique abilities.<n>Our analysis reveals that SCRIT's performance scales positively with data and model size.
arXiv Detail & Related papers (2025-01-10T05:51:52Z)
Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets. The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method. The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
Are Large Language Models Good Statisticians? [10.42853117200315]
StatQA is a new benchmark designed for statistical analysis tasks. We show that even state-of-the-art models such as GPT-4o achieve a best performance of only 64.83%. While open-source LLMs show limited capability, those fine-tuned ones exhibit marked improvements.
arXiv Detail & Related papers (2024-06-12T02:23:51Z)
Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps.<n>We show that ReasonEval consistently outperforms baseline methods in the meta-evaluation datasets.<n>We observe that ReasonEval can play a significant role in data selection.
arXiv Detail & Related papers (2024-04-08T17:18:04Z)
Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach [64.42462708687921]
Evaluations have revealed that factors such as scaling, training types, architectures and other factors profoundly impact the performance of LLMs. Our study embarks on a thorough re-examination of these LLMs, targeting the inadequacies in current evaluation methods. This includes the application of ANOVA, Tukey HSD tests, GAMM, and clustering technique.
arXiv Detail & Related papers (2024-03-22T14:47:35Z)
Weak Supervision Performance Evaluation via Partial Identification [46.73061437177238]
Programmatic Weak Supervision (PWS) enables supervised model training without direct access to ground truth labels. We present a novel method to address this challenge by framing model evaluation as a partial identification problem. Our approach derives reliable bounds on key metrics without requiring labeled data, overcoming core limitations in current weak supervision evaluation techniques.
arXiv Detail & Related papers (2023-12-07T07:15:11Z)
From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning [52.257422715393574]
We introduce a self-guided methodology for Large Language Models (LLMs) to autonomously discern and select cherry samples from open-source datasets. Our key innovation, the Instruction-Following Difficulty (IFD) metric, emerges as a pivotal metric to identify discrepancies between a model's expected responses and its intrinsic generation capability.
arXiv Detail & Related papers (2023-08-23T09:45:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.