Performance Evaluation of Large Language Models in Statistical Programming
- URL: http://arxiv.org/abs/2502.13117v1
- Date: Tue, 18 Feb 2025 18:37:15 GMT
- Title: Performance Evaluation of Large Language Models in Statistical Programming
- Authors: Xinyi Song, Kexin Xie, Lina Lee, Ruizhe Chen, Jared M. Clark, Hao He, Haoran He, Jie Min, Xinlei Zhang, Simin Zheng, Zhiyang Zhang, Xinwei Deng, Yili Hong
- Abstract summary: Large language models (LLMs) have revolutionized automatic code generation and opened new avenues for automatic statistical analysis. We assess the performance of LLMs, including two versions of ChatGPT and one version of Llama, in the domain of SAS programming for statistical analysis. We conduct a comprehensive assessment of the quality of SAS code generated by LLMs through human expert evaluation based on correctness, effectiveness, readability, executability, and the accuracy of output results.
- Score: 9.333703895770913
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The programming capabilities of large language models (LLMs) have revolutionized automatic code generation and opened new avenues for automatic statistical analysis. However, the validity and quality of these generated codes need to be systematically evaluated before they can be widely adopted. Despite their growing prominence, a comprehensive evaluation of statistical code generated by LLMs remains scarce in the literature. In this paper, we assess the performance of LLMs, including two versions of ChatGPT and one version of Llama, in the domain of SAS programming for statistical analysis. Our study utilizes a set of statistical analysis tasks encompassing diverse statistical topics and datasets. Each task includes a problem description, dataset information, and human-verified SAS code. We conduct a comprehensive assessment of the quality of SAS code generated by LLMs through human expert evaluation based on correctness, effectiveness, readability, executability, and the accuracy of output results. The analysis of rating scores reveals that while LLMs demonstrate usefulness in generating syntactically correct code, they struggle with tasks requiring deep domain understanding and may produce redundant or incorrect results. This study offers valuable insights into the capabilities and limitations of LLMs in statistical programming, providing guidance for future advancements in AI-assisted coding systems for statistical analysis.
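The abstract reports expert ratings on five criteria (correctness, effectiveness, readability, executability, and output accuracy) but does not specify the rating scale or how the scores are aggregated. The sketch below illustrates one possible per-model aggregation, assuming a numeric rubric and simple averaging; all field names and values are hypothetical and not taken from the paper.

```python
# Hypothetical sketch: averaging expert rubric scores per LLM.
# The paper does not state the scale or aggregation method; a numeric
# rubric, simple averaging, and all field names here are assumptions.
from collections import defaultdict
from statistics import mean

CRITERIA = ["correctness", "effectiveness", "readability",
            "executability", "output_accuracy"]

def summarize_ratings(ratings):
    """Average each criterion per model.

    ratings: list of dicts such as
      {"model": "ChatGPT 4.0", "task": "t01", "correctness": 4, ...}
    Returns {model: {criterion: mean score}}.
    """
    by_model = defaultdict(lambda: defaultdict(list))
    for r in ratings:
        for c in CRITERIA:
            by_model[r["model"]][c].append(r[c])
    return {m: {c: round(mean(v), 2) for c, v in crits.items()}
            for m, crits in by_model.items()}

# Example with made-up scores:
example = [
    {"model": "ChatGPT 4.0", "task": "t01", "correctness": 4,
     "effectiveness": 4, "readability": 5, "executability": 5,
     "output_accuracy": 3},
    {"model": "Llama 3.1", "task": "t01", "correctness": 3,
     "effectiveness": 3, "readability": 4, "executability": 4,
     "output_accuracy": 2},
]
print(summarize_ratings(example))
```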
Related papers
- Benchmarking Large Language Models for Multi-Language Software Vulnerability Detection [15.026084450436976]
We present a study evaluating the performance of large language models (LLMs) on the software vulnerability detection task.
We have compiled a dataset comprising 8,260 vulnerable functions in Python, 7,505 in Java, and 28,983 in JavaScript.
These LLMs are benchmarked against five fine-tuned small language models and two open-source static application security testing tools.
arXiv Detail & Related papers (2025-03-03T11:56:00Z) - StatLLM: A Dataset for Evaluating the Performance of Large Language Models in Statistical Analysis [2.5541378136265047]
StatLLM is an open-source dataset for evaluating the performance of large language models in statistical analysis.
The first component includes statistical analysis tasks spanning a variety of analyses and datasets.
The second component features SAS code generated by ChatGPT 3.5, ChatGPT 4.0, and Llama 3.1 for those tasks.
The third component contains evaluation scores from human experts in assessing the correctness, effectiveness, readability, executability, and output accuracy of the LLM-generated code.
arXiv Detail & Related papers (2025-02-24T21:11:20Z) - An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science [5.064778712920176]
Large Language Models (LLMs) have demonstrated potential for data science tasks via code generation.
We propose a novel analyst-inspector framework to automatically evaluate and enforce the reproducibility of LLM-generated data science workflows.
arXiv Detail & Related papers (2025-02-23T01:15:50Z) - SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors [0.0]
Large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks, such as code understanding and code generation.
However, an equally important yet underexplored question is whether LLMs can serve as general-purpose surrogate code executors.
This study provides empirical insights into the feasibility of using LLMs as surrogate code executors.
arXiv Detail & Related papers (2025-02-16T15:38:19Z) - Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.<n>We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.<n>We propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark.
arXiv Detail & Related papers (2024-10-24T17:56:08Z) - A Survey on Evaluating Large Language Models in Code Generation Tasks [30.256255254277914]
This paper provides a comprehensive review of the current methods and metrics used to evaluate the performance of Large Language Models (LLMs) in code generation tasks.
With the rapid growth in demand for automated software development, LLMs have demonstrated significant potential in the field of code generation.
arXiv Detail & Related papers (2024-08-29T12:56:06Z) - SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z) - Are Large Language Models Good Statisticians? [10.42853117200315]
StatQA is a new benchmark designed for statistical analysis tasks.
We show that even state-of-the-art models such as GPT-4o achieve a best performance of only 64.83%.
While open-source LLMs show limited capability, fine-tuned ones exhibit marked improvements.
arXiv Detail & Related papers (2024-06-12T02:23:51Z) - Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps.
We show that ReasonEval consistently outperforms baseline methods in the meta-evaluation datasets.
We observe that ReasonEval can play a significant role in data selection.
arXiv Detail & Related papers (2024-04-08T17:18:04Z) - Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data [89.2410799619405]
We introduce the Quantitative Reasoning with Data benchmark to evaluate Large Language Models' capability in statistical and causal reasoning with real-world data.
The benchmark comprises a dataset of 411 questions accompanied by data sheets from textbooks, online learning materials, and academic papers.
To compare models' quantitative reasoning abilities on data and text, we enrich the benchmark with an auxiliary set of 290 text-only questions, namely QRText.
arXiv Detail & Related papers (2024-02-27T16:15:03Z) - CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z) - How Does Naming Affect LLMs on Code Analysis Tasks? [8.150719423943109]
Large Language Models (LLMs) were proposed for natural language processing (NLP) and have shown promising results as general-purpose language models.
This paper investigates how naming affects LLMs on code analysis tasks.
arXiv Detail & Related papers (2023-07-24T02:38:24Z) - Sentiment Analysis in the Era of Large Language Models: A Reality Check [69.97942065617664]
This paper investigates the capabilities of large language models (LLMs) in performing various sentiment analysis tasks.
We evaluate performance across 13 tasks on 26 datasets and compare the results against small language models (SLMs) trained on domain-specific datasets.
arXiv Detail & Related papers (2023-05-24T10:45:25Z)
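The StatLLM entry above describes a dataset with three linked components: statistical analysis tasks, SAS code generated by three LLMs, and expert evaluation scores. A minimal sketch of how such records could be represented is given below; the TypedDict layout and every field name are assumptions for illustration, not the dataset's actual schema.

```python
# Hypothetical record layout for a StatLLM-style evaluation dataset.
# All type names and field names are assumptions, not the real schema.
from typing import TypedDict

class StatTask(TypedDict):
    task_id: str        # identifier of the statistical analysis task
    description: str    # problem description given to the model
    dataset_info: str   # summary of the accompanying dataset
    reference_sas: str  # human-verified SAS code for the task

class GeneratedCode(TypedDict):
    task_id: str
    model: str          # e.g. "ChatGPT 3.5", "ChatGPT 4.0", "Llama 3.1"
    sas_code: str       # SAS code produced by the model

class ExpertScores(TypedDict):
    task_id: str
    model: str
    correctness: int
    effectiveness: int
    readability: int
    executability: int
    output_accuracy: int
```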