GPT-4.1 Sets the Standard in Automated Experiment Design Using Novel Python Libraries
- URL: http://arxiv.org/abs/2508.00033v1
- Date: Wed, 30 Jul 2025 13:11:29 GMT
- Title: GPT-4.1 Sets the Standard in Automated Experiment Design Using Novel Python Libraries
- Authors: Nuno Fachada, Daniel Fernandes, Carlos M. Fernandes, Bruno D. Ferreira-Saraiva, João P. Matos-Carvalho
- Abstract summary: Large Language Models (LLMs) have advanced rapidly as tools for automating code generation in scientific research. This study systematically benchmarks a selection of state-of-the-art LLMs in generating functional Python code for two increasingly challenging scenarios.
- Score: 0.7905066238005297
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large Language Models (LLMs) have advanced rapidly as tools for automating code generation in scientific research, yet their ability to interpret and use unfamiliar Python APIs for complex computational experiments remains poorly characterized. This study systematically benchmarks a selection of state-of-the-art LLMs in generating functional Python code for two increasingly challenging scenarios: conversational data analysis with the ParShift library, and synthetic data generation and clustering using pyclugen and scikit-learn. Both experiments use structured, zero-shot prompts specifying detailed requirements but omitting in-context examples. Model outputs are evaluated quantitatively for functional correctness and prompt compliance over multiple runs, and qualitatively by analyzing the errors produced when code execution fails. Results show that only a small subset of models consistently generates correct, executable code, with GPT-4.1 standing out as the only model to always succeed in both tasks. In addition to benchmarking LLM performance, this approach helps identify shortcomings in third-party libraries, such as unclear documentation or obscure implementation bugs. Overall, these findings highlight current limitations of LLMs for end-to-end scientific automation and emphasize the need for careful prompt design, comprehensive library documentation, and continued advances in language model capabilities.
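The evaluation pipeline the abstract describes (zero-shot prompt, repeated generation, execution check) can be sketched in a few lines. Below is a minimal, hypothetical harness: query_model is a placeholder for whatever API client was actually used, and the prompt text and pass/fail criterion are illustrative assumptions, not the paper's artifacts.

```python
# Minimal sketch of the benchmarking loop described in the abstract.
# `query_model` is a hypothetical placeholder, not the authors' code;
# the prompt and the success criterion are likewise illustrative.
import subprocess
import sys
import tempfile

def query_model(model: str, prompt: str) -> str:
    """Placeholder LLM call; substitute a real API client here."""
    raise NotImplementedError

PROMPT = (
    "Generate a complete, executable Python script that creates synthetic "
    "clustered data and clusters it with scikit-learn, printing the "
    "adjusted Rand index. Return only code."
)

def run_once(model: str) -> bool:
    """One trial: generate code, execute it, record whether it ran cleanly."""
    code = query_model(model, PROMPT)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        # Functional correctness here is just "runs without raising";
        # the paper additionally checks compliance with the prompt's requirements.
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=120)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

def success_rate(model: str, runs: int = 10) -> float:
    """Aggregate functional correctness over multiple independent runs."""
    return sum(run_once(model) for _ in range(runs)) / runs
```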
Related papers
- MRG-Bench: Evaluating and Exploring the Requirements of Context for Repository-Level Code Generation [0.7342677574855649]
We introduce MRG-Bench, a novel dataset that provides a more accurate evaluation of large language models. We conduct experiments covering large language models, long-context models, and RAG-related methods. Results show that the majority of methods suffer from "difficulty in understanding user requirements," failing to comprehend their assigned tasks accurately.
arXiv Detail & Related papers (2025-08-05T01:53:45Z) - Automated Generation of Commit Messages in Software Repositories [0.7366405857677226]
Commit messages are crucial for documenting software changes, aiding in program comprehension and maintenance. Our research presents an automated approach to generate commit messages using Machine Learning (ML) and Natural Language Processing (NLP). We used the dataset of code changes and corresponding commit messages that was used by Liu et al.
arXiv Detail & Related papers (2025-04-17T15:08:05Z) - Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs). We find that fine-tuning text embedding models on LLM-generated texts yields excellent classification accuracy. We leverage LLMs as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
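The classification result summarized above is easy to approximate with off-the-shelf tools. The sketch below swaps the paper's fine-tuned embedding models for a simple TF-IDF and logistic-regression pipeline, and the sample texts are invented placeholders rather than real model outputs:

```python
# Simplified stand-in for the paper's setup: the authors fine-tune text
# embedding models, whereas this sketch uses TF-IDF features with logistic
# regression. All sample texts below are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts = [
    "Certainly! Here is a step-by-step explanation of the concept.",
    "Certainly! Let me walk you through the main idea first.",
    "Certainly! Below is a concise summary of the key points.",
    "Certainly! I will outline the approach in three steps.",
    "Sure, the short answer is yes, and here is the reasoning.",
    "Sure, this boils down to a trade-off between speed and cost.",
    "Sure, think of it as a pipeline with three separate stages.",
    "Sure, the main thing to understand is the data layout.",
]
labels = ["model_a"] * 4 + ["model_b"] * 4  # which LLM "wrote" each text

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
# High held-out accuracy indicates each model leaves a detectable "fingerprint".
print(clf.score(X_test, y_test))
```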
arXiv Detail & Related papers (2025-02-17T18:59:02Z) - FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data [13.108807408880645]
We propose a novel approach for synthetic data generation, CG2C, that leverages multi-hop reasoning on context graphs extracted from documents. Our fact checker model, FactCG, demonstrates improved performance with more connected reasoning, using the same backbone models.
arXiv Detail & Related papers (2025-01-28T18:45:07Z) - SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists [59.08999823652293]
We propose SYNTHEVAL to generate a wide range of test types for a comprehensive evaluation of NLP models.
In the last stage, human experts investigate the challenging examples, manually design templates, and identify the types of failures the task-specific models consistently exhibit.
We apply SYNTHEVAL to two classification tasks, sentiment analysis and toxic language detection, and show that our framework is effective in identifying weaknesses of strong models on these tasks.
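For intuition, template-based behavioral tests of this kind can be as simple as the following toy sketch; the template, lexicon, and expected label are invented for illustration and are not SYNTHEVAL's actual artifacts:

```python
# Toy behavioral test in the template-filling style the summary describes.
# Template, lexicon, and expected label are invented placeholders.
import itertools

template = "I {verb} this {thing}."
lexicon = {"verb": ["love", "adore"], "thing": ["movie", "book"]}
expected = "positive"

tests = [template.format(verb=v, thing=t)
         for v, t in itertools.product(lexicon["verb"], lexicon["thing"])]

def run_suite(predict):
    """`predict` maps text -> label; any sentiment classifier fits here."""
    failures = [t for t in tests if predict(t) != expected]
    return failures  # consistent failures on a template reveal a weakness
```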
arXiv Detail & Related papers (2024-08-30T17:41:30Z) - SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
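In outline, the multi-stage mechanism reduces to a simple loop. This sketch is schematic: generate and finetune are hypothetical placeholders for the student model's generation and training interfaces, and the single emptiness check stands in for the paper's more elaborate quality filtering:

```python
# Schematic of the SELF-GUIDE idea as summarized above. `generate` and
# `finetune` are hypothetical placeholders, not the paper's actual API.
from typing import Callable

def self_guide(generate: Callable[[str], str],
               finetune: Callable[[list[tuple[str, str]]], None],
               task_instruction: str,
               n_pairs: int = 100) -> None:
    pairs = []
    for _ in range(n_pairs):
        # Stage 1: the student model invents a plausible task input.
        x = generate(f"Write one example input for this task:\n{task_instruction}")
        # Stage 2: the same model answers its own input.
        y = generate(f"{task_instruction}\nInput: {x}\nOutput:")
        # Stage 3: crude quality filter (the paper filters more carefully).
        if x.strip() and y.strip():
            pairs.append((x, y))
    # Finetune the student model on its own synthetic pairs.
    finetune(pairs)
```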
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
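A heavily simplified rendition of that perturbation step might look like the sketch below; the toy graph and the node-insertion rule are invented for illustration, whereas DARG's actual extraction and regeneration steps are LLM-driven:

```python
# Illustrative graph perturbation in the spirit of the DARG summary above.
# The toy reasoning graph and insertion rule are invented placeholders.
import random
import networkx as nx

def perturb(graph: nx.DiGraph, extra_steps: int = 1) -> nx.DiGraph:
    """Deepen a reasoning graph by attaching new intermediate steps."""
    g = graph.copy()
    for _ in range(extra_steps):
        new_node = f"step_{g.number_of_nodes()}"
        anchor = random.choice(list(g.nodes))
        g.add_edge(anchor, new_node)  # longer chains => higher complexity
    return g

# A toy two-step reasoning graph: premise -> computation -> answer.
g = nx.DiGraph([("given: x = 3", "compute: y = 2 * x"),
                ("compute: y = 2 * x", "answer: y = 6")])
harder = perturb(g, extra_steps=2)
print(harder.number_of_nodes())  # grows with the injected reasoning steps
```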
arXiv Detail & Related papers (2024-06-25T04:27:53Z) - Automatic Generation and Evaluation of Reading Comprehension Test Items with Large Language Models [1.565361244756411]
This paper explores how large language models (LLMs) can be used to generate and evaluate reading comprehension items.
We developed a protocol for human and automatic evaluation, including a metric we call text informativity.
Our results suggest that both models are capable of generating items of acceptable quality in a zero-shot setting, but GPT-4 clearly outperforms Llama 2.
arXiv Detail & Related papers (2024-04-11T13:11:21Z) - Beyond Traditional Benchmarks: Analyzing Behaviors of Open LLMs on Data-to-Text Generation [0.0]
We analyze the behaviors of open large language models (LLMs) on the task of data-to-text (D2T) generation.
We find that open LLMs can generate fluent and coherent texts in zero-shot settings from data in common formats collected with Quintd.
arXiv Detail & Related papers (2024-01-18T18:15:46Z) - ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z) - ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and effective at triggering hallucinations in large language models.
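In outline, such a prompt chain needs only two calls per test case. The sketch below is a guess at the shape of the chain, not ReEval's implementation; call_llm and both prompt texts are hypothetical placeholders:

```python
# Hypothetical two-step prompt chain in the style the summary describes.
# `call_llm` and the prompt wording are placeholders, not ReEval's code.
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # substitute a real API client here

def make_test_case(question: str, evidence: str) -> dict:
    # Step 1: perturb the retrieved evidence while keeping it fluent.
    perturbed = call_llm(
        "Rewrite the following evidence so it no longer supports the "
        f"original answer, keeping it natural and coherent:\n{evidence}")
    # Step 2: chain a second prompt to vet the rewrite's readability.
    verdict = call_llm(
        f"Is this passage coherent and human-readable? Answer yes or no:\n{perturbed}")
    if not verdict.strip().lower().startswith("yes"):
        return {}
    # A RAG system that still returns the original answer from the perturbed
    # evidence is exhibiting hallucination.
    return {"question": question, "evidence": perturbed}
```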
arXiv Detail & Related papers (2023-10-19T06:37:32Z) - Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data? [49.688233418425995]
Struc-Bench is a comprehensive benchmark that evaluates prominent Large Language Models (LLMs) on generating complex structured data.
We propose two innovative metrics, P-Score (Prompting Score) and H-Score (Heuristical Score).
Our experiments show that applying our structure-aware fine-tuning to LLaMA-7B leads to substantial performance gains.
arXiv Detail & Related papers (2023-09-16T11:31:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.