Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation
- URL: http://arxiv.org/abs/2510.26130v2
- Date: Tue, 04 Nov 2025 21:33:03 GMT
- Title: Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation
- Authors: Musfiqur Rahman, SayedHassan Khatoonabadi, Emad Shihab,
- Abstract summary: Large language models (LLMs) have demonstrated strong performance on function-level code generation benchmarks.<n>We introduce a benchmark derived from real-world open-source repositories to evaluate generalization under practical conditions.<n>We examine how input specification completeness and retrieval-augmented generation affect class-level correctness across multiple state-of-the-art LLMs.
- Score: 3.9189409002585567
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have demonstrated strong performance on function-level code generation benchmarks, yet real-world software development increasingly demands class-level implementations that integrate multiple methods, attributes, and dependencies within authentic project contexts. This gap between benchmark performance and practical utility raises critical questions about LLMs' readiness for production code assistance, particularly regarding their ability to generalize across familiar and novel codebases. We introduce a benchmark derived from real-world open-source repositories, comprising classes divided into seen and unseen partitions to evaluate generalization under practical conditions. We systematically examine how input specification completeness and retrieval-augmented generation affect class-level correctness across multiple state-of-the-art LLMs. Our evaluation reveals a substantial performance gap: while LLMs achieve 84 to 89% correctness on synthetic benchmarks, they attain only 25 to 34% on real-world class tasks, with minimal distinction between familiar and novel codebases. Comprehensive documentation provides marginal improvements (1 to 3%), whereas retrieval augmentation yields greater gains (4 to 7%) by supplying concrete implementation patterns. Error analysis identifies AttributeError, TypeError, and AssertionError as dominant failure modes, with distinct patterns between synthetic and real-world scenarios. These findings provide actionable insights for enhancing context modelling, documentation strategies, and retrieval integration in production code assistance tools.
Related papers
- Enhancing the Medical Context-Awareness Ability of LLMs via Multifaceted Self-Refinement Learning [49.559151128219725]
Large language models (LLMs) have shown great promise in the medical domain, achieving strong performance on several benchmarks.<n>However, they continue to underperform in real-world medical scenarios, which often demand stronger context-awareness.<n>We propose Multifaceted Self-Refinement (MuSeR), a data-driven approach that enhances LLMs' context-awareness along three key facets.
arXiv Detail & Related papers (2025-11-13T08:13:23Z) - Evaluating Large Language Models for Functional and Maintainable Code in Industrial Settings: A Case Study at ASML [3.5515013986822073]
We present a case study conducted in collaboration with the leveling department at A.<n>We investigate the performance of LLMs in generating functional, maintainable code within a closed, highly specialized software environment.<n>The findings reveal that prompting techniques and model size have a significant impact on output quality.
arXiv Detail & Related papers (2025-09-15T19:39:26Z) - Benchmarking LLMs for Unit Test Generation from Real-World Functions [34.70460519870186]
We introduce ULT (UnLeakedTestbench), a new benchmark specifically designed for function-level unit test generation from real-world Python functions.<n>With 3,909 carefully selected function-level tasks, ULT provides a more realistic and challenging evaluation of LLMs' test generation capabilities.<n>Our evaluation results demonstrate that ULT is significantly more challenging.
arXiv Detail & Related papers (2025-08-01T08:08:26Z) - SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories? [32.67971774793393]
SWE-Perf is the first benchmark designed to evaluate Large Language Models (LLMs) on code performance optimization tasks within authentic repository contexts.<n>SWE-Perf comprises 140 carefully curated instances, each derived from performance-improving pull requests from popular GitHub repositories.
arXiv Detail & Related papers (2025-07-16T17:05:17Z) - The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason [1.6249398255272318]
We present empirical evidence that performance gains on SWE-Bench-Verified may be partially driven by memorization rather than genuine problem-solving.<n>We show that state-of-the-art models achieve up to 76% accuracy in identifying buggy file paths using only issue descriptions, without access to repository structure.<n>These findings raise concerns about the validity of existing results and underscore the need for more robust, contamination-resistant benchmarks.
arXiv Detail & Related papers (2025-06-14T00:25:26Z) - IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios.<n>Agent performance is judged by comparing its final numerical output to the human-derived baseline.<n>Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z) - A Large-scale Class-level Benchmark Dataset for Code Generation with LLMs [3.458772578520879]
We introduce a large-scale, Python class-level dataset curated from $13,174$ real-world open-source projects.<n>The dataset contains over 842,000 class skeletons, each including class and method signatures, along with associated docstrings when available.<n>We use extracted class skeletons as prompts for GPT-4 to generate full class implementations.<n>Results show that the LLM-generated classes exhibit strong lexical and structural similarity to human-written counterparts, with average ROUGE@L, BLEU, and TSED scores of 0.80, 0.59, and 0.73, respectively.
arXiv Detail & Related papers (2025-04-22T03:33:57Z) - The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance? [1.3810901729134184]
Large Language Models (LLMs) excel at standardized tests while failing to demonstrate genuine language understanding and adaptability.<n>Our systematic analysis of NLP evaluation frameworks reveals pervasive vulnerabilities across the evaluation spectrum.<n>We lay the groundwork for new evaluation methods that resist manipulation, minimize data contamination, and assess domain-specific tasks.
arXiv Detail & Related papers (2024-12-02T20:49:21Z) - ClassEval-T: Evaluating Large Language Models in Class-Level Code Translation [19.69195067838796]
We construct a class-level code translation benchmark, ClassEval-T, and make the first attempt to extensively assess recent LLMs' performance on class-level code translation.<n>It cost us 360 person-hours to accomplish the manual migration to Java and C++ with complete code samples and associated test suites.<n> Experimental results demonstrate a remarkable performance drop compared with the most widely studied method-level code translation benchmark.
arXiv Detail & Related papers (2024-11-09T11:13:14Z) - SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [92.62952504133926]
This study evaluated the performance of three leading closed-source LLMs and six popular open-source LLMs on three commonly used benchmarks.<n>We developed a taxonomy of bugs for incorrect codes and analyzed the root cause for common bug types.<n>We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code.
arXiv Detail & Related papers (2024-07-08T17:27:17Z) - DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z) - Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z) - Class-Level Code Generation from Natural Language Using Iterative, Tool-Enhanced Reasoning over Repository [4.767858874370881]
We introduce RepoClassBench, a benchmark designed to rigorously evaluate LLMs in generating class-level code within real-world repositories.
RepoClassBench includes "Natural Language to Class generation" tasks across Java, Python & C# from a selection of repositories.
We introduce Retrieve-Repotools-Reflect (RRR), a novel approach that equips LLMs with static analysis tools to iteratively navigate & reason about repository-level context.
arXiv Detail & Related papers (2024-04-22T03:52:54Z) - Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z) - LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.