Related papers: Where Do LLMs Still Struggle? An In-Depth Analysis of Code Generation Benchmarks

Where Do LLMs Still Struggle? An In-Depth Analysis of Code Generation Benchmarks

URL: http://arxiv.org/abs/2511.04355v1
Date: Thu, 06 Nov 2025 13:38:03 GMT
Title: Where Do LLMs Still Struggle? An In-Depth Analysis of Code Generation Benchmarks
Authors: Amir Molzam Sharifloo, Maedeh Heydari, Parsa Kazerooni, Daniel Maninger, Mira Mezini,
Abstract summary: Large Language Models (LLMs) have achieved remarkable success in code generation.<n>We examined code generation tasks across four popular benchmarks, identifying those that major LLMs are most likely to fail.<n>Our analysis revealed four recurring patterns of weaknesses in LLMs, as well as common complications within benchmark tasks that most often lead to failure.
Score: 2.979785960419084
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have achieved remarkable success in code generation, and the race to improve their performance has become a central focus of AI research. Benchmarks and leaderboards are increasingly popular, offering quantitative rankings of LLMs. However, they provide limited insight into the tasks that LLMs consistently fail to solve - information that is crucial for understanding current limitations and guiding the development of more capable models. To address this gap, we examined code generation tasks across four popular benchmarks, identifying those that major LLMs are most likely to fail. To understand the causes of these failures, we investigated whether the static complexity of solution code contributes to them, followed by a systematic inspection of 114 tasks that LLMs consistently struggled with. Our analysis revealed four recurring patterns of weaknesses in LLMs, as well as common complications within benchmark tasks that most often lead to failure.

Related papers

LLM-based Vulnerability Discovery through the Lens of Code Metrics [6.339440992743381]
Large language models (LLMs) excel in many tasks of software engineering.<n> progress in leveraging them for vulnerability discovery has stalled in recent years.
arXiv Detail & Related papers (2025-09-23T15:03:05Z)
LLM4VV: Evaluating Cutting-Edge LLMs for Generation and Evaluation of Directive-Based Parallel Programming Model Compiler Tests [7.6818904666624395]
This paper proposes a dual-LLM system and experiments with the usage of LLMs for the generation of compiler tests.<n>It is evident that LLMs possess the promising potential to generate quality compiler tests and verify them automatically.
arXiv Detail & Related papers (2025-07-29T02:34:28Z)
Truly Assessing Fluid Intelligence of Large Language Models through Dynamic Reasoning Evaluation [106.17986469245302]
Large language models (LLMs) have demonstrated impressive reasoning capacities that mirror human-like thinking.<n>Existing reasoning benchmarks either focus on domain-specific knowledge (crystallized intelligence) or lack interpretability.<n>We propose DRE-Bench, a dynamic reasoning evaluation benchmark grounded in a hierarchical cognitive framework.
arXiv Detail & Related papers (2025-06-03T09:01:08Z)
Generative Evaluation of Complex Reasoning in Large Language Models [39.195491367590485]
We introduce KUMO, a generative evaluation framework designed specifically for assessing reasoning in large language models (LLMs)<n>Through an automated pipeline, KUMO continuously generates novel tasks across open-ended domains, compelling models to demonstrate genuine generalization rather than superhuman memorization.<n>We evaluate 23 state-of-the-art LLMs on 5,000 tasks across 100 domains created by KUMO, benchmarking their reasoning abilities against university students.
arXiv Detail & Related papers (2025-04-03T17:54:18Z)
Control Large Language Models via Divide and Conquer [94.48784966256463]
This paper investigates controllable generation for large language models (LLMs) with prompt-based control, focusing on Lexically Constrained Generation (LCG) We evaluate the performance of LLMs on satisfying lexical constraints with prompt-based control, as well as their efficacy in downstream applications.
arXiv Detail & Related papers (2024-10-06T21:20:06Z)
Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses. Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives. The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z)
What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [92.62952504133926]
This study evaluated the performance of three leading closed-source LLMs and six popular open-source LLMs on three commonly used benchmarks.<n>We developed a taxonomy of bugs for incorrect codes and analyzed the root cause for common bug types.<n>We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks. LLMs are prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning. We introduce Q*, a framework for guiding LLMs decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z)
Evaluation and Improvement of Fault Detection for Large Language Models [30.760472387136954]
This paper investigates the effectiveness of existing fault detection methods for large language models (LLMs) We propose textbfMuCS, a prompt textbfMutation-based prediction textbfConfidence textbfSmoothing framework to boost the fault detection capability of existing methods.
arXiv Detail & Related papers (2024-04-14T07:06:12Z)
TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety. Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs. We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.