CODECRASH: Stress Testing LLM Reasoning under Structural and Semantic Perturbations
- URL: http://arxiv.org/abs/2504.14119v1
- Date: Sat, 19 Apr 2025 00:40:28 GMT
- Title: CODECRASH: Stress Testing LLM Reasoning under Structural and Semantic Perturbations
- Authors: Man Ho Lam, Chaozheng Wang, Jen-tse Huang, Michael R. Lyu,
- Abstract summary: We present CodeCrash, a unified benchmark that evaluates robustness under code structural and textual distraction perturbations.<n>We evaluate seventeen Large Language Models (LLMs) using direct and Chain-of-Thought inference.<n>Our findings reveal the fragility of LLMs under structural noise and the inherent reliance on natural language cues.
- Score: 36.60702578561009
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have recently showcased strong capabilities in code-related tasks, yet their robustness in code comprehension and reasoning remains underexplored. In this paper, we present CodeCrash, a unified benchmark that evaluates LLM robustness under code structural and textual distraction perturbations, applied to two established benchmarks -- CRUXEval and LiveCodeBench -- across both input and output prediction tasks. We evaluate seventeen LLMs using direct and Chain-of-Thought inference to systematically analyze their robustness, identify primary reasons for performance degradation, and highlight failure modes. Our findings reveal the fragility of LLMs under structural noise and the inherent reliance on natural language cues, highlighting critical robustness issues of LLMs in code execution and understanding. Additionally, we examine three Large Reasoning Models (LRMs) and discover the severe vulnerability of self-reflective reasoning mechanisms that lead to reasoning collapse. CodeCrash provides a principled framework for stress-testing LLMs in code understanding, offering actionable directions for future evaluation and benchmarking. The code of CodeCrash and the robustness leaderboard are publicly available at https://donaldlamnl.github.io/CodeCrash/ .
Related papers
- MOCHA: Are Code Language Models Robust Against Multi-Turn Malicious Coding Prompts? [12.213189431386478]
We introduce code decomposition attacks, where a malicious coding task is broken down into seemingly benign subtasks to evade safety filters.<n>To facilitate systematic evaluation, we introduce benchmarkname, a large-scale benchmark designed to evaluate robustness of code LLMs against both single-turn and multi-turn malicious prompts.<n>Fine-tuning on MOCHA improves rejection rates while preserving coding ability, and importantly, enhances robustness on external adversarial datasets with up to 32.4% increase in rejection rates without any additional supervision.
arXiv Detail & Related papers (2025-07-25T18:11:10Z) - R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning [60.37610817226533]
Chain-of-thought (CoT) reasoning encourages step-by-step intermediate reasoning during inference.<n>CoT introduces substantial computational overhead due to its reliance on autoregressive decoding over long token sequences.<n>We present R-Stitch, a token-level, confidence-based hybrid decoding framework that accelerates CoT inference.
arXiv Detail & Related papers (2025-07-23T08:14:36Z) - When LLMs Copy to Think: Uncovering Copy-Guided Attacks in Reasoning LLMs [30.532439965854767]
Large Language Models (LLMs) have become integral to automated code analysis, enabling tasks such as vulnerability detection and code comprehension.<n>In this paper, we identify and investigate a new class of prompt-based attacks, termed Copy-Guided Attacks (CGA)<n>We show that CGA reliably induces infinite loops, premature termination, false refusals, and semantic distortions in code analysis tasks.
arXiv Detail & Related papers (2025-07-22T17:21:36Z) - Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation [0.3495246564946556]
Large Language Models (LLMs) have achieved remarkable success in tasks requiring complex reasoning.<n>Do these models truly reason, or do they merely exploit shallow statistical patterns?<n>We introduce Chain-of-Code Collapse, where we investigate the robustness of reasoning LLMs by introducing a suite of semantically faithful yet adversarially structured prompt perturbations.
arXiv Detail & Related papers (2025-06-08T02:43:46Z) - ICPC-Eval: Probing the Frontiers of LLM Reasoning with Competitive Programming Contests [85.72404266850982]
We propose textbfICPC-Eval, a top-level competitive coding benchmark designed to probing the frontiers of reasoning.<n>ICPC-Eval includes 118 carefully curated problems from 11 recent ICPC contests held in various regions of the world.<n>Results underscore the significant challenge in evaluating complex reasoning abilities.
arXiv Detail & Related papers (2025-06-05T11:20:37Z) - Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality.<n>We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z) - Everything You Wanted to Know About LLM-based Vulnerability Detection But Were Afraid to Ask [30.819697001992154]
Large Language Models are a promising tool for automated vulnerability detection.
Despite widespread adoption, a critical question remains: Are LLMs truly effective at detecting real-world vulnerabilities?
This paper challenges three widely held community beliefs: that LLMs are (i) unreliable, (ii) insensitive to code patches, and (iii) performance-plateaued across model scales.
arXiv Detail & Related papers (2025-04-18T05:32:47Z) - CRANE: Reasoning with constrained LLM generation [5.971462597321995]
We propose a reasoning-augmented constrained decoding algorithm, CRANE, which balances correctness of constrained generation with flexibility of unconstrained generation.
CRANE significantly outperforms both state-of-the-art constrained decoding strategies and standard unconstrained decoding.
arXiv Detail & Related papers (2025-02-13T08:23:42Z) - What You See Is Not Always What You Get: An Empirical Study of Code Comprehension by Large Language Models [0.5735035463793009]
We investigate the vulnerability of large language models (LLMs) to imperceptible attacks, where hidden character manipulation in source code misleads LLMs' behaviour while remaining undetectable to human reviewers.<n>These attacks include coding reordering, invisible coding characters, code deletions, and code homoglyphs.<n>Our findings confirm the susceptibility of LLMs to imperceptible coding character attacks, while different LLMs present different negative correlations between perturbation magnitude and performance.
arXiv Detail & Related papers (2024-12-11T04:52:41Z) - OCEAN: Offline Chain-of-thought Evaluation and Alignment in Large Language Models [68.17018458283651]
This work focuses on the offline evaluation of the chain-of-thought capabilities of LLMs.
We use knowledge graphs (e.g., Wikidata5m) to provide feedback on the generated chain of thoughts.
We show how to optimize LLMs based on the proposed evaluation method.
arXiv Detail & Related papers (2024-10-31T07:48:44Z) - Understanding Defects in Generated Codes by Language Models [0.669087470775851]
This study categorizes and analyzes 367 identified defects from code snippets generated by Large Language Models.
Error categories indicate key areas where LLMs frequently fail, underscoring the need for targeted improvements.
This paper implemented five prompt engineering techniques, including Scratchpad Prompting, Program of Thoughts Prompting, Chain-of-Thought Prompting, Chain-of-Thought Prompting, and Structured Chain-of-Thought Prompting.
arXiv Detail & Related papers (2024-08-23T21:10:09Z) - Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z) - What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated as compared to canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z) - RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models [12.112914393948415]
We present RUPBench, a benchmark designed to evaluate large language models (LLMs) across diverse reasoning tasks.
Our benchmark incorporates 15 reasoning datasets, categorized into commonsense, arithmetic, logical, and knowledge-intensive reasoning.
By examining the performance of state-of-the-art LLMs such as GPT-4o, Llama3, Phi-3, and Gemma on both original and perturbed datasets, we provide a detailed analysis of their robustness and error patterns.
arXiv Detail & Related papers (2024-06-16T17:26:44Z) - ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs [95.15814662348245]
Compositional Reasoning (CR) entails grasping the significance of attributes, relations, and word order.
Recent Vision-Language Models (VLMs) have demonstrated remarkable proficiency in such reasoning tasks.
arXiv Detail & Related papers (2024-06-12T12:54:27Z) - Reasoning Runtime Behavior of a Program with LLM: How Far Are We? [25.451857140926943]
Large language models for code (i.e., code LLMs) have shown strong code understanding and generation capabilities.
Code reasoning is one of the most essential abilities of code LLMs.
We propose a framework, namely REval, for evaluating code reasoning abilities and consistency of code LLMs with program execution.
arXiv Detail & Related papers (2024-03-25T05:37:16Z) - Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions [47.83142414018448]
We focus on two popular reasoning tasks: arithmetic reasoning and code generation.
We introduce (i) a general ontology of perturbations for math and coding questions, (ii) a semi-automatic method to apply these perturbations, and (iii) two datasets.
We show a significant performance drop across all the models against perturbed questions.
arXiv Detail & Related papers (2024-01-17T18:13:07Z) - Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z) - Benchmarking and Explaining Large Language Model-based Code Generation:
A Causality-Centric Approach [12.214585409361126]
Large language models (LLMs)- based code generation is a complex and powerful black-box model.
We propose a novel causal graph-based representation of the prompt and the generated code.
We illustrate the insights that our framework can provide by studying over 3 popular LLMs with over 12 prompt adjustment strategies.
arXiv Detail & Related papers (2023-10-10T14:56:26Z) - Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.