Related papers: How Accurately Do Large Language Models Understand Code?

How Accurately Do Large Language Models Understand Code?

URL: http://arxiv.org/abs/2504.04372v2
Date: Wed, 09 Apr 2025 18:27:43 GMT
Title: How Accurately Do Large Language Models Understand Code?
Authors: Sabaat Haroon, Ahmad Faraz Khan, Ahmad Humayun, Waris Gill, Abdul Haddi Amjad, Ali R. Butt, Mohammad Taha Khan, Muhammad Ali Gulzar,
Abstract summary: Large Language Models (LLMs) are increasingly used in post-development tasks such as code repair and testing.<n> Quantifying code comprehension is challenging due to its abstract nature and the lack of a standardized metric.<n>This paper presents the first large-scale empirical investigation into LLMs' ability to understand code.
Score: 4.817546726074033
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) are increasingly used in post-development tasks such as code repair and testing. A key factor in these tasks' success is the model's deep understanding of code. However, the extent to which LLMs truly understand code remains largely unevaluated. Quantifying code comprehension is challenging due to its abstract nature and the lack of a standardized metric. Previously, this was assessed through developer surveys, which are not feasible for evaluating LLMs. Existing LLM benchmarks focus primarily on code generation, fundamentally different from code comprehension. Additionally, fixed benchmarks quickly become obsolete as they become part of the training data. This paper presents the first large-scale empirical investigation into LLMs' ability to understand code. Inspired by mutation testing, we use an LLM's fault-finding ability as a proxy for its deep code understanding. This approach is based on the insight that a model capable of identifying subtle functional discrepancies must understand the code well. We inject faults in real-world programs and ask the LLM to localize them, ensuring the specifications suffice for fault localization. Next, we apply semantic-preserving code mutations (SPMs) to the faulty programs and test whether the LLMs still locate the faults, verifying their confidence in code understanding. We evaluate nine popular LLMs on 600,010 debugging tasks from 670 Java and 637 Python programs. We find that LLMs lose the ability to debug the same bug in 78% of faulty programs when SPMs are applied, indicating a shallow understanding of code and reliance on features irrelevant to semantics. We also find that LLMs understand code earlier in the program better than later. This suggests that LLMs' code comprehension remains tied to lexical and syntactic features due to tokenization designed for natural languages, which overlooks code semantics.

Related papers

CodeSimpleQA: Scaling Factuality in Code Large Language Models [55.705748501461294]
We present CodeSimpleQA, a comprehensive benchmark designed to evaluate the factual accuracy of code LLMs in answering code-related questions.<n>We also create CodeSimpleQA-Instruct, a large-scale instruction corpus with 66M samples, and develop a post-training framework combining supervised fine-tuning and reinforcement learning.
arXiv Detail & Related papers (2025-12-22T14:27:17Z)
The Fools are Certain; the Wise are Doubtful: Exploring LLM Confidence in Code Completion [4.215010577170175]
We evaluate the confidence of Large Language Models (LLMs) when generating code by measuring code perplexity.<n>We find that strongly-typed languages exhibit lower perplexity than dynamically typed languages.<n> Perl appears universally high in perplexity, whereas Java appears low.
arXiv Detail & Related papers (2025-08-22T06:51:13Z)
Uncovering Systematic Failures of LLMs in Verifying Code Against Natural Language Specifications [0.6813925418351435]
Large language models (LLMs) have become essential tools in software development, widely used for requirements engineering, code generation and review tasks.<n>In this paper, we uncover a systematic failure of LLMs in evaluating whether code aligns with natural language requirements.<n>Our results reveal that LLMs frequently misclassify correct code implementations as either not satisfying requirements'' or containing potential defects.
arXiv Detail & Related papers (2025-08-17T13:07:26Z)
On the Effectiveness of LLM-as-a-judge for Code Generation and Summarization [54.965787768076254]
Large Language Models have been recently exploited as judges for complex natural language processing tasks, such as Q&A.<n>We study the effectiveness of LLMs-as-a-judge for two code-related tasks, namely code generation and code summarization.
arXiv Detail & Related papers (2025-07-22T13:40:26Z)
LONGCODEU: Benchmarking Long-Context Language Models on Long Code Understanding [69.93924733846576]
Long code understanding benchmark LONGCODEU to evaluate LCLMs' long code understanding ability required for practical applications. LCLMs' performance drops dramatically when the long code length is greater than 32K, falling far short of their claimed 128K-1M context windows. Our study provides valuable insights for optimizing LCLMs and driving advancements in software engineering.
arXiv Detail & Related papers (2025-03-06T12:02:31Z)
Detection of LLM-Paraphrased Code and Identification of the Responsible LLM Using Coding Style Features [5.774786149181392]
alicious users can exploit large language models (LLMs) to produce paraphrased versions of proprietary code that closely resemble the original.<n>We develop LPcodedec, a detection method that identifies paraphrase relationships between human-written and LLM-generated code.<n> LPcodedec outperforms the best baselines in two tasks, improving F1 scores by 2.64% and 15.17% while achieving speedups of 1,343x and 213x, respectively.
arXiv Detail & Related papers (2025-02-25T00:58:06Z)
LLMs' Understanding of Natural Language Revealed [0.0]
Large language models (LLMs) are the result of a massive experiment in bottom-up, data-driven reverse engineering of language at scale. We will focus on testing LLMs for their language understanding capabilities, their supposed forte.
arXiv Detail & Related papers (2024-07-29T01:21:11Z)
CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification [73.66920648926161]
We introduce the concept of code hallucinations and propose a classification method for code hallucination based on execution verification. We present a dynamic detection algorithm called CodeHalu designed to detect and quantify code hallucinations. We also introduce the CodeHaluEval benchmark, which includes 8,883 samples from 699 tasks, to systematically and quantitatively evaluate code hallucinations.
arXiv Detail & Related papers (2024-04-30T23:56:38Z)
InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models [56.723509505549536]
InfiBench is the first large-scale freeform question-answering (QA) benchmark for code to our knowledge. It comprises 234 carefully selected high-quality Stack Overflow questions that span across 15 programming languages. We conduct a systematic evaluation for over 100 latest code LLMs on InfiBench, leading to a series of novel and insightful findings.
arXiv Detail & Related papers (2024-03-11T02:06:30Z)
CodeMind: Evaluating Large Language Models for Code Reasoning [6.819757372634151]
Large Language Models (LLMs) have been widely used to automate programming tasks.<n>This paper introduces CodeMind, a framework designed to gauge the code reasoning abilities of LLMs.
arXiv Detail & Related papers (2024-02-15T02:24:46Z)
Assured LLM-Based Software Engineering [51.003878077888686]
This paper is an outline of the content of the keynote by Mark Harman at the International Workshop on Interpretability, Robustness, and Benchmarking in Neural Software Engineering, Monday 15th April 2024, Lisbon, Portugal.
arXiv Detail & Related papers (2024-02-06T20:38:46Z)
Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs [65.2379940117181]
We introduce code prompting, a chain of prompts that transforms a natural language problem into code. We find that code prompting exhibits a high-performance boost for multiple LLMs. Our analysis of GPT 3.5 reveals that the code formatting of the input problem is essential for performance improvement.
arXiv Detail & Related papers (2024-01-18T15:32:24Z)
Mutation-based Consistency Testing for Evaluating the Code Understanding Capability of LLMs [5.549095839198671]
Large Language Models (LLMs) have shown remarkable capabilities in processing both natural and programming languages. We propose a novel method to assess the code understanding performance of LLMs, particularly focusing on subtle differences between code and its descriptions. We apply different types of code mutations, such as operator replacement and statement deletion, to generate inconsistent code-description pairs. We conduct a case study on the two popular LLMs, GPT-3.5 and GPT-4, using the state-of-the-art code generation benchmark, HumanEval-X.
arXiv Detail & Related papers (2024-01-11T14:27:43Z)
If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code) Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z)
CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models [43.655927559990616]
We propose CodeApex, a benchmark dataset focusing on the programming comprehension, code generation, and code correction abilities of LLMs. We evaluate 12 widely used LLMs, including both general-purpose and specialized models. GPT-4 exhibits the best programming capabilities, achieving approximate accuracy of 69%, 54%, and 66% on the three tasks, respectively.
arXiv Detail & Related papers (2023-09-05T04:12:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.