When Names Disappear: Revealing What LLMs Actually Understand About Code
- URL: http://arxiv.org/abs/2510.03178v1
- Date: Fri, 03 Oct 2025 16:53:13 GMT
- Title: When Names Disappear: Revealing What LLMs Actually Understand About Code
- Authors: Cuong Chi Le, Minh V. T. Pham, Cuong Duc Van, Hoang N. Phan, Huy N. Phan, Tien N. Nguyen
- Abstract summary: Large Language Models (LLMs) achieve strong results on code tasks, but how they derive program meaning remains unclear. We argue that code communicates through two channels: structural semantics, which define formal behavior, and human-interpretable naming, which conveys intent. Removing the naming channel severely degrades intent-level tasks such as summarization, where models regress to line-by-line descriptions.
- Score: 7.691597373321699
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) achieve strong results on code tasks, but how they derive program meaning remains unclear. We argue that code communicates through two channels: structural semantics, which define formal behavior, and human-interpretable naming, which conveys intent. Removing the naming channel severely degrades intent-level tasks such as summarization, where models regress to line-by-line descriptions. Surprisingly, we also observe consistent reductions on execution tasks that should depend only on structure, revealing that current benchmarks reward memorization of naming patterns rather than genuine semantic reasoning. To disentangle these effects, we introduce a suite of semantics-preserving obfuscations and show that they expose identifier leakage across both summarization and execution. Building on these insights, we release ClassEval-Obf, an obfuscation-enhanced benchmark that systematically suppresses naming cues while preserving behavior. Our results demonstrate that ClassEval-Obf reduces inflated performance gaps, weakens memorization shortcuts, and provides a more reliable basis for assessing LLMs' code understanding and generalization.
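The core idea of a semantics-preserving obfuscation can be sketched in a few lines: rename every user-defined identifier to an opaque placeholder while leaving builtins and control flow untouched, so the program's behavior is unchanged but the naming channel is suppressed. The sketch below is illustrative only, not the paper's actual ClassEval-Obf pipeline; the `IdentifierObfuscator` class and the `v0, v1, ...` naming scheme are assumptions for this example.

```python
import ast
import builtins

class IdentifierObfuscator(ast.NodeTransformer):
    """Rename user-defined names to opaque placeholders, preserving behavior."""

    def __init__(self):
        self.mapping = {}

    def _rename(self, name):
        if hasattr(builtins, name):
            return name  # keep builtins (sum, len, ...) so the code still runs
        if name not in self.mapping:
            self.mapping[name] = f"v{len(self.mapping)}"
        return self.mapping[name]

    def visit_Name(self, node):
        node.id = self._rename(node.id)
        return node

    def visit_FunctionDef(self, node):
        node.name = self._rename(node.name)
        self.generic_visit(node)  # also rename args and body names
        return node

    def visit_arg(self, node):
        node.arg = self._rename(node.arg)
        return node

src = """
def average(values):
    total = sum(values)
    return total / len(values)
"""
tree = IdentifierObfuscator().visit(ast.parse(src))
obfuscated = ast.unparse(tree)
print(obfuscated)

# Behavior is preserved: both versions compute the same result.
env_a, env_b = {}, {}
exec(src, env_a)
exec(obfuscated, env_b)
assert env_a["average"]([1, 2, 3]) == env_b["v0"]([1, 2, 3])
```

A real obfuscation suite would also need to handle attributes, imports, string-based reflection, and scoping corner cases; this sketch only covers simple function definitions.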
Related papers
- The Semantic Trap: Do Fine-tuned LLMs Learn Vulnerability Root Cause or Just Functional Pattern? [14.472036099680961]
We propose TrapEval, a comprehensive evaluation framework designed to disentangle vulnerability root cause from functional pattern. We fine-tune five representative state-of-the-art LLMs across three model families and evaluate them under cross-dataset testing, semantics-preserving transformations, and varying degrees of semantic gap measured by CodeBLEU. Our findings serve as a wake-up call: high benchmark scores on traditional datasets may be illusory, masking the model's inability to understand the true causal logic of vulnerabilities.
arXiv Detail & Related papers (2026-01-30T07:19:17Z) - Towards Benchmarking Design Pattern Detection Under Obfuscation: Reproducing and Evaluating Attention-Based Detection Method [2.1843439591862333]
We reproduce DPDAtt, an attention-based design pattern detection approach using learning-based classifiers, and evaluate its performance under obfuscation. Our findings reveal that these trained classifiers depend significantly on superficial syntactic features, leading to substantial misclassification when such cues are removed. This work highlights the need for more robust detection tools capable of capturing deeper semantic meanings in source code.
arXiv Detail & Related papers (2025-12-08T06:10:34Z) - Weakly-Supervised Contrastive Learning for Imprecise Class Labels [50.57424331797865]
We introduce the concept of "continuous semantic similarity" to define positive and negative pairs. We propose a graph-theoretic framework for weakly-supervised contrastive learning. Our framework is highly versatile and can be applied to many weakly-supervised learning scenarios.
arXiv Detail & Related papers (2025-05-28T06:50:40Z) - Sense and Sensitivity: Examining the Influence of Semantic Recall on Long Context Code Reasoning [9.719614935865906]
This paper investigates the reasoning ability of Large Language Models (LLMs) over code snippets within large repositories. We differentiate between lexical code recall (verbatim retrieval) and semantic code recall (remembering what the code does). Our evaluation of state-of-the-art LLMs reveals a significant drop in code reasoning accuracy as a code snippet approaches the middle of the input context.
arXiv Detail & Related papers (2025-05-19T16:56:31Z) - Memorize or Generalize? Evaluating LLM Code Generation with Code Rewriting [54.48306552577881]
We argue that large language models (LLMs) are mostly doing memorization (i.e., replicating or reusing large parts of their training data) rather than generalization. Existing evaluations largely use task correctness as a proxy, neglecting surface/structural similarity and thereby conflating benign reuse of repeated code with harmful recall. We propose the Memorization Risk Index (MRI), a normalized score that combines two signals: (i) how similar the model's answer for the rewritten task is to the original ground-truth solution, and (ii) how much performance drops from the original task to its rewritten counterpart.
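The abstract names MRI's two input signals but not its exact normalization, so the toy combination below is a hypothetical illustration, not the paper's formula; the function name and the geometric-mean choice are assumptions. The intuition it captures: risk is high only when the model's answer on the rewritten task still resembles the original solution and performance drops sharply.

```python
def memorization_risk_index(sim_to_original, perf_drop):
    """Toy combination of the two MRI signals (hypothetical normalization).

    sim_to_original: similarity of the model's answer on the REWRITTEN task
        to the ORIGINAL ground-truth solution, clipped to [0, 1].
    perf_drop: performance drop from the original task to its rewritten
        counterpart, clipped to [0, 1].
    """
    sim = min(max(sim_to_original, 0.0), 1.0)
    drop = min(max(perf_drop, 0.0), 1.0)
    # Geometric mean: the score is high only when BOTH signals are high,
    # i.e., the model reproduces the memorized solution AND fails to adapt.
    return (sim * drop) ** 0.5

# A model that copies the original answer (sim 0.9) and loses much of its
# accuracy on the rewrite (drop 0.4) scores higher than one that adapts.
print(memorization_risk_index(0.9, 0.4))
print(memorization_risk_index(0.2, 0.4))
```

The paper's actual definition may weight or normalize the signals differently; consult it before using any such score.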
arXiv Detail & Related papers (2025-03-04T05:39:24Z) - Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders [29.356200147371275]
Large language models (LLMs) excel at handling human queries, but they can occasionally generate flawed or unexpected responses. We propose using a fixed vocabulary set for feature interpretations and designing a mutual information-based objective. We also propose two runtime steering strategies that adjust the learned feature activations based on their corresponding explanations.
arXiv Detail & Related papers (2025-02-21T16:36:42Z) - Disentangling Memory and Reasoning Ability in Large Language Models [97.26827060106581]
We propose a new inference paradigm that decomposes the complex inference process into two distinct and clear actions. Our experiment results show that this decomposition improves model performance and enhances the interpretability of the inference process.
arXiv Detail & Related papers (2024-11-20T17:55:38Z) - Mitigating Semantic Leakage in Cross-lingual Embeddings via Orthogonality Constraint [10.747248747425957]
Current disentangled representation learning methods suffer from semantic leakage. We propose a novel training objective, ORthogonAlity Constraint LEarning (ORACLE). ORACLE builds upon two components: intra-class clustering and inter-class separation. We demonstrate that training with the ORACLE objective effectively reduces semantic leakage and enhances semantic alignment within the embedding space.
arXiv Detail & Related papers (2024-09-24T02:01:52Z) - Fantastic Semantics and Where to Find Them: Investigating Which Layers of Generative LLMs Reflect Lexical Semantics [50.982315553104975]
We investigate the bottom-up evolution of lexical semantics for a popular large language model, namely Llama2.
Our experiments show that the representations in lower layers encode lexical semantics, while the higher layers, with weaker semantic induction, are responsible for prediction.
This is in contrast to models with discriminative objectives, such as mask language modeling, where the higher layers obtain better lexical semantics.
arXiv Detail & Related papers (2024-03-03T13:14:47Z) - Contrastive Instruction Tuning [61.97704869248903]
We propose Contrastive Instruction Tuning to maximize the similarity between semantically equivalent instruction-instance pairs.
Experiments on the PromptBench benchmark show that CoIN consistently improves LLMs' robustness to unseen instructions with variations across character, word, sentence, and semantic levels by an average of +2.5% in accuracy.
arXiv Detail & Related papers (2024-02-17T00:09:32Z) - Waffling around for Performance: Visual Classification with Random Words and Broad Concepts [121.60918966567657]
WaffleCLIP is a framework for zero-shot visual classification which simply replaces LLM-generated descriptors with random character and word descriptors.
We conduct an extensive experimental study on the impact and shortcomings of additional semantics introduced with LLM-generated descriptors.
arXiv Detail & Related papers (2023-06-12T17:59:48Z) - Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds [59.71218039095155]
We evaluate language understanding capacities on simple inference tasks that most humans find trivial.
We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments.
The models exhibit moderate to low performance on these evaluation sets.
arXiv Detail & Related papers (2023-05-24T06:41:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.