Related papers: Beyond Memorization: Evaluating the True Type Inference Capabilities of LLMs for Java Code Snippets

Beyond Memorization: Evaluating the True Type Inference Capabilities of LLMs for Java Code Snippets

URL: http://arxiv.org/abs/2503.04076v1
Date: Thu, 06 Mar 2025 04:13:40 GMT
Title: Beyond Memorization: Evaluating the True Type Inference Capabilities of LLMs for Java Code Snippets
Authors: Yiwen Dong, Zhenyang Xu, Yongqiang Tian, Chengnian Sun,
Abstract summary: Recent studies have leveraged Large Language Models for type inference on code snippets, showing promising results.<n>However, these results are potentially affected by data leakage, as the benchmark suite (StatType-SO) has been public on GitHub since 2017.<n>We conducted a three-pronged evaluation to comprehensively assess LLMs' type inference capabilities on Java code snippets.
Score: 3.152174935904172
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Type inference is a crucial task for reusing online code snippets, often found on platforms like StackOverflow, which frequently lack essential type information such as fully qualified names (FQNs) and required libraries. Recent studies have leveraged Large Language Models (LLMs) for type inference on code snippets, showing promising results. However, these results are potentially affected by data leakage, as the benchmark suite (StatType-SO) has been public on GitHub since 2017 (full suite in 2023). Thus, it is uncertain whether LLMs' strong performance reflects genuine code semantics understanding or a mere retrieval of ground truth from training data. To comprehensively assess LLMs' type inference capabilities on Java code snippets, we conducted a three-pronged evaluation. First, utilizing Thalia, a program synthesis technique, we created ThaliaType--a new, unseen dataset for type inference evaluation. On unseen snippets, LLM performance dropped significantly, with up to a 59% decrease in precision and 72% in recall. Second, we developed semantic-preserving transformations that significantly degraded LLMs' type inference performance, revealing weaknesses in understanding code semantics. Third, we used delta debugging to identify the minimal syntax elements sufficient for LLM inference. While type inference primarily involves inferring FQNs for types in the code snippet, LLMs correctly infer FQNs even when the types were absent from the snippets, suggesting a reliance on knowledge from training instead of thoroughly analyzing the snippets. Our findings indicate that LLMs' strong past performance likely stemmed from data leakage, rather than a genuine understanding of the semantics of code snippets. Our findings highlight the crucial need for carefully designed benchmarks using unseen code snippets to assess the true capabilities of LLMs for type inference tasks.

Related papers

Leveraging LLMs for Utility-Focused Annotation: Reducing Manual Effort for Retrieval and RAG [69.51637252264277]
We investigate whether Large Language Models (LLMs) can effectively replace human annotations in training retrieval models. Our experiments show that retrievers trained on utility-focused annotations significantly outperform those trained on human annotations in the out-of-domain setting. Just 20% human-annotated data enables retrievers trained with utility-focused annotations to match the performance of models trained entirely with human annotations.
arXiv Detail & Related papers (2025-04-07T16:05:52Z)
How Accurately Do Large Language Models Understand Code? [4.817546726074033]
Large Language Models (LLMs) are increasingly used in post-development tasks such as code repair and testing. Quantifying code comprehension is challenging due to its abstract nature and the lack of a standardized metric. This paper presents the first large-scale empirical investigation into LLMs' ability to understand code.
arXiv Detail & Related papers (2025-04-06T05:59:29Z)
Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs) We find that fine-tuning existing text embedding models on LLM-generated texts yields excellent classification accuracy. We leverage LLM as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
arXiv Detail & Related papers (2025-02-17T18:59:02Z)
Generating Unseen Code Tests In Infinitum [1.0674604700001968]
We present a method for creating benchmark variations that generalize across coding tasks and programming languages. We implement one benchmark, called textitauto-regression, for the task of text-to-code generation in Python.
arXiv Detail & Related papers (2024-07-29T08:11:20Z)
Source Code Summarization in the Era of Large Language Models [23.715005053430957]
Large language models (LLMs) have led to a great boost in the performance of code-related tasks. In this paper, we undertake a systematic and comprehensive study on code summarization in the era of LLMs.
arXiv Detail & Related papers (2024-07-09T05:48:42Z)
Bug In the Code Stack: Can LLMs Find Bugs in Large Python Code Stacks [1.3586572110652484]
This study explores the capabilities of Large Language Models (LLMs) in retrieving contextual information from large text documents. Our benchmark, Bug In The Code Stack (BICS), is designed to assess the ability of LLMs to identify simple syntax bugs within large source code. Our findings reveal three key insights: (1) code-based environments pose significantly more challenge compared to text-based environments for retrieval tasks, (2) there is a substantial performance disparity among different models, and (3) there is a notable correlation between longer context lengths and performance degradation.
arXiv Detail & Related papers (2024-06-21T17:37:10Z)
Not All Layers of LLMs Are Necessary During Inference [68.88671495401483]
We show that for some tasks, Large Language Models can achieve results comparable to the final output at some intermediate layers. We propose a simple yet effective algorithm named AdaInfer to adaptively terminate the inference process for an input instance.
arXiv Detail & Related papers (2024-03-04T16:23:58Z)
Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs [65.2379940117181]
We introduce code prompting, a chain of prompts that transforms a natural language problem into code. We find that code prompting exhibits a high-performance boost for multiple LLMs. Our analysis of GPT 3.5 reveals that the code formatting of the input problem is essential for performance improvement.
arXiv Detail & Related papers (2024-01-18T15:32:24Z)
Which Syntactic Capabilities Are Statistically Learned by Masked Language Models for Code? [51.29970742152668]
We highlight relying on accuracy-based measurements may lead to an overestimation of models' capabilities. To address these issues, we introduce a technique called SyntaxEval in Syntactic Capabilities.
arXiv Detail & Related papers (2024-01-03T02:44:02Z)
ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases. We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets. Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
Large Language Model-Aware In-Context Learning for Code Generation [75.68709482932903]
Large language models (LLMs) have shown impressive in-context learning (ICL) ability in code generation. We propose a novel learning-based selection approach named LAIL (LLM-Aware In-context Learning) for code generation.
arXiv Detail & Related papers (2023-10-15T06:12:58Z)
Benchmarking Causal Study to Interpret Large Language Models for Source Code [6.301373791541809]
This paper introduces a benchmarking strategy named Galeras comprised of curated testbeds for three SE tasks. We illustrate the insights of our benchmarking strategy by conducting a case study on the performance of ChatGPT under distinct prompt engineering methods.
arXiv Detail & Related papers (2023-08-23T20:32:12Z)
Coarse-Tuning Models of Code with Reinforcement Learning Feedback [0.0]
Large Language Models (LLMs) pre-trained on code have emerged as the dominant approach to program synthesis. We propose RLCF, that further trains a pre-trained LLM via reinforcement learning, using feedback from a grounding function that scores the quality of the code.
arXiv Detail & Related papers (2023-05-25T22:09:08Z)
LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset. Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.