Sense and Sensitivity: Examining the Influence of Semantic Recall on Long Context Code Reasoning
- URL: http://arxiv.org/abs/2505.13353v2
- Date: Tue, 20 May 2025 05:45:55 GMT
- Title: Sense and Sensitivity: Examining the Influence of Semantic Recall on Long Context Code Reasoning
- Authors: Adam Štorek, Mukur Gupta, Samira Hajizadeh, Prashast Srivastava, Suman Jana
- Abstract summary: This paper investigates the reasoning ability of Large Language Models (LLMs) over code snippets within large repositories. We differentiate between lexical code recall (verbatim retrieval) and semantic code recall (remembering what the code does). Our evaluation of state-of-the-art LLMs reveals a significant drop in code reasoning accuracy as a code snippet approaches the middle of the input context.
- Score: 9.719614935865906
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although modern Large Language Models (LLMs) support extremely large contexts, their effectiveness in utilizing long context for code reasoning remains unclear. This paper investigates LLM reasoning ability over code snippets within large repositories and how it relates to their recall ability. Specifically, we differentiate between lexical code recall (verbatim retrieval) and semantic code recall (remembering what the code does). To measure semantic recall, we propose SemTrace, a code reasoning technique where the impact of specific statements on output is attributable and unpredictable. We also present a method to quantify semantic recall sensitivity in existing benchmarks. Our evaluation of state-of-the-art LLMs reveals a significant drop in code reasoning accuracy as a code snippet approaches the middle of the input context, particularly with techniques requiring high semantic recall like SemTrace. Moreover, we find that lexical recall varies by granularity, with models excelling at function retrieval but struggling with line-by-line recall. Notably, a disconnect exists between lexical and semantic recall, suggesting different underlying mechanisms. Finally, our findings indicate that current code reasoning benchmarks may exhibit low semantic recall sensitivity, potentially underestimating LLM challenges in leveraging in-context information.
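The position-sweep evaluation the abstract describes can be sketched as follows. This is a hypothetical harness, not the paper's: the prompt format, the model client (`ask_model`), and the exact-match scoring are illustrative assumptions.

```python
# Hypothetical sketch of the evaluation described above: a target snippet is
# placed at varying depths within a long repository context, and reasoning
# accuracy is measured at each depth to reveal "lost in the middle" effects.

def build_context(filler_files, target_snippet, depth):
    """Insert the target snippet at a fractional depth (0.0 = start,
    0.5 = middle, 1.0 = end) among the surrounding filler code."""
    idx = int(depth * len(filler_files))
    return "\n\n".join(filler_files[:idx] + [target_snippet] + filler_files[idx:])

def accuracy_by_depth(filler_files, target_snippet, question,
                      expected, ask_model, depths):
    """Query the model with the snippet at each depth; return depth -> accuracy."""
    results = {}
    for depth in depths:
        context = build_context(filler_files, target_snippet, depth)
        answer = ask_model(context + "\n\n" + question)
        results[depth] = float(answer.strip() == expected)
    return results
```

A real harness would average over many snippets and use a reasoning task (such as the paper's SemTrace) rather than a single exact-match question, but the sweep structure is the same.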
Related papers
- Locomo-Plus: Beyond-Factual Cognitive Memory Evaluation Framework for LLM Agents [19.76627324918285]
We introduce LoCoMo-Plus, a benchmark for assessing cognitive memory under cue-trigger semantic disconnect. We show that conventional string-matching metrics and explicit task-type prompting are misaligned with such scenarios. Experiments across diverse backbone models, retrieval-based methods, and memory systems demonstrate that cognitive memory remains challenging.
arXiv Detail & Related papers (2026-02-11T10:22:35Z) - EvolMem: A Cognitive-Driven Benchmark for Multi-Session Dialogue Memory [63.84216832544323]
EvolMem is a new benchmark for assessing multi-session memory capabilities of large language models (LLMs) and agent systems. To construct the benchmark, we introduce a hybrid data synthesis framework that consists of topic-initiated generation and narrative-inspired transformations. Extensive evaluation reveals that no LLM consistently outperforms others across all memory dimensions.
arXiv Detail & Related papers (2026-01-07T03:14:42Z) - SMILE: A Composite Lexical-Semantic Metric for Question-Answering Evaluation [55.26111461168754]
We introduce SMILE: Semantic Metric Integrating Lexical Exactness, a novel approach that combines sentence-level semantic understanding with keyword-level semantic understanding and easy keyword matching. It is highly correlated with human judgments and computationally lightweight, bridging the gap between lexical and semantic evaluation.
arXiv Detail & Related papers (2025-11-21T17:30:18Z) - Evaluating Long-Term Memory for Long-Context Question Answering [100.1267054069757]
We present a systematic evaluation of memory-augmented methods using LoCoMo, a benchmark of synthetic long-context dialogues annotated for question-answering tasks. Our findings show that memory-augmented approaches reduce token usage by over 90% while maintaining competitive accuracy.
arXiv Detail & Related papers (2025-10-27T18:03:50Z) - When Names Disappear: Revealing What LLMs Actually Understand About Code [7.691597373321699]
Large Language Models (LLMs) achieve strong results on code tasks, but how they derive program meaning remains unclear. We argue that code communicates through two channels: structural semantics, which define formal behavior, and human-interpretable naming, which conveys intent. Removing the naming channel severely degrades intent-level tasks such as summarization, where models regress to line-by-line descriptions.
arXiv Detail & Related papers (2025-10-03T16:53:13Z) - LongRecall: A Structured Approach for Robust Recall Evaluation in Long-Form Text [14.211177885010029]
LongRecall is a three-stage recall evaluation framework. It decomposes answers into self-contained facts, narrows plausible candidate matches through lexical and semantic filtering, and verifies alignment. We evaluate LongRecall on three challenging long-form QA benchmarks using both human annotations and LLM-based judges.
arXiv Detail & Related papers (2025-08-20T21:41:42Z) - Is Compression Really Linear with Code Intelligence? [60.123628177110206]
Format Annealing is a lightweight, transparent training methodology designed to assess the intrinsic capabilities of pre-trained models equitably. Our empirical results reveal a fundamental logarithmic relationship between measured code intelligence and bits-per-character (BPC). Our work provides a more nuanced understanding of compression's role in developing code intelligence and contributes a robust evaluation framework in the code domain.
arXiv Detail & Related papers (2025-05-16T16:59:14Z) - Beyond Memorization: Evaluating the True Type Inference Capabilities of LLMs for Java Code Snippets [3.152174935904172]
Recent studies have leveraged Large Language Models for type inference on code snippets, showing promising results. However, these results are potentially affected by data leakage, as the benchmark suite (StatType-SO) has been public on GitHub since 2017. We conducted a three-pronged evaluation to comprehensively assess LLMs' type inference capabilities on Java code snippets.
arXiv Detail & Related papers (2025-03-06T04:13:40Z) - Memorize or Generalize? Evaluating LLM Code Generation with Code Rewriting [54.48306552577881]
We ask whether large language models (LLMs) are mostly memorizing (i.e., replicating or reusing large parts of their training data) rather than generalizing. Existing evaluations largely rely on task correctness as a proxy while neglecting surface/structural similarity, thereby conflating benign reuse of repeated code with harmful recall. We propose the Memorization Risk Index (MRI), a normalized score that combines two signals: (i) how similar the model's answer for the rewritten task is to the original ground-truth solution, and (ii) how much performance drops from the original task to its rewritten counterpart.
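The two signals the abstract names could be combined as in the following sketch. The paper's actual normalization is not given here, so the equal-weight average below is purely an assumption for illustration.

```python
# Hypothetical sketch of a memorization score in the spirit of the
# Memorization Risk Index (MRI) described above: high similarity to the
# original ground truth on the rewritten task, plus a large performance
# drop after rewriting, both point toward memorization.

def memorization_risk(sim_to_original, acc_original, acc_rewritten):
    """Combine (i) similarity of the rewritten-task answer to the original
    ground-truth solution and (ii) the original-to-rewritten accuracy drop,
    each clipped to [0, 1], into one normalized score."""
    clip = lambda x: max(0.0, min(1.0, x))
    drop = clip(acc_original - acc_rewritten)
    return 0.5 * (clip(sim_to_original) + drop)
```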
arXiv Detail & Related papers (2025-03-04T05:39:24Z) - Assessing Episodic Memory in LLMs with Sequence Order Recall Tasks [42.22616978679253]
We introduce Sequence Order Recall Tasks (SORT), which we adapt from tasks used to study episodic memory in cognitive psychology.
SORT requires LLMs to recall the correct order of text segments, and provides a general framework that is both easily extendable and does not require any additional annotations.
Based on a human experiment with 155 participants, we show that humans can recall sequence order based on long-term memory of a book.
arXiv Detail & Related papers (2024-10-10T17:17:38Z) - What can Large Language Models Capture about Code Functional Equivalence? [24.178831487657945]
We introduce SeqCoBench, a benchmark for assessing how Code-LLMs can capture code functional equivalence. We conduct evaluations on state-of-the-art (Code)-LLMs to see if they can discern semantically equivalent or different pairs of programs in SeqCoBench.
arXiv Detail & Related papers (2024-08-20T11:19:06Z) - Unveiling Factual Recall Behaviors of Large Language Models through Knowledge Neurons [13.266817091775042]
We investigate whether Large Language Models (LLMs) actively recall or retrieve their internal repositories of factual knowledge when faced with reasoning tasks.
We reveal that LLMs fail to harness the critical factual associations under certain circumstances.
We assess the effect of Chain-of-Thought (CoT) prompting, a powerful technique for addressing complex reasoning tasks.
arXiv Detail & Related papers (2024-08-06T15:07:08Z) - Rethinking LLM Memorization through the Lens of Adversarial Compression [93.13830893086681]
Large language models (LLMs) trained on web-scale datasets raise substantial concerns regarding permissible data usage.
One major question is whether these models "memorize" all their training data or whether they integrate many data sources in a way more akin to how a human would learn and synthesize information.
We propose the Adversarial Compression Ratio (ACR) as a metric for assessing memorization in LLMs.
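A minimal sketch of the compression intuition behind ACR, assuming token counts as the unit and treating "eliciting prompt shorter than the target" (a ratio above 1) as the memorization criterion; both are assumptions based only on the abstract.

```python
# Sketch of the Adversarial Compression Ratio (ACR) idea: a string counts as
# memorized when some adversarial prompt shorter than the string itself
# reliably elicits it, i.e. the model acts as a compressor for that string.

def adversarial_compression_ratio(target_tokens, shortest_prompt_tokens):
    """ACR = |target| / |shortest eliciting prompt|, measured in tokens."""
    return len(target_tokens) / len(shortest_prompt_tokens)

def is_memorized(target_tokens, shortest_prompt_tokens):
    """Memorized iff the eliciting prompt is shorter than the target (ACR > 1)."""
    return adversarial_compression_ratio(target_tokens, shortest_prompt_tokens) > 1.0
```

Finding the shortest eliciting prompt is the hard part in practice; the sketch only shows how the ratio is formed once such a prompt is found.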
arXiv Detail & Related papers (2024-04-23T15:49:37Z) - Comments as Natural Logic Pivots: Improve Code Generation via Comment Perspective [85.48043537327258]
We propose MANGO (comMents As Natural loGic pivOts), including a comment contrastive training strategy and a corresponding logical comment decoding strategy.
Results indicate that MANGO significantly improves the code pass rate based on the strong baselines.
The robustness of the logical comment decoding strategy is notably higher than the Chain-of-thoughts prompting.
arXiv Detail & Related papers (2024-04-11T08:30:46Z) - Robust and Scalable Model Editing for Large Language Models [75.95623066605259]
We propose EREN (Edit models by REading Notes) to improve the scalability and robustness of LLM editing.
Unlike existing techniques, it can integrate knowledge from multiple edits, and correctly respond to syntactically similar but semantically unrelated inputs.
arXiv Detail & Related papers (2024-03-26T06:57:23Z) - Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs [65.2379940117181]
We introduce code prompting, a chain of prompts that transforms a natural language problem into code.
We find that code prompting exhibits a high-performance boost for multiple LLMs.
Our analysis of GPT-3.5 reveals that the code formatting of the input problem is essential for performance improvement.
arXiv Detail & Related papers (2024-01-18T15:32:24Z) - WatME: Towards Lossless Watermarking Through Lexical Redundancy [58.61972059246715]
This study assesses the impact of watermarking on different capabilities of large language models (LLMs) from a cognitive science lens.
We introduce Watermarking with Mutual Exclusion (WatME) to seamlessly integrate watermarks.
arXiv Detail & Related papers (2023-11-16T11:58:31Z) - Bridging Code Semantic and LLMs: Semantic Chain-of-Thought Prompting for Code Generation [22.219645213202178]
This paper proposes the "Semantic Chain-of-Thought" approach, named SeCoT, to introduce the semantic information of code.
We show that SeCoT achieves state-of-the-art performance, greatly improving the potential of large models for code generation.
arXiv Detail & Related papers (2023-10-16T05:09:58Z) - Large Language Models are In-Context Semantic Reasoners rather than Symbolic Reasoners [75.85554779782048]
Large Language Models (LLMs) have excited the natural language and machine learning community over recent years.
Despite numerous successful applications, the underlying mechanism of such in-context capabilities remains unclear.
In this work, we hypothesize that the learned semantics of language tokens do most of the heavy lifting during the reasoning process.
arXiv Detail & Related papers (2023-05-24T07:33:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.