Correctness Isn't Efficiency: Runtime Memory Divergence in LLM-Generated Code
- URL: http://arxiv.org/abs/2601.01215v1
- Date: Sat, 03 Jan 2026 15:42:21 GMT
- Title: Correctness Isn't Efficiency: Runtime Memory Divergence in LLM-Generated Code
- Authors: Prateek Rajput, Yewei Song, Abdoul Aziz Bonkoungou, Iyiola E. Olatunji, Abdoul Kader Kabore, Jacques Klein, Tegawendé F. Bissyandé
- Abstract summary: Large language models (LLMs) can generate programs that pass unit tests, but passing tests does not guarantee reliable runtime behavior. We find that different correct solutions to the same task can show very different memory and performance patterns, which can lead to hidden operational risks. We present a framework to measure execution-time memory stability across multiple correct generations.
- Score: 10.464512010462789
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models (LLMs) can generate programs that pass unit tests, but passing tests does not guarantee reliable runtime behavior. We find that different correct solutions to the same task can show very different memory and performance patterns, which can lead to hidden operational risks. We present a framework to measure execution-time memory stability across multiple correct generations. At the solution level, we introduce Dynamic Mean Pairwise Distance (DMPD), which uses Dynamic Time Warping to compare the shapes of memory-usage traces after converting them into Monotonic Peak Profiles (MPPs) to reduce transient noise. Aggregating DMPD across tasks yields a model-level Model Instability Score (MIS). Experiments on BigOBench and CodeContests show substantial runtime divergence among correct solutions. Instability often increases with higher sampling temperature even when pass@1 improves. We also observe correlations between our stability measures and software engineering indicators such as cognitive and cyclomatic complexity, suggesting links between operational behavior and maintainability. Our results support stability-aware selection among passing candidates in CI/CD to reduce operational risk without sacrificing correctness. Artifacts are available.
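The abstract's pipeline (raw memory trace → Monotonic Peak Profile → pairwise Dynamic Time Warping → DMPD) can be illustrated with a minimal sketch. This is not the authors' released artifact: the function names, the running-maximum interpretation of the MPP, and the unnormalized absolute-difference DTW cost are assumptions made for illustration.

```python
import numpy as np

def monotonic_peak_profile(trace):
    # Running-maximum envelope of a memory-usage trace; assumed here
    # as a way to suppress transient dips, per the MPP idea in the abstract.
    return np.maximum.accumulate(np.asarray(trace, dtype=float))

def dtw_distance(a, b):
    # Classic O(n*m) dynamic-time-warping distance between two profiles.
    n, m = len(a), len(b)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return d[n, m]

def dynamic_mean_pairwise_distance(traces):
    # Solution-level DMPD sketch: mean DTW distance over all pairs of
    # MPP-smoothed traces from correct generations of the same task.
    profiles = [monotonic_peak_profile(t) for t in traces]
    dists = [dtw_distance(profiles[i], profiles[j])
             for i in range(len(profiles))
             for j in range(i + 1, len(profiles))]
    return float(np.mean(dists)) if dists else 0.0
```

Under this reading, a task whose correct solutions all produce similar peak-memory envelopes gets a DMPD near zero, while divergent allocation patterns inflate it; averaging DMPD across tasks would then give the model-level MIS.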
Related papers
- ProtoDCS: Towards Robust and Efficient Open-Set Test-Time Adaptation for Vision-Language Models [32.840734752367275]
Prototype-based Double-Check Separation (ProtoDCS) is a robust framework for OSTTA. It separates csID and csOOD samples, enabling safe and efficient adaptation of Vision-Language Models to csID data. ProtoDCS significantly boosts both known-class accuracy and OOD detection metrics.
arXiv Detail & Related papers (2026-02-27T03:39:02Z) - Process-Tensor Tomography of SGD: Measuring Non-Markovian Memory via Back-Flow of Distinguishability [1.078600700827543]
We build a simple, model-agnostic witness of training memory based on back-flow of distinguishability. We observe consistent positive back-flow with tight bootstrap confidence intervals, with amplification under higher momentum and more micro-steps. We position this as a principled diagnostic and empirical evidence that practical SGD deviates from the Markov idealization.
arXiv Detail & Related papers (2026-01-23T09:03:25Z) - Gated Differentiable Working Memory for Long-Context Language Modeling [80.27483324685434]
We propose Gdwm (Gated Differentiable Working Memory), a framework that introduces a write controller to gate the consolidation process. Experiments on ZeroSCROLLS and LongBench v2 demonstrate that Gdwm achieves comparable or superior performance with 4× fewer gradient steps than uniform baselines.
arXiv Detail & Related papers (2026-01-19T10:00:33Z) - ET-Agent: Incentivizing Effective Tool-Integrated Reasoning Agent via Behavior Calibration [68.89572566071575]
ET-Agent is a training framework for calibrating an agent's tool-use behavior. It is designed to progressively calibrate erroneous behavioral patterns toward optimal behaviors.
arXiv Detail & Related papers (2026-01-11T11:05:26Z) - Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach [78.4812458793128]
We propose TACO, a test-time-scaling framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and, being gradient-free, it yields significant computational benefits.
arXiv Detail & Related papers (2025-12-02T14:42:54Z) - Dynamic Stability of LLM-Generated Code [6.120340803716395]
Current evaluations of LLMs for code generation overlook the fact that functionally correct solutions can differ significantly in algorithmic complexity. We introduce a principled framework for evaluating the dynamic stability of generated code. Our findings call for stability-aware objectives in code generation and new benchmarks with test cases for robust, real-world evaluation.
arXiv Detail & Related papers (2025-11-07T09:58:06Z) - ATTS: Asynchronous Test-Time Scaling via Conformal Prediction [112.54016379556073]
Large language models (LLMs) benefit from test-time scaling but are often hampered by high inference latency. We introduce ATTS (Asynchronous Test-Time Scaling), a statistically guaranteed adaptive scaling framework. We show that ATTS delivers up to a 56.7x speedup in test-time scaling and a 4.14x throughput improvement.
arXiv Detail & Related papers (2025-09-18T16:55:09Z) - MINGLE: Mixture of Null-Space Gated Low-Rank Experts for Test-Time Continual Model Merging [29.58798660724693]
Continual model merging integrates independently fine-tuned models sequentially without access to the original training data. We propose MINGLE, a novel framework for Test-Time Continual Model Merging. MINGLE achieves robust generalization, significantly reduces forgetting, and consistently surpasses previous state-of-the-art methods by 7-9% on average.
arXiv Detail & Related papers (2025-05-17T07:24:22Z) - CoDynTrust: Robust Asynchronous Collaborative Perception via Dynamic Feature Trust Modulus [9.552300496606644]
Collaborative perception, which fuses information from multiple agents, can extend perception range and thereby improve performance. However, temporal asynchrony in real-world environments, caused by communication delays, clock misalignment, or sampling configuration differences, can lead to information mismatches. We propose CoDynTrust, an uncertainty-encoded asynchronous fusion perception framework that is robust to the information mismatches caused by temporal asynchrony.
arXiv Detail & Related papers (2025-02-12T07:23:26Z) - Large Continual Instruction Assistant [59.585544987096974]
Continual Instruction Tuning (CIT) is adopted to instruct Large Models to follow human intent, data by data. Existing gradient updates heavily degrade performance on previous datasets during the CIT process. We propose a general continual instruction tuning framework to address this challenge.
arXiv Detail & Related papers (2024-10-08T11:24:59Z) - Bidirectional Decoding: Improving Action Chunking via Guided Test-Time Sampling [51.38330727868982]
We show how action chunking impacts the divergence between a learner and a demonstrator. We propose Bidirectional Decoding (BID), a test-time inference algorithm that bridges action chunking with closed-loop adaptation. Our method boosts the performance of two state-of-the-art generative policies across seven simulation benchmarks and two real-world tasks.
arXiv Detail & Related papers (2024-08-30T15:39:34Z) - AR-TTA: A Simple Method for Real-World Continual Test-Time Adaptation [1.4530711901349282]
We propose to validate test-time adaptation methods using datasets for autonomous driving, namely CLAD-C and SHIFT.
We observe that current test-time adaptation methods struggle to effectively handle varying degrees of domain shift.
We enhance the well-established self-training framework by incorporating a small memory buffer to increase model stability.
arXiv Detail & Related papers (2023-09-18T19:34:23Z) - Uncertainty Guided Adaptive Warping for Robust and Efficient Stereo Matching [77.133400999703]
Correlation based stereo matching has achieved outstanding performance.
Current methods with a fixed model do not work uniformly well across various datasets.
This paper proposes a new perspective to dynamically calculate correlation for robust stereo matching.
arXiv Detail & Related papers (2023-07-26T09:47:37Z) - Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models [80.23791222509644]
Inconsistent AI models are considered brittle and untrustworthy by human users.
We find that state-of-the-art vision-language models suffer from a surprisingly high degree of inconsistent behavior across tasks.
We propose a rank correlation-based auxiliary training objective, computed over large automatically created cross-task contrast sets.
arXiv Detail & Related papers (2023-03-28T16:57:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.