Quantifying Memory Use in Reinforcement Learning with Temporal Range
- URL: http://arxiv.org/abs/2512.06204v1
- Date: Fri, 05 Dec 2025 22:58:09 GMT
- Title: Quantifying Memory Use in Reinforcement Learning with Temporal Range
- Authors: Rodney Lafuente-Mercado, Daniela Rus, T. Konstantin Rusch
- Abstract summary: Temporal Range is a model-agnostic metric that treats first-order sensitivities of multiple vector outputs across a temporal window to the input sequence as a temporal influence profile. We also report Temporal Range for a compact Long Expressive Memory (LEM) policy trained on the task, using it as a proxy readout of task-level memory.
- Score: 51.98491034847041
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How much does a trained RL policy actually use its past observations? We propose \emph{Temporal Range}, a model-agnostic metric that treats first-order sensitivities of multiple vector outputs across a temporal window to the input sequence as a temporal influence profile and summarizes it by the magnitude-weighted average lag. Temporal Range is computed via reverse-mode automatic differentiation from the Jacobian blocks $\partial y_s/\partial x_t\in\mathbb{R}^{c\times d}$ averaged over final timesteps $s\in\{t+1,\dots,T\}$ and is well-characterized in the linear setting by a small set of natural axioms. Across diagnostic and control tasks (POPGym; flicker/occlusion; Copy-$k$) and architectures (MLPs, RNNs, SSMs), Temporal Range (i) remains small in fully observed control, (ii) scales with the task's ground-truth lag in Copy-$k$, and (iii) aligns with the minimum history window required for near-optimal return as confirmed by window ablations. We also report Temporal Range for a compact Long Expressive Memory (LEM) policy trained on the task, using it as a proxy readout of task-level memory. Our axiomatic treatment draws on recent work on range measures, specialized here to temporal lag and extended to vector-valued outputs in the RL setting. Temporal Range thus offers a practical per-sequence readout of memory dependence for comparing agents and environments and for selecting the shortest sufficient context.
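The computation the abstract describes can be sketched with reverse-mode autodiff. The following is a minimal, hypothetical JAX implementation (not the authors' code): it forms the Jacobian blocks $\partial y_s/\partial x_t$, uses their Frobenius norms as the temporal influence profile, and returns the magnitude-weighted average lag. For simplicity the sketch includes lag 0 in the weighting, so a memoryless policy receives a Temporal Range of 0.

```python
import jax
import jax.numpy as jnp

def temporal_range(policy_fn, xs):
    """Magnitude-weighted average lag of output-input sensitivities (sketch).

    policy_fn: maps an input sequence (T, d) to an output sequence (T, c).
    xs:        input sequence, shape (T, d).
    """
    T = xs.shape[0]
    # Jacobian of every output step w.r.t. every input step: shape (T, c, T, d)
    J = jax.jacrev(policy_fn)(xs)
    # Frobenius norm of each block dy_s/dx_t -> influence matrix, shape (T, T)
    influence = jnp.sqrt(jnp.sum(J ** 2, axis=(1, 3)))
    s_idx = jnp.arange(T)[:, None]
    t_idx = jnp.arange(T)[None, :]
    lags = s_idx - t_idx
    # Keep only causal blocks (inputs at or before the output step)
    w = jnp.where(lags >= 0, influence, 0.0)
    # Magnitude-weighted average lag over the whole sequence
    return jnp.sum(w * lags) / jnp.sum(w)
```

On a pure delay-$k$ policy ($y_s = x_{s-k}$) this returns $k$, matching the Copy-$k$ behavior described in the abstract, and a memoryless policy returns 0.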
Related papers
- Minimax Optimal Strategy for Delayed Observations in Online Reinforcement Learning [8.140056861479176]
We study reinforcement learning with delayed state observation, where the agent observes the current state after some random number of time steps. We propose an algorithm that combines the augmentation method and the upper confidence bound approach.
arXiv Detail & Related papers (2026-03-03T19:52:24Z) - Kad: A Framework for Proxy-based Test-time Alignment with Knapsack Approximation Deferral [6.949966663998242]
Large language models (LLMs) still require further alignment to adhere to downstream task requirements and stylistic preferences. As LLMs continue to scale in size, the computational cost of alignment procedures increases prohibitively. We propose a novel approach to circumvent these costs via proxy-based test-time alignment.
arXiv Detail & Related papers (2025-10-30T21:38:45Z) - LLM Serving Optimization with Variable Prefill and Decode Lengths [6.937936394246354]
We study the problem of serving LLM (Large Language Model) requests where each request has heterogeneous prefill and decode lengths. We show that this problem is NP-hard due to the interplay of placement constraints, precedence relationships, and linearly increasing memory usage. We propose a novel algorithm based on a new selection metric that efficiently forms batches over time.
arXiv Detail & Related papers (2025-08-08T08:54:21Z) - LaMPE: Length-aware Multi-grained Positional Encoding for Adaptive Long-context Scaling Without Training [45.74983991122073]
Large language models (LLMs) experience significant performance degradation when the input exceeds the pretraining context window. Recent studies mitigate this problem by remapping OOD positions into the in-distribution range with fixed mapping strategies. We propose Length-aware Multi-grained Positional Scaling (LaMPE), a training-free method that fully utilizes the model's effective context window.
arXiv Detail & Related papers (2025-08-04T11:22:13Z) - FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of Large Language Models [49.397861654088636]
We propose a two-step procedure to approximate SVD/QR-based gradient projections into lower-dimensional spaces. We show that our strategy achieves faster runtime and reduced memory usage by up to $25\%$ across different model sizes.
arXiv Detail & Related papers (2025-05-23T14:37:00Z) - Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach [51.76826149868971]
Policy evaluation via Monte Carlo simulation is at the core of many MC Reinforcement Learning (RL) algorithms.
We propose as a quality index a surrogate of the mean squared error of a return estimator that uses trajectories of different lengths.
We present an adaptive algorithm called Robust and Iterative Data collection strategy Optimization (RIDO)
arXiv Detail & Related papers (2024-10-17T11:47:56Z) - DASA: Delay-Adaptive Multi-Agent Stochastic Approximation [64.32538247395627]
We consider a setting in which $N$ agents aim to speed up a common Stochastic Approximation problem by acting in parallel and communicating with a central server.
To mitigate the effect of delays and stragglers, we propose \texttt{DASA}, a Delay-Adaptive algorithm for multi-agent Stochastic Approximation.
arXiv Detail & Related papers (2024-03-25T22:49:56Z) - HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises of two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one-billion-parameter model, HiRE applied to both the softmax and feedforward layers achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device.
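The two components above can be sketched as a generic two-stage approximate top-$k$. This is a hypothetical illustration using a plain low-rank projection `P`, not HiRE's actual compression scheme: a cheap compressed score selects an oversampled candidate set for high recall, and the exact computation is restricted to that subset.

```python
import numpy as np

def two_stage_topk(x, W, P, k, candidate_mult=4):
    """Generic two-stage approximate top-k (illustrative sketch; P is an
    assumed low-rank projection, not HiRE's learned compression).

    x: (d,) activation; W: (V, d) output weights; P: (r, d) with r << d.
    """
    # Stage 1: cheap approximate logits via the compressed weights.
    approx = (W @ P.T) @ (P @ x)
    # Oversample candidates so the true top-k survives with high recall.
    m = min(candidate_mult * k, W.shape[0])
    cand = np.argpartition(approx, -m)[-m:]
    # Stage 2: exact logits computed only on the candidate subset.
    exact = W[cand] @ x
    top = cand[np.argsort(exact)[-k:]]
    return top, W[top] @ x
```

With a near-orthogonal $P$ the first stage approximates $Wx$ up to projection error; `candidate_mult` trades recall against the size of the exact recomputation.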
arXiv Detail & Related papers (2024-02-14T18:04:36Z) - Provably Efficient CVaR RL in Low-rank MDPs [58.58570425202862]
We study risk-sensitive Reinforcement Learning (RL).
We propose a novel Upper Confidence Bound (UCB) bonus-driven algorithm to balance the interplay between exploration, exploitation, and representation learning in CVaR RL.
We prove that our algorithm achieves an $\epsilon$-optimal CVaR with sample complexity polynomial in $H$, $A$, and $d$, where $H$ is the length of each episode, $A$ is the capacity of the action space, and $d$ is the dimension of representations.
arXiv Detail & Related papers (2023-11-20T17:44:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.