Quantifying Memory Use in Reinforcement Learning with Temporal Range
- URL: http://arxiv.org/abs/2512.06204v1
- Date: Fri, 05 Dec 2025 22:58:09 GMT
- Title: Quantifying Memory Use in Reinforcement Learning with Temporal Range
- Authors: Rodney Lafuente-Mercado, Daniela Rus, T. Konstantin Rusch
- Abstract summary: Temporal Range is a model-agnostic metric that treats first-order sensitivities of multiple vector outputs across a temporal window to the input sequence as a temporal influence profile. We also report Temporal Range for a compact Long Expressive Memory (LEM) policy trained on the task, using it as a proxy readout of task-level memory.
- Score: 51.98491034847041
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How much does a trained RL policy actually use its past observations? We propose \emph{Temporal Range}, a model-agnostic metric that treats first-order sensitivities of multiple vector outputs across a temporal window to the input sequence as a temporal influence profile and summarizes it by the magnitude-weighted average lag. Temporal Range is computed via reverse-mode automatic differentiation from the Jacobian blocks $\partial y_s/\partial x_t\in\mathbb{R}^{c\times d}$ averaged over final timesteps $s\in\{t+1,\dots,T\}$ and is well-characterized in the linear setting by a small set of natural axioms. Across diagnostic and control tasks (POPGym; flicker/occlusion; Copy-$k$) and architectures (MLPs, RNNs, SSMs), Temporal Range (i) remains small in fully observed control, (ii) scales with the task's ground-truth lag in Copy-$k$, and (iii) aligns with the minimum history window required for near-optimal return as confirmed by window ablations. We also report Temporal Range for a compact Long Expressive Memory (LEM) policy trained on the task, using it as a proxy readout of task-level memory. Our axiomatic treatment draws on recent work on range measures, specialized here to temporal lag and extended to vector-valued outputs in the RL setting. Temporal Range thus offers a practical per-sequence readout of memory dependence for comparing agents and environments and for selecting the shortest sufficient context.
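The computation the abstract describes can be sketched with reverse-mode autodiff. The following is a minimal, hypothetical JAX implementation (not the authors' code): it forms the Jacobian blocks $\partial y_s/\partial x_t$, uses their Frobenius norms as the temporal influence profile, and returns the magnitude-weighted average lag. For simplicity the sketch includes lag 0 in the weighting, so a memoryless policy receives a Temporal Range of 0.

```python
import jax
import jax.numpy as jnp

def temporal_range(policy_fn, xs):
    """Magnitude-weighted average lag of output-input sensitivities (sketch).

    policy_fn: maps an input sequence (T, d) to an output sequence (T, c).
    xs:        input sequence, shape (T, d).
    """
    T = xs.shape[0]
    # Jacobian of every output step w.r.t. every input step: shape (T, c, T, d)
    J = jax.jacrev(policy_fn)(xs)
    # Frobenius norm of each block dy_s/dx_t -> influence matrix, shape (T, T)
    influence = jnp.sqrt(jnp.sum(J ** 2, axis=(1, 3)))
    s_idx = jnp.arange(T)[:, None]
    t_idx = jnp.arange(T)[None, :]
    lags = s_idx - t_idx
    # Keep only causal blocks (inputs at or before the output step)
    w = jnp.where(lags >= 0, influence, 0.0)
    # Magnitude-weighted average lag over the whole sequence
    return jnp.sum(w * lags) / jnp.sum(w)
```

On a pure delay-$k$ policy ($y_s = x_{s-k}$) this returns $k$, matching the Copy-$k$ behavior described in the abstract, and a memoryless policy returns 0.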
Related papers
- Minimax Optimal Strategy for Delayed Observations in Online Reinforcement Learning [8.140056861479176]
We study reinforcement learning with delayed state observation, where the agent observes the current state after some random number of time steps. We propose an algorithm that combines the augmentation method and the upper confidence bound approach.
arXiv Detail & Related papers (2026-03-03T19:52:24Z) - Kad: A Framework for Proxy-based Test-time Alignment with Knapsack Approximation Deferral [6.949966663998242]
Large language models (LLMs) still require further alignment to adhere to downstream task requirements and stylistic preferences. As LLMs continue to scale in size, the computational cost of alignment procedures increases prohibitively. We propose a novel approach to circumvent these costs via proxy-based test-time alignment.
arXiv Detail & Related papers (2025-10-30T21:38:45Z) - LLM Serving Optimization with Variable Prefill and Decode Lengths [6.937936394246354]
We study the problem of serving LLM (Large Language Model) requests where each request has heterogeneous prefill and decode lengths. We show that this problem is NP-hard due to the interplay of placement constraints, precedence relationships, and linearly increasing memory usage. We propose a novel algorithm based on a new selection metric that efficiently forms batches over time.
arXiv Detail & Related papers (2025-08-08T08:54:21Z) - LaMPE: Length-aware Multi-grained Positional Encoding for Adaptive Long-context Scaling Without Training [45.74983991122073]
Large language models (LLMs) experience significant performance degradation when the input exceeds the pretraining context window. Recent studies mitigate this problem by remapping OOD positions into the in-distribution range with fixed mapping strategies. We propose Length-aware Multi-grained Positional Scaling (LaMPE), a training-free method that fully utilizes the model's effective context window.
arXiv Detail & Related papers (2025-08-04T11:22:13Z) - FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of Large Language Models [49.397861654088636]
We propose a two-step procedure to approximate SVD/QR-based gradient projections into lower-dimensional spaces. We show that our strategy achieves faster runtime and reduced memory usage by up to $25\%$ across different model sizes.
arXiv Detail & Related papers (2025-05-23T14:37:00Z) - Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach [51.76826149868971]
Policy evaluation via Monte Carlo simulation is at the core of many MC Reinforcement Learning (RL) algorithms.
We propose as a quality index a surrogate of the mean squared error of a return estimator that uses trajectories of different lengths.
We present an adaptive algorithm called Robust and Iterative Data collection strategy Optimization (RIDO)
arXiv Detail & Related papers (2024-10-17T11:47:56Z) - DASA: Delay-Adaptive Multi-Agent Stochastic Approximation [64.32538247395627]
We consider a setting in which $N$ agents aim to speed up a common Stochastic Approximation problem by acting in parallel and communicating with a central server.
To mitigate the effect of delays and stragglers, we propose \texttt{DASA}, a Delay-Adaptive algorithm for multi-agent Stochastic Approximation.
arXiv Detail & Related papers (2024-03-25T22:49:56Z) - HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises of two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one-billion-parameter model, HiRE applied to both the softmax and feedforward layers achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device.
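The two components above can be sketched as a generic two-stage approximate top-$k$. This is a hypothetical illustration using a plain low-rank projection `P`, not HiRE's actual compression scheme: a cheap compressed score selects an oversampled candidate set for high recall, and the exact computation is restricted to that subset.

```python
import numpy as np

def two_stage_topk(x, W, P, k, candidate_mult=4):
    """Generic two-stage approximate top-k (illustrative sketch; P is an
    assumed low-rank projection, not HiRE's learned compression).

    x: (d,) activation; W: (V, d) output weights; P: (r, d) with r << d.
    """
    # Stage 1: cheap approximate logits via the compressed weights.
    approx = (W @ P.T) @ (P @ x)
    # Oversample candidates so the true top-k survives with high recall.
    m = min(candidate_mult * k, W.shape[0])
    cand = np.argpartition(approx, -m)[-m:]
    # Stage 2: exact logits computed only on the candidate subset.
    exact = W[cand] @ x
    top = cand[np.argsort(exact)[-k:]]
    return top, W[top] @ x
```

With a near-orthogonal $P$ the first stage approximates $Wx$ up to projection error; `candidate_mult` trades recall against the size of the exact recomputation.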
arXiv Detail & Related papers (2024-02-14T18:04:36Z) - Provably Efficient CVaR RL in Low-rank MDPs [58.58570425202862]
We study risk-sensitive Reinforcement Learning (RL).
We propose a novel Upper Confidence Bound (UCB) bonus-driven algorithm to balance the interplay between exploration, exploitation, and representation learning in CVaR RL.
We prove that our algorithm achieves an $\epsilon$-optimal CVaR with sample complexity polynomial in $H$, $A$, and $d$, where $H$ is the length of each episode, $A$ is the capacity of the action space, and $d$ is the dimension of representations.
arXiv Detail & Related papers (2023-11-20T17:44:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.