Online Learning for Multi-Layer Hierarchical Inference under Partial and Policy-Dependent Feedback
- URL: http://arxiv.org/abs/2603.04247v1
- Date: Wed, 04 Mar 2026 16:35:33 GMT
- Title: Online Learning for Multi-Layer Hierarchical Inference under Partial and Policy-Dependent Feedback
- Authors: Haoran Zhang, Seohyeon Cha, Hasan Burhan Beytur, Kevin S Chan, Gustavo de Veciana, Haris Vikalo
- Abstract summary: We study online routing for hierarchical inference under long-term resource constraints and terminal-only feedback. We develop a variance-reduced EXP4-based algorithm integrated with Lyapunov optimization, yielding unbiased loss estimation and stable learning under sparse and policy-dependent feedback.
- Score: 22.44021085629083
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hierarchical inference systems route tasks across multiple computational layers, where each node may either finalize a prediction locally or offload the task to a node in the next layer for further processing. Learning optimal routing policies in such systems is challenging: inference loss is defined recursively across layers, while feedback on prediction error is revealed only at a terminal oracle layer. This induces a partial, policy-dependent feedback structure in which observability probabilities decay with depth, causing importance-weighted estimators to suffer from amplified variance. We study online routing for multi-layer hierarchical inference under long-term resource constraints and terminal-only feedback. We formalize the recursive loss structure and show that naive importance-weighted contextual bandit methods become unstable as feedback probability decays along the hierarchy. To address this, we develop a variance-reduced EXP4-based algorithm integrated with Lyapunov optimization, yielding unbiased loss estimation and stable learning under sparse and policy-dependent feedback. We provide regret guarantees relative to the best fixed routing policy in hindsight and establish near-optimality under stochastic arrivals and resource constraints. Experiments on large-scale multi-task workloads demonstrate improved stability and performance compared to standard importance-weighted approaches.
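The variance amplification the abstract describes can be illustrated with a minimal sketch (all names and numbers here are illustrative, not taken from the paper): an inverse-propensity loss estimator remains unbiased as the terminal-feedback probability p decays with depth, but its variance grows roughly as 1/p, which is what destabilizes naive importance-weighted methods.

```python
# Illustrative sketch: an importance-weighted (inverse-propensity) loss
# estimator under terminal feedback observed with probability p.
# The estimator returns loss/p when feedback arrives, else 0, so its
# expectation equals the true loss, but its variance scales like 1/p.
import random

def iw_estimates(true_loss, p, n_samples, rng):
    """Unbiased IW estimates: true_loss / p when feedback is observed, else 0."""
    return [true_loss / p if rng.random() < p else 0.0 for _ in range(n_samples)]

rng = random.Random(0)
true_loss = 0.5
# Hypothetical observability probabilities decaying with layer depth.
for depth, p in enumerate([0.8, 0.4, 0.1, 0.02], start=1):
    est = iw_estimates(true_loss, p, 100_000, rng)
    mean = sum(est) / len(est)
    var = sum((x - mean) ** 2 for x in est) / len(est)
    # Mean stays near true_loss (unbiased); variance blows up as p shrinks.
    print(f"depth={depth}  p={p:.2f}  mean={mean:.3f}  var={var:.3f}")
```

For this two-point estimator the variance is exactly L²(1−p)/p, so halving the observability probability roughly doubles the variance; this is the instability that the paper's variance-reduced estimator is designed to avoid.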
Related papers
- Expected Return Causes Outcome-Level Mode Collapse in Reinforcement Learning and How to Fix It with Inverse Probability Scaling [2.8825582215429186]
We show that outcome-level mode collapse is a structural consequence of the expected-return objective itself. We propose a minimal correction: inverse probability scaling, which removes outcome-frequency weighting from the learning signal.
arXiv Detail & Related papers (2026-01-29T13:03:33Z) - Decentralized Learning Strategies for Estimation Error Minimization with Graph Neural Networks [86.99017195607077]
We address real-time sampling and estimation of autoregressive Markovian sources in wireless networks. We propose a graphical reinforcement learning framework for policy optimization. Theoretically, our proposed policies are transferable, allowing a policy trained on one graph to be effectively applied to structurally similar graphs.
arXiv Detail & Related papers (2026-01-19T02:18:45Z) - Latent Chain-of-Thought for Visual Reasoning [53.541579327424046]
Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). We reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. We empirically demonstrate that the proposed method enhances the state-of-the-art LVLMs on seven reasoning benchmarks.
arXiv Detail & Related papers (2025-10-27T23:10:06Z) - Joint Admission Control and Resource Allocation of Virtual Network Embedding via Hierarchical Deep Reinforcement Learning [69.00997996453842]
We propose a deep Reinforcement Learning approach to learn a joint Admission Control and Resource Allocation policy for virtual network embedding.
We show that HRL-ACRA outperforms state-of-the-art baselines in terms of both the acceptance ratio and long-term average revenue.
arXiv Detail & Related papers (2024-06-25T07:42:30Z) - Decentralized Learning Strategies for Estimation Error Minimization with Graph Neural Networks [86.99017195607077]
We address the challenge of sampling and remote estimation for autoregressive Markovian processes in a wireless network with statistically identical agents. Our goal is to minimize time-average estimation error and/or age of information with decentralized, scalable sampling and transmission policies.
arXiv Detail & Related papers (2024-04-04T06:24:11Z) - Robust Stochastically-Descending Unrolled Networks [85.6993263983062]
Deep unrolling is an emerging learning-to-optimize method that unrolls a truncated iterative algorithm in the layers of a trainable neural network. We show that convergence guarantees and generalizability of the unrolled networks are still open theoretical problems. We numerically assess unrolled architectures trained under the proposed constraints in two different applications.
arXiv Detail & Related papers (2023-12-25T18:51:23Z) - Bi-Level Offline Policy Optimization with Limited Exploration [1.8130068086063336]
We study offline reinforcement learning (RL) which seeks to learn a good policy based on a fixed, pre-collected dataset.
We propose a bi-level structured policy optimization algorithm that models a hierarchical interaction between the policy (upper-level) and the value function (lower-level).
We evaluate our model using a blend of synthetic, benchmark, and real-world datasets for offline RL, showing that it performs competitively with state-of-the-art methods.
arXiv Detail & Related papers (2023-10-10T02:45:50Z) - On Finite-Sample Analysis of Offline Reinforcement Learning with Deep ReLU Networks [46.067702683141356]
We study the statistical theory of offline reinforcement learning with deep ReLU networks.
We quantify how the distribution shift of the offline data, the dimension of the input space, and the regularity of the system control the OPE estimation error.
arXiv Detail & Related papers (2021-03-11T14:01:14Z) - Reinforcement Learning for POMDP: Partitioned Rollout and Policy Iteration with Application to Autonomous Sequential Repair Problems [2.6389022766562236]
We consider infinite horizon discounted dynamic programming problems with finite state and control spaces, and partial state observations.
We discuss an algorithm that uses multistep lookahead, truncated rollout with a known base policy, and a terminal cost function approximation.
arXiv Detail & Related papers (2020-02-11T02:38:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.