Anchor-MoE: A Mean-Anchored Mixture of Experts For Probabilistic Regression
- URL: http://arxiv.org/abs/2508.16802v1
- Date: Fri, 22 Aug 2025 21:12:41 GMT
- Title: Anchor-MoE: A Mean-Anchored Mixture of Experts For Probabilistic Regression
- Authors: Baozhuo Su, Zhengxian Qu
- Abstract summary: We present an Anchored Mixture of Experts (Anchor-MoE) model that handles both probabilistic and point regression. We show that Anchor-MoE attains the minimax-optimal $L^2$ risk rate. It consistently matches or surpasses the strong NGBoost baseline in RMSE and NLL.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Regression under uncertainty is fundamental across science and engineering. We present an Anchored Mixture of Experts (Anchor-MoE), a model that handles both probabilistic and point regression. For simplicity, we use a tuned gradient-boosting model to furnish the anchor mean; however, any off-the-shelf point regressor can serve as the anchor. The anchor prediction is projected into a latent space, where a learnable metric-window kernel scores locality and a soft router dispatches each sample to a small set of mixture-density-network experts; the experts produce a heteroscedastic correction and predictive variance. We train by minimizing negative log-likelihood, and on a disjoint calibration split fit a post-hoc linear map on predicted means to improve point accuracy. On the theory side, assuming a H\"older smooth regression function of order~$\alpha$ and fixed Lipschitz partition-of-unity weights with bounded overlap, we show that Anchor-MoE attains the minimax-optimal $L^2$ risk rate $O\!\big(N^{-2\alpha/(2\alpha+d)}\big)$. In addition, the CRPS test generalization gap scales as $\widetilde{O}\!\Big(\sqrt{(\log(Mh)+P+K)/N}\Big)$; it is logarithmic in $Mh$ and scales as the square root in $P$ and $K$. Under bounded-overlap routing, $K$ can be replaced by $k$, and any dependence on a latent dimension is absorbed into $P$. Under uniformly bounded means and variances, an analogous $\widetilde{O}\!\big(\sqrt{(\log(Mh)+P+K)/N}\big)$ scaling holds for the test NLL up to constants. Empirically, across standard UCI regressions, Anchor-MoE consistently matches or surpasses the strong NGBoost baseline in RMSE and NLL; on several datasets it achieves new state-of-the-art probabilistic regression results on our benchmark suite. Code is available at https://github.com/BaozhuoSU/Probabilistic_Regression.
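The abstract describes a concrete pipeline: an off-the-shelf anchor regressor supplies a mean, a kernel-based soft router over a latent space dispatches samples to mixture-density-network experts, and the experts emit a heteroscedastic correction and variance trained by NLL. Below is a minimal, hypothetical PyTorch sketch of that pipeline. All class names, layer sizes, and the simplified single-Gaussian experts are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class AnchorMoE(nn.Module):
    """Toy Anchor-MoE: anchored mean + soft-routed MDN-style experts."""

    def __init__(self, in_dim: int, latent_dim: int = 16, n_experts: int = 4):
        super().__init__()
        # Project [features, anchor mean] into the latent routing space.
        self.proj = nn.Linear(in_dim + 1, latent_dim)
        # Learnable centers/bandwidths play the role of a metric-window kernel.
        self.centers = nn.Parameter(torch.randn(n_experts, latent_dim))
        self.log_bw = nn.Parameter(torch.zeros(n_experts))
        # Each expert emits a correction to the anchor and a log-variance.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 2))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor, anchor_mean: torch.Tensor):
        z = self.proj(torch.cat([x, anchor_mean[:, None]], dim=1))
        # Kernel locality scores -> soft routing weights over experts.
        d2 = ((z[:, None, :] - self.centers[None]) ** 2).sum(-1)    # (B, E)
        w = torch.softmax(-d2 * torch.exp(-self.log_bw), dim=1)
        out = torch.stack([e(z) for e in self.experts], dim=1)      # (B, E, 2)
        mean = anchor_mean + (w * out[..., 0]).sum(1)  # heteroscedastic correction
        var = (w * out[..., 1].exp()).sum(1) + 1e-6    # predictive variance
        return mean, var

def nll_loss(mean, var, y):
    # Gaussian negative log-likelihood, the training objective in the abstract.
    return 0.5 * (var.log() + (y - mean) ** 2 / var).mean()

# Usage sketch: the anchor mean may come from any off-the-shelf point
# regressor (the paper tunes a gradient-boosting model); a post-hoc linear
# map a*mean + b, fit on a disjoint calibration split, would then refine
# point accuracy as described in the abstract.
```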
Related papers
- Optimal Unconstrained Self-Distillation in Ridge Regression: Strict Improvements, Precise Asymptotics, and One-Shot Tuning [61.07540493350384]
Self-distillation (SD) is the process of retraining a student on a mixture of ground-truth labels and the teacher's own predictions. We show that, for any prediction risk, the optimally mixed student improves upon the ridge teacher at every regularization level. We propose a consistent one-shot tuning method to estimate the optimal mixing weight without grid search, sample splitting, or refitting.
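As a concrete illustration of the setup in this summary, here is a toy numpy sketch; the mixing weight `lam`, the regularization level, and the synthetic data are arbitrary stand-ins for the quantities the paper tunes in one shot.

```python
import numpy as np

def ridge(X, y, reg):
    # Closed-form ridge solution (X^T X + reg I)^{-1} X^T y.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X @ rng.normal(size=20) + rng.normal(size=200)

reg, lam = 1.0, 0.3                            # hypothetical choices
w_teacher = ridge(X, y, reg)                   # ridge teacher
y_mix = (1 - lam) * y + lam * (X @ w_teacher)  # mix labels with teacher preds
w_student = ridge(X, y_mix, reg)               # self-distilled student
```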
arXiv Detail & Related papers (2026-02-19T17:21:15Z)
- Stabilizing Fixed-Point Iteration for Markov Chain Poisson Equations [49.702772230127465]
We study finite-state Markov chains with $n$ states and transition matrix $P$. We show that all non-decaying modes are captured by a real peripheral invariant subspace $\mathcal{K}(P)$, and that the induced operator on the quotient space $\mathbb{R}^n/\mathcal{K}(P)$ is strictly contractive, yielding a unique quotient solution.
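To make the object concrete, here is a small numpy sketch of a centered fixed-point iteration for the Poisson equation $(I - P)h = r - \rho\mathbf{1}$, where $\rho$ is the stationary average reward; the centering step plays the role of the quotient construction in the summary. Convergence here assumes an aperiodic, irreducible chain, and the 5-state random chain is an arbitrary example, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.random((5, 5)); P /= P.sum(axis=1, keepdims=True)  # transition matrix
r = rng.random(5)                                          # one-step rewards

# Stationary distribution pi: left eigenvector of P at eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))]); pi /= pi.sum()
rho = pi @ r                                               # average reward

h = np.zeros(5)
for _ in range(2000):
    h = (r - rho) + P @ h   # fixed-point step for (I - P) h = r - rho*1
    h -= pi @ h             # project out the constant (peripheral) mode

assert np.allclose(h - P @ h, r - rho, atol=1e-8)
```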
arXiv Detail & Related papers (2026-01-31T02:57:01Z)
- Closed-form $\ell_r$ norm scaling with data for overparameterized linear regression and diagonal linear networks under $\ell_p$ bias [0.0]
We give a unified, high-probability characterization for the scaling of the family of parameter norms. We then study diagonal linear networks trained by gradient descent.
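A toy numpy illustration of the kind of quantity characterized above: the minimum-$\ell_2$-norm interpolator in overparameterized linear regression, whose $\ell_r$ norms can be tracked as the sample size $n$ grows. The dimensions and Gaussian data are arbitrary choices, and the diagonal-network part of the paper is not sketched.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 500                                   # ambient dimension, d > n throughout
for n in (25, 50, 100, 200):
    X = rng.normal(size=(n, d)) / np.sqrt(d)
    y = rng.normal(size=n)
    w = np.linalg.pinv(X) @ y             # minimum-l2-norm interpolator
    norms = {r: np.sum(np.abs(w) ** r) ** (1 / r) for r in (1, 2, 4)}
    print(n, {r: round(v, 3) for r, v in norms.items()})
```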
arXiv Detail & Related papers (2025-09-25T13:59:22Z)
- Data subsampling for Poisson regression with $p$th-root-link [53.63838219437508]
We develop and analyze data subsampling techniques for Poisson regression.
In particular, we consider the Poisson generalized linear model with identity- and square-root-link functions.
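A hedged sketch of the ingredients named in this entry: a Poisson GLM with a square-root link (mean $\mu = (x^\top w)^2$, which keeps $\mu$ nonnegative), fit by maximum likelihood on a uniform subsample. The paper analyzes more refined subsampling schemes; uniform sampling and all sizes below are simplifying assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
X = rng.normal(size=(5000, 5))
w_true = rng.normal(size=5)
y = rng.poisson((X @ w_true) ** 2)               # square-root-link model

idx = rng.choice(len(X), size=500, replace=False)  # uniform subsample
Xs, ys = X[idx], y[idx]

def nll(w):
    # Poisson NLL up to constants: sum(mu - y * log mu).
    mu = (Xs @ w) ** 2 + 1e-9
    return np.sum(mu - ys * np.log(mu))

w_hat = minimize(nll, x0=np.ones(5), method="L-BFGS-B").x
```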
arXiv Detail & Related papers (2024-10-30T10:09:05Z)
- Stable Minima Cannot Overfit in Univariate ReLU Networks: Generalization by Large Step Sizes [29.466981306355066]
We show that gradient descent with a fixed learning rate $\eta$ can only find local minima that represent smooth functions.
We also prove a nearly-optimal MSE bound of $\widetilde{O}(n^{-4/5})$ within the strict interior of the support of the $n$ data points.
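For orientation, a minimal numpy setup of the regime in this summary: a univariate two-layer ReLU network trained by full-batch gradient descent at a fixed step size. The toy data, width, and step size are arbitrary; the sketch only instantiates the dynamics the theory is about.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=32)                 # n = 32 univariate inputs
y = np.sin(3 * x) + 0.1 * rng.normal(size=32)   # arbitrary toy targets

m = 64                                          # hidden width
W = rng.normal(size=m)
b = rng.normal(size=m)
a = rng.normal(size=m) / m                      # small output layer

eta = 0.1                                       # fixed learning rate
for _ in range(5000):
    pre = np.outer(x, W) + b                    # (n, m) pre-activations
    h = np.maximum(pre, 0.0)
    g = 2 * (h @ a - y) / len(x)                # d(MSE)/d(prediction)
    gm = (pre > 0) * g[:, None]                 # gradient routed through ReLU
    a_new = a - eta * (h.T @ g)
    W -= eta * a * (gm * x[:, None]).sum(axis=0)
    b -= eta * a * gm.sum(axis=0)
    a = a_new
```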
arXiv Detail & Related papers (2024-06-10T22:57:27Z)
- Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit Feedback and Unknown Transition [71.33787410075577]
We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses.
We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability.
arXiv Detail & Related papers (2024-03-07T15:03:50Z)
- Robust Nonparametric Regression under Poisoning Attack [13.470899588917716]
An adversarial attacker can modify the values of up to $q$ samples from a training dataset of size $N$.
Our initial solution is an M-estimator based on Huber loss minimization.
The final estimate is nearly minimax optimal for arbitrary $q$, up to a $\ln N$ factor.
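The initial estimator mentioned above can be sketched compactly: a Huber-loss M-estimator (with a linear model here for brevity, rather than the paper's nonparametric setting) that caps the influence of the $q$ poisoned responses. The contamination pattern and the threshold $\delta = 1$ are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=300)
y[:30] += 50.0                                  # q = 30 poisoned responses

def huber(res, delta=1.0):
    # Quadratic near zero, linear in the tails: bounded influence.
    small = np.abs(res) <= delta
    return np.where(small, 0.5 * res**2, delta * (np.abs(res) - 0.5 * delta))

w_hat = minimize(lambda w: huber(y - X @ w).sum(), x0=np.zeros(3)).x
```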
arXiv Detail & Related papers (2023-05-26T09:33:17Z)
- $p$-Generalized Probit Regression and Scalable Maximum Likelihood Estimation via Sketching and Coresets [74.37849422071206]
We study the $p$-generalized probit regression model, which is a generalized linear model for binary responses.
We show how the maximum likelihood estimator for $p$-generalized probit regression can be approximated efficiently, up to a factor of $(1+\varepsilon)$, on large data.
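For reference, a sketch of the $p = 2$ special case: classical probit regression fit by maximum likelihood. The $p$-generalized model would replace the Gaussian CDF below with a $p$-generalized normal CDF, and the sketching/coreset machinery of the paper is not shown; the synthetic data are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(6)
X = rng.normal(size=(1000, 4))
beta_true = np.array([1.0, -1.0, 0.5, 0.0])
ybin = (rng.uniform(size=1000) < norm.cdf(X @ beta_true)).astype(float)

def nll(beta):
    # Bernoulli NLL with probit (Gaussian CDF) success probabilities.
    p = np.clip(norm.cdf(X @ beta), 1e-12, 1 - 1e-12)
    return -(ybin * np.log(p) + (1 - ybin) * np.log(1 - p)).sum()

beta_hat = minimize(nll, x0=np.zeros(4)).x
```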
arXiv Detail & Related papers (2022-03-25T10:54:41Z)
- Consistent Estimation for PCA and Sparse Regression with Oblivious Outliers [13.244654316770815]
We develop machinery to design efficiently computable and consistent estimators.
For sparse regression, we achieve consistency for optimal sample size $n \gtrsim (k \log d)/\alpha^2$.
In the context of PCA, we attain optimal error guarantees under broad spikiness assumptions on the parameter matrix.
arXiv Detail & Related papers (2021-11-04T15:59:44Z)
- Online nonparametric regression with Sobolev kernels [99.12817345416846]
We derive regret upper bounds on the classes of Sobolev spaces $W_p^\beta(\mathcal{X})$, $p \geq 2$, $\beta > \frac{d}{p}$.
The upper bounds are supported by a minimax regret analysis, which reveals that in the cases $\beta > \frac{d}{2}$ or $p = \infty$ these rates are (essentially) optimal.
arXiv Detail & Related papers (2021-02-06T15:05:14Z)
- Variance-Aware Confidence Set: Variance-Dependent Bound for Linear Bandits and Horizon-Free Bound for Linear Mixture MDP [76.94328400919836]
We show how to construct variance-aware confidence sets for linear bandits and linear mixture Markov decision processes (MDPs).
For linear bandits, we obtain an $\widetilde{O}(\mathrm{poly}(d)\sqrt{1 + \sum_{i=1}^{K}\sigma_i^2})$ regret bound, where $d$ is the feature dimension.
For linear mixture MDPs, we obtain an $\widetilde{O}(\mathrm{poly}(d)\sqrt{K})$ regret bound.
arXiv Detail & Related papers (2021-01-29T18:57:52Z)
- Outlier-robust sparse/low-rank least-squares regression and robust matrix completion [1.0878040851637998]
We study high-dimensional least-squares regression within a subgaussian statistical learning framework with heterogeneous noise.
We also present a novel theory of trace-regression with matrix decomposition based on a new application of the product process.
arXiv Detail & Related papers (2020-12-12T07:42:47Z)
- Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression, where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^* \rangle + \epsilon$.
We propose estimators for this problem under two settings: (i) $X$ is $L_4$-$L_2$ hypercontractive, $\mathbb{E}[XX^\top]$ has bounded condition number, and $\epsilon$ has bounded variance; and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is sub-Gaussian.
arXiv Detail & Related papers (2020-07-16T06:44:44Z)
- Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP).
We show that the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$, up to some logarithmic factor.
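A minimal tabular sketch of the asynchronous protocol described above: the Q-function is updated one visited (state, action) entry at a time along a single Markovian trajectory, rather than synchronously over all entries. The random MDP, $\varepsilon$-greedy behavior policy, step size, and horizon are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
S, A, gamma, lr = 6, 3, 0.9, 0.1
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)  # transitions
R = rng.random((S, A))                                        # rewards

Q = np.zeros((S, A))
s = 0
for t in range(50000):
    # Epsilon-greedy behavior policy generating the Markovian trajectory.
    a = rng.integers(A) if rng.random() < 0.2 else int(Q[s].argmax())
    s_next = rng.choice(S, p=P[s, a])
    # Asynchronous update: only the visited (s, a) entry changes.
    Q[s, a] += lr * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
```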
arXiv Detail & Related papers (2020-06-04T17:51:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.