Anchor-MoE: A Mean-Anchored Mixture of Experts For Probabilistic Regression
- URL: http://arxiv.org/abs/2508.16802v1
- Date: Fri, 22 Aug 2025 21:12:41 GMT
- Title: Anchor-MoE: A Mean-Anchored Mixture of Experts For Probabilistic Regression
- Authors: Baozhuo Su, Zhengxian Qu
- Abstract summary: We present an Anchored Mixture of Experts (Anchor-MoE) model that handles both probabilistic and point regression. We show that Anchor-MoE attains the minimax-optimal $L^2$ risk rate. It consistently matches or surpasses the strong NGBoost baseline in RMSE and NLL.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Regression under uncertainty is fundamental across science and engineering. We present an Anchored Mixture of Experts (Anchor-MoE), a model that handles both probabilistic and point regression. For simplicity, we use a tuned gradient-boosting model to furnish the anchor mean; however, any off-the-shelf point regressor can serve as the anchor. The anchor prediction is projected into a latent space, where a learnable metric-window kernel scores locality and a soft router dispatches each sample to a small set of mixture-density-network experts; the experts produce a heteroscedastic correction and predictive variance. We train by minimizing negative log-likelihood, and on a disjoint calibration split fit a post-hoc linear map on predicted means to improve point accuracy. On the theory side, assuming a H\"older smooth regression function of order~$\alpha$ and fixed Lipschitz partition-of-unity weights with bounded overlap, we show that Anchor-MoE attains the minimax-optimal $L^2$ risk rate $O\!\big(N^{-2\alpha/(2\alpha+d)}\big)$. In addition, the CRPS test generalization gap scales as $\widetilde{O}\!\Big(\sqrt{(\log(Mh)+P+K)/N}\Big)$; it is logarithmic in $Mh$ and scales as the square root in $P$ and $K$. Under bounded-overlap routing, $K$ can be replaced by $k$, and any dependence on a latent dimension is absorbed into $P$. Under uniformly bounded means and variances, an analogous $\widetilde{O}\!\big(\sqrt{(\log(Mh)+P+K)/N}\big)$ scaling holds for the test NLL up to constants. Empirically, across standard UCI regressions, Anchor-MoE consistently matches or surpasses the strong NGBoost baseline in RMSE and NLL; on several datasets it achieves new state-of-the-art probabilistic regression results on our benchmark suite. Code is available at https://github.com/BaozhuoSU/Probabilistic_Regression.
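The abstract describes a concrete pipeline: an off-the-shelf anchor regressor supplies a mean, a kernel-based soft router over a latent space dispatches samples to mixture-density-network experts, and the experts emit a heteroscedastic correction and variance trained by NLL. Below is a minimal, hypothetical PyTorch sketch of that pipeline. All class names, layer sizes, and the simplified single-Gaussian experts are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class AnchorMoE(nn.Module):
    """Toy Anchor-MoE: anchored mean + soft-routed MDN-style experts."""

    def __init__(self, in_dim: int, latent_dim: int = 16, n_experts: int = 4):
        super().__init__()
        # Project [features, anchor mean] into the latent routing space.
        self.proj = nn.Linear(in_dim + 1, latent_dim)
        # Learnable centers/bandwidths play the role of a metric-window kernel.
        self.centers = nn.Parameter(torch.randn(n_experts, latent_dim))
        self.log_bw = nn.Parameter(torch.zeros(n_experts))
        # Each expert emits a correction to the anchor and a log-variance.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 2))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor, anchor_mean: torch.Tensor):
        z = self.proj(torch.cat([x, anchor_mean[:, None]], dim=1))
        # Kernel locality scores -> soft routing weights over experts.
        d2 = ((z[:, None, :] - self.centers[None]) ** 2).sum(-1)    # (B, E)
        w = torch.softmax(-d2 * torch.exp(-self.log_bw), dim=1)
        out = torch.stack([e(z) for e in self.experts], dim=1)      # (B, E, 2)
        mean = anchor_mean + (w * out[..., 0]).sum(1)  # heteroscedastic correction
        var = (w * out[..., 1].exp()).sum(1) + 1e-6    # predictive variance
        return mean, var

def nll_loss(mean, var, y):
    # Gaussian negative log-likelihood, the training objective in the abstract.
    return 0.5 * (var.log() + (y - mean) ** 2 / var).mean()

# Usage sketch: the anchor mean may come from any off-the-shelf point
# regressor (the paper tunes a gradient-boosting model); a post-hoc linear
# map a*mean + b, fit on a disjoint calibration split, would then refine
# point accuracy as described in the abstract.
```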
Related papers
- Optimal Unconstrained Self-Distillation in Ridge Regression: Strict Improvements, Precise Asymptotics, and One-Shot Tuning [61.07540493350384]
Self-distillation (SD) is the process of retraining a student on a mixture of ground-truth labels and the teacher's own predictions. We show that, for any prediction risk, the optimally mixed student improves upon the ridge teacher at every regularization level. We propose a consistent one-shot tuning method to estimate the optimal mixing weight without grid search, sample splitting, or refitting.
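As a concrete illustration of the setup in this summary, here is a toy numpy sketch; the mixing weight `lam`, the regularization level, and the synthetic data are arbitrary stand-ins for the quantities the paper tunes in one shot.

```python
import numpy as np

def ridge(X, y, reg):
    # Closed-form ridge solution (X^T X + reg I)^{-1} X^T y.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X @ rng.normal(size=20) + rng.normal(size=200)

reg, lam = 1.0, 0.3                            # hypothetical choices
w_teacher = ridge(X, y, reg)                   # ridge teacher
y_mix = (1 - lam) * y + lam * (X @ w_teacher)  # mix labels with teacher preds
w_student = ridge(X, y_mix, reg)               # self-distilled student
```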
arXiv Detail & Related papers (2026-02-19T17:21:15Z)
- Stabilizing Fixed-Point Iteration for Markov Chain Poisson Equations [49.702772230127465]
We study finite-state Markov chains with $n$ states and transition matrix $P$. We show that all non-decaying modes are captured by a real peripheral invariant subspace $\mathcal{K}(P)$, and that the induced operator on the quotient space $\mathbb{R}^n/\mathcal{K}(P)$ is strictly contractive, yielding a unique quotient solution.
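To make the object concrete, here is a small numpy sketch of a centered fixed-point iteration for the Poisson equation $(I - P)h = r - \rho\mathbf{1}$, where $\rho$ is the stationary average reward; the centering step plays the role of the quotient construction in the summary. Convergence here assumes an aperiodic, irreducible chain, and the 5-state random chain is an arbitrary example, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.random((5, 5)); P /= P.sum(axis=1, keepdims=True)  # transition matrix
r = rng.random(5)                                          # one-step rewards

# Stationary distribution pi: left eigenvector of P at eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))]); pi /= pi.sum()
rho = pi @ r                                               # average reward

h = np.zeros(5)
for _ in range(2000):
    h = (r - rho) + P @ h   # fixed-point step for (I - P) h = r - rho*1
    h -= pi @ h             # project out the constant (peripheral) mode

assert np.allclose(h - P @ h, r - rho, atol=1e-8)
```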
arXiv Detail & Related papers (2026-01-31T02:57:01Z)
- Closed-form $\ell_r$ norm scaling with data for overparameterized linear regression and diagonal linear networks under $\ell_p$ bias [0.0]
We give a unified, high-probability characterization for the scaling of the family of parameter norms. We then study diagonal linear networks trained by gradient descent.
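A toy numpy illustration of the kind of quantity characterized above: the minimum-$\ell_2$-norm interpolator in overparameterized linear regression, whose $\ell_r$ norms can be tracked as the sample size $n$ grows. The dimensions and Gaussian data are arbitrary choices, and the diagonal-network part of the paper is not sketched.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 500                                   # ambient dimension, d > n throughout
for n in (25, 50, 100, 200):
    X = rng.normal(size=(n, d)) / np.sqrt(d)
    y = rng.normal(size=n)
    w = np.linalg.pinv(X) @ y             # minimum-l2-norm interpolator
    norms = {r: np.sum(np.abs(w) ** r) ** (1 / r) for r in (1, 2, 4)}
    print(n, {r: round(v, 3) for r, v in norms.items()})
```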
arXiv Detail & Related papers (2025-09-25T13:59:22Z)
- Data subsampling for Poisson regression with $p$th-root-link [53.63838219437508]
We develop and analyze data subsampling techniques for Poisson regression.
In particular, we consider the Poisson generalized linear model with identity- and square-root-link functions.
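A hedged sketch of the ingredients named in this entry: a Poisson GLM with a square-root link (mean $\mu = (x^\top w)^2$, which keeps $\mu$ nonnegative), fit by maximum likelihood on a uniform subsample. The paper analyzes more refined subsampling schemes; uniform sampling and all sizes below are simplifying assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
X = rng.normal(size=(5000, 5))
w_true = rng.normal(size=5)
y = rng.poisson((X @ w_true) ** 2)               # square-root-link model

idx = rng.choice(len(X), size=500, replace=False)  # uniform subsample
Xs, ys = X[idx], y[idx]

def nll(w):
    # Poisson NLL up to constants: sum(mu - y * log mu).
    mu = (Xs @ w) ** 2 + 1e-9
    return np.sum(mu - ys * np.log(mu))

w_hat = minimize(nll, x0=np.ones(5), method="L-BFGS-B").x
```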
arXiv Detail & Related papers (2024-10-30T10:09:05Z)
- Stable Minima Cannot Overfit in Univariate ReLU Networks: Generalization by Large Step Sizes [29.466981306355066]
We show that gradient descent with a fixed learning rate $\eta$ can only find local minima that represent smooth functions.
We also prove a nearly-optimal MSE bound of $\widetilde{O}(n^{-4/5})$ within the strict interior of the support of the $n$ data points.
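For orientation, a minimal numpy setup of the regime in this summary: a univariate two-layer ReLU network trained by full-batch gradient descent at a fixed step size. The toy data, width, and step size are arbitrary; the sketch only instantiates the dynamics the theory is about.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=32)                 # n = 32 univariate inputs
y = np.sin(3 * x) + 0.1 * rng.normal(size=32)   # arbitrary toy targets

m = 64                                          # hidden width
W = rng.normal(size=m)
b = rng.normal(size=m)
a = rng.normal(size=m) / m                      # small output layer

eta = 0.1                                       # fixed learning rate
for _ in range(5000):
    pre = np.outer(x, W) + b                    # (n, m) pre-activations
    h = np.maximum(pre, 0.0)
    g = 2 * (h @ a - y) / len(x)                # d(MSE)/d(prediction)
    gm = (pre > 0) * g[:, None]                 # gradient routed through ReLU
    a_new = a - eta * (h.T @ g)
    W -= eta * a * (gm * x[:, None]).sum(axis=0)
    b -= eta * a * gm.sum(axis=0)
    a = a_new
```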
arXiv Detail & Related papers (2024-06-10T22:57:27Z)
- Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit Feedback and Unknown Transition [71.33787410075577]
We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses.
We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability.
arXiv Detail & Related papers (2024-03-07T15:03:50Z)
- Robust Nonparametric Regression under Poisoning Attack [13.470899588917716]
An adversarial attacker can modify the values of up to $q$ samples from a training dataset of size $N$.
Our initial solution is an M-estimator based on Huber loss minimization.
The final estimate is nearly minimax optimal for arbitrary $q$, up to a $\ln N$ factor.
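The initial estimator mentioned above can be sketched compactly: a Huber-loss M-estimator (with a linear model here for brevity, rather than the paper's nonparametric setting) that caps the influence of the $q$ poisoned responses. The contamination pattern and the threshold $\delta = 1$ are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=300)
y[:30] += 50.0                                  # q = 30 poisoned responses

def huber(res, delta=1.0):
    # Quadratic near zero, linear in the tails: bounded influence.
    small = np.abs(res) <= delta
    return np.where(small, 0.5 * res**2, delta * (np.abs(res) - 0.5 * delta))

w_hat = minimize(lambda w: huber(y - X @ w).sum(), x0=np.zeros(3)).x
```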
arXiv Detail & Related papers (2023-05-26T09:33:17Z)
- $p$-Generalized Probit Regression and Scalable Maximum Likelihood Estimation via Sketching and Coresets [74.37849422071206]
We study the $p$-generalized probit regression model, which is a generalized linear model for binary responses.
We show how the maximum likelihood estimator for $p$-generalized probit regression can be approximated efficiently, up to a factor of $(1+\varepsilon)$, on large data.
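For reference, a sketch of the $p = 2$ special case: classical probit regression fit by maximum likelihood. The $p$-generalized model would replace the Gaussian CDF below with a $p$-generalized normal CDF, and the sketching/coreset machinery of the paper is not shown; the synthetic data are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(6)
X = rng.normal(size=(1000, 4))
beta_true = np.array([1.0, -1.0, 0.5, 0.0])
ybin = (rng.uniform(size=1000) < norm.cdf(X @ beta_true)).astype(float)

def nll(beta):
    # Bernoulli NLL with probit (Gaussian CDF) success probabilities.
    p = np.clip(norm.cdf(X @ beta), 1e-12, 1 - 1e-12)
    return -(ybin * np.log(p) + (1 - ybin) * np.log(1 - p)).sum()

beta_hat = minimize(nll, x0=np.zeros(4)).x
```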
arXiv Detail & Related papers (2022-03-25T10:54:41Z)
- Consistent Estimation for PCA and Sparse Regression with Oblivious Outliers [13.244654316770815]
We develop machinery to design efficiently computable and consistent estimators.
For sparse regression, we achieve consistency for optimal sample size $n \gtrsim (k \log d)/\alpha^2$.
In the context of PCA, we attain optimal error guarantees under broad spikiness assumptions on the parameter matrix.
arXiv Detail & Related papers (2021-11-04T15:59:44Z)
- Online nonparametric regression with Sobolev kernels [99.12817345416846]
We derive regret upper bounds on the classes of Sobolev spaces $W_p^\beta(\mathcal{X})$, $p \geq 2$, $\beta > \frac{d}{p}$.
The upper bounds are supported by a minimax regret analysis, which reveals that in the cases $\beta > \frac{d}{2}$ or $p = \infty$ these rates are (essentially) optimal.
arXiv Detail & Related papers (2021-02-06T15:05:14Z)
- Variance-Aware Confidence Set: Variance-Dependent Bound for Linear Bandits and Horizon-Free Bound for Linear Mixture MDP [76.94328400919836]
We show how to construct variance-aware confidence sets for linear bandits and linear mixture Markov decision processes (MDPs).
For linear bandits, we obtain an $\widetilde{O}(\mathrm{poly}(d)\sqrt{1 + \sum_{i=1}^{K}\sigma_i^2})$ regret bound, where $d$ is the feature dimension.
For linear mixture MDPs, we obtain an $\widetilde{O}(\mathrm{poly}(d)\sqrt{K})$ regret bound.
arXiv Detail & Related papers (2021-01-29T18:57:52Z)
- Outlier-robust sparse/low-rank least-squares regression and robust matrix completion [1.0878040851637998]
We study high-dimensional least-squares regression within a subgaussian statistical learning framework with heterogeneous noise.
We also present a novel theory of trace-regression with matrix decomposition based on a new application of the product process.
arXiv Detail & Related papers (2020-12-12T07:42:47Z)
- Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression, where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^* \rangle + \epsilon$.
We propose estimators for this problem under two settings: (i) $X$ is $L_4$-$L_2$ hypercontractive, $\mathbb{E}[XX^\top]$ has bounded condition number, and $\epsilon$ has bounded variance; and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is sub-Gaussian.
arXiv Detail & Related papers (2020-07-16T06:44:44Z)
- Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP).
We show that the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$, up to some logarithmic factor.
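A minimal tabular sketch of the asynchronous protocol described above: the Q-function is updated one visited (state, action) entry at a time along a single Markovian trajectory, rather than synchronously over all entries. The random MDP, $\varepsilon$-greedy behavior policy, step size, and horizon are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
S, A, gamma, lr = 6, 3, 0.9, 0.1
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)  # transitions
R = rng.random((S, A))                                        # rewards

Q = np.zeros((S, A))
s = 0
for t in range(50000):
    # Epsilon-greedy behavior policy generating the Markovian trajectory.
    a = rng.integers(A) if rng.random() < 0.2 else int(Q[s].argmax())
    s_next = rng.choice(S, p=P[s, a])
    # Asynchronous update: only the visited (s, a) entry changes.
    Q[s, a] += lr * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
```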
arXiv Detail & Related papers (2020-06-04T17:51:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.