KromHC: Manifold-Constrained Hyper-Connections with Kronecker-Product Residual Matrices
- URL: http://arxiv.org/abs/2601.21579v1
- Date: Thu, 29 Jan 2026 11:43:05 GMT
- Title: KromHC: Manifold-Constrained Hyper-Connections with Kronecker-Product Residual Matrices
- Authors: Wuyang Zhou, Yuxuan Gu, Giorgos Iacovides, Danilo Mandic
- Abstract summary: We propose KromHC, which uses the Kronecker products of smaller residual matrices to parametrize the residual matrix in mHC. Experiments demonstrate that KromHC matches or even outperforms state-of-the-art mHC variants, while requiring significantly fewer trainable parameters.
- Score: 6.968486021891596
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The success of Hyper-Connections (HC) in neural networks (NN) has also highlighted issues related to their training instability and restricted scalability. Manifold-Constrained Hyper-Connections (mHC) mitigate these challenges by projecting the residual connection space onto a Birkhoff polytope; however, this approach faces two issues: 1) its iterative Sinkhorn-Knopp (SK) algorithm does not always yield exactly doubly stochastic residual matrices; 2) mHC incurs a prohibitive $\mathcal{O}(n^3C)$ parameter complexity, with $n$ the width of the residual stream and $C$ the feature dimension. The recently proposed mHC-lite reparametrizes the residual matrix via the Birkhoff-von-Neumann theorem to guarantee double stochasticity, but faces a factorial explosion in its parameter complexity, $\mathcal{O}\left(nC \cdot n!\right)$. To address both challenges, we propose \textbf{KromHC}, which uses the \underline{Kro}necker products of smaller doubly stochastic matrices to parametrize the residual matrix in \underline{mHC}. By enforcing manifold constraints across the factor residual matrices along each mode of the tensorized residual stream, KromHC guarantees exact double stochasticity of the residual matrices while reducing the parameter complexity to $\mathcal{O}(n^2C)$. Comprehensive experiments demonstrate that KromHC matches or even outperforms state-of-the-art (SOTA) mHC variants while requiring significantly fewer trainable parameters. The code is available at \texttt{https://github.com/wz1119/KromHC}.
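The core algebraic fact behind the abstract is that the Kronecker product of doubly stochastic matrices is itself doubly stochastic, so parametrizing a large residual matrix via small factors keeps the manifold constraint exact. Below is a minimal numerical illustration of that property; the matrix sizes and the Birkhoff-style construction are illustrative choices, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_doubly_stochastic(n, k=5):
    """Build an exactly doubly stochastic matrix as a convex combination
    of k permutation matrices (Birkhoff-von Neumann theorem)."""
    perms = [np.eye(n)[rng.permutation(n)] for _ in range(k)]
    weights = rng.random(k)
    weights /= weights.sum()
    return sum(w * P for w, P in zip(weights, perms))

# Two small factor matrices, one per mode of a tensorized residual stream
A = random_doubly_stochastic(3)
B = random_doubly_stochastic(4)

# Their Kronecker product is a 12 x 12 matrix that is itself exactly
# doubly stochastic -- no Sinkhorn-Knopp iterations needed.
K = np.kron(A, B)
assert np.allclose(K.sum(axis=0), 1.0)
assert np.allclose(K.sum(axis=1), 1.0)

# Parametrizing the 12 x 12 matrix via its factors needs 3^2 + 4^2 = 25
# entries instead of 12^2 = 144, illustrating the parameter savings.
print(K.shape, 3**2 + 4**2, 12**2)
```

The same argument extends to more than two Kronecker factors, which is where the quoted complexity reduction from $\mathcal{O}(n^3C)$ to $\mathcal{O}(n^2C)$ comes from.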
Related papers
- Regularized Online RLHF with Generalized Bilinear Preferences [68.44113000390544]
We consider the problem of contextual online RLHF with general preferences. We adopt the Generalized Bilinear Preference Model to capture preferences via low-rank, skew-symmetric matrices. We prove that the dual gap of the greedy policy is bounded by the square of the estimation error.
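A sketch of why a low-rank, skew-symmetric matrix is a natural preference parametrization: skew-symmetry makes the pairwise preference probabilities self-consistent. The score form `x @ A @ y` and the logistic link are hypothetical illustrative choices, not necessarily the exact model in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
d, r = 6, 2  # feature dimension and rank (illustrative values)

# A low-rank skew-symmetric matrix: A = U V^T - V U^T has rank at most 2r
U = rng.standard_normal((d, r))
V = rng.standard_normal((d, r))
A = U @ V.T - V @ U.T
assert np.allclose(A, -A.T)  # skew-symmetry

def pref_prob(x, y):
    """Probability that item with features x is preferred over y under a
    bilinear preference model with score x^T A y (hypothetical form)."""
    return 1.0 / (1.0 + np.exp(-(x @ A @ y)))

x, y = rng.standard_normal(d), rng.standard_normal(d)
# Skew-symmetry makes the model consistent: P(x over y) + P(y over x) = 1
total = pref_prob(x, y) + pref_prob(y, x)
print(total)
```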
arXiv Detail & Related papers (2026-02-26T15:27:53Z) - mHC-lite: You Don't Need 20 Sinkhorn-Knopp Iterations [5.518733929171501]
Unconstrained residual matrices can compromise training stability. DeepSeek's Manifold-Constrained Hyper-Connections (mHC) approximately project these matrices onto the Birkhoff polytope via iterative Sinkhorn--Knopp (SK) normalization.
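The inexactness that both mHC-lite and KromHC target is easy to see in code: after any finite number of Sinkhorn-Knopp sweeps, only the side normalized last sums exactly to one. A minimal sketch of the iteration (illustrative sizes and seeds):

```python
import numpy as np

def sinkhorn_knopp(M, iters=20):
    """Alternate row/column normalization of a positive matrix, driving it
    toward the Birkhoff polytope of doubly stochastic matrices."""
    M = M.copy()
    for _ in range(iters):
        M /= M.sum(axis=1, keepdims=True)  # make rows sum to 1
        M /= M.sum(axis=0, keepdims=True)  # make columns sum to 1
    return M

rng = np.random.default_rng(1)
P = sinkhorn_knopp(rng.random((4, 4)) + 1e-3, iters=20)

# The final column normalization makes column sums exact (up to rounding),
# while the row sums remain only approximately 1 in general:
row_err = float(np.abs(P.sum(axis=1) - 1.0).max())
col_err = float(np.abs(P.sum(axis=0) - 1.0).max())
print(f"row error: {row_err:.2e}, column error: {col_err:.2e}")
```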
arXiv Detail & Related papers (2026-01-09T11:19:14Z) - Near-Optimal Clustering in Mixture of Markov Chains [74.3828414695655]
We study the problem of clustering $T$ trajectories of length $H$, each generated by one of $K$ unknown ergodic Markov chains over a finite state space of size $S$. We derive an instance-dependent, high-probability lower bound on the clustering error rate, governed by the weighted KL divergence between the transition kernels of the chains. We then present a novel two-stage clustering algorithm.
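As a small illustration of the quantity in the summary above, here is one natural reading of a "weighted KL divergence between transition kernels": row-wise KL between the kernels, weighted by the stationary distribution. This weighting choice is an assumption for illustration, not necessarily the paper's exact definition.

```python
import numpy as np

def stationary(P):
    """Stationary distribution of an ergodic transition matrix P."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

def weighted_kl(P, Q):
    """Row-wise KL divergence between transition kernels P and Q,
    weighted by P's stationary distribution (illustrative definition)."""
    pi = stationary(P)
    row_kl = (P * np.log(P / Q)).sum(axis=1)
    return float(pi @ row_kl)

# Two distinct 2-state ergodic chains
P = np.array([[0.9, 0.1], [0.2, 0.8]])
Q = np.array([[0.6, 0.4], [0.5, 0.5]])
print(weighted_kl(P, P), weighted_kl(P, Q))  # zero vs. strictly positive
```

Chains whose kernels are close under this divergence are intrinsically harder to tell apart from trajectories, which is the intuition behind an instance-dependent lower bound.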
arXiv Detail & Related papers (2025-06-02T05:10:40Z) - Entropy and singular-value moments of products of truncated random unitary matrices [0.0]
Products of truncated unitary matrices can be used to study universal aspects of monitored quantum circuits. The entropy reduction crosses over from a linear to a logarithmic dependence on $\tau$ when this parameter crosses unity. The result is an expression for the singular-value moments of the matrix product in terms of the Erlang function from queueing theory.
arXiv Detail & Related papers (2025-01-19T15:46:08Z) - Reducing QUBO Density by Factoring Out Semi-Symmetries [4.581191399651181]
We introduce the concept of semi-symmetries in QUBO matrices. We show that our algorithm reduces the number of couplings and circuit depth by up to 45%.
arXiv Detail & Related papers (2024-12-18T12:05:18Z) - Reducing QAOA Circuit Depth by Factoring out Semi-Symmetries [4.958204128486634]
We show that our modified QUBO matrix $Q_{\mathrm{Hamilton}}$ describes the same energy spectrum as the original $Q$.
Our algorithm achieved reductions in the number of couplings by up to 49% and in circuit depth by up to 41%.
arXiv Detail & Related papers (2024-11-13T18:04:01Z) - Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z) - Multi-block-Single-probe Variance Reduced Estimator for Coupled Compositional Optimization [49.58290066287418]
We propose a novel method named Multi-block-Single-probe Variance Reduced (MSVR) estimator to alleviate the complexity of compositional problems. Our results improve upon prior ones in several aspects, including the order of sample complexities and the dependence on strong convexity.
arXiv Detail & Related papers (2022-07-18T12:03:26Z) - Perturbational Complexity by Distribution Mismatch: A Systematic Analysis of Reinforcement Learning in Reproducing Kernel Hilbert Space [0.76146285961466]
We analyze reinforcement learning in a general reproducing kernel Hilbert space (RKHS). We consider a family of Markov decision processes $\mathcal{M}$ whose reward functions lie in the unit ball of an RKHS.
We show that when the reward functions lie in a high dimensional RKHS, even if the transition probability is known and the action space is finite, it is still possible for RL problems to suffer from the curse of dimensionality.
arXiv Detail & Related papers (2021-11-05T12:46:04Z) - Annihilating Entanglement Between Cones [77.34726150561087]
We show that Lorentz cones are the only cones with a symmetric base for which a certain stronger version of the resilience property is satisfied.
Our proof exploits the symmetries of the Lorentz cones and applies two constructions resembling protocols for entanglement distillation.
arXiv Detail & Related papers (2021-10-22T15:02:39Z) - Optimal policy evaluation using kernel-based temporal difference methods [78.83926562536791]
We use reproducing kernel Hilbert spaces for estimating the value function of an infinite-horizon discounted Markov reward process.
We derive a non-asymptotic upper bound on the error with explicit dependence on the eigenvalues of the associated kernel operator.
We prove minimax lower bounds over sub-classes of MRPs.
arXiv Detail & Related papers (2021-09-24T14:48:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.