mHC-lite: You Don't Need 20 Sinkhorn-Knopp Iterations
- URL: http://arxiv.org/abs/2601.05732v1
- Date: Fri, 09 Jan 2026 11:19:14 GMT
- Title: mHC-lite: You Don't Need 20 Sinkhorn-Knopp Iterations
- Authors: Yongyi Yang, Jianyang Gao
- Abstract summary: Unconstrained residual matrices can compromise training stability. DeepSeek's Manifold-Constrained Hyper-Connections (mHC) approximately projects these matrices onto the Birkhoff polytope via iterative Sinkhorn--Knopp (SK) normalization.
- Score: 5.518733929171501
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hyper-Connections (HC) generalizes residual connections by introducing dynamic residual matrices that mix information across multiple residual streams, accelerating convergence in deep neural networks. However, unconstrained residual matrices can compromise training stability. To address this, DeepSeek's Manifold-Constrained Hyper-Connections (mHC) approximately projects these matrices onto the Birkhoff polytope via iterative Sinkhorn--Knopp (SK) normalization. We identify two limitations of this approach: (i) finite SK iterations do not guarantee exact doubly stochasticity, leaving an approximation gap that can accumulate through network depth and undermine stability; (ii) efficient SK implementation requires highly specialized CUDA kernels, raising engineering barriers and reducing portability. Motivated by the Birkhoff--von Neumann theorem, we propose mHC-lite, a simple reparameterization that explicitly constructs doubly stochastic matrices as convex combinations of permutation matrices. This approach guarantees exact doubly stochasticity by construction and can be implemented using only native matrix operations. Extensive experiments demonstrate that mHC-lite matches or exceeds mHC in performance while achieving higher training throughput with a naive implementation and eliminating the residual instabilities observed in both HC and mHC. The code is publicly available at https://github.com/FFTYYY/mhc-lite.
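The contrast between the two approaches in the abstract can be sketched in NumPy. This is a minimal illustration, not the paper's actual implementation (see the linked repository for that): the function names, the fixed random permutation basis, and the softmax weighting are assumptions made for the sketch.

```python
import numpy as np

def sinkhorn_knopp(M, n_iters=20):
    """mHC-style approximate projection onto the Birkhoff polytope:
    alternately normalize rows and columns. Finite iterations leave
    a residual approximation gap on whichever axis was normalized first."""
    M = np.exp(M)  # ensure strictly positive entries
    for _ in range(n_iters):
        M = M / M.sum(axis=1, keepdims=True)  # row-normalize
        M = M / M.sum(axis=0, keepdims=True)  # column-normalize
    return M

def mhc_lite(logits, perms):
    """mHC-lite-style reparameterization: an exactly doubly stochastic
    matrix built as a convex combination of permutation matrices
    (Birkhoff--von Neumann), using one softmax weight per permutation."""
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                 # softmax -> convex weights
    n = len(perms[0])
    P = np.zeros((n, n))
    for w, p in zip(weights, perms):
        P[np.arange(n), p] += w              # add w * permutation matrix
    return P

rng = np.random.default_rng(0)
n = 4
perms = [rng.permutation(n) for _ in range(6)]
P = mhc_lite(rng.normal(size=6), perms)
# doubly stochastic by construction (up to float rounding):
assert np.allclose(P.sum(axis=0), 1.0) and np.allclose(P.sum(axis=1), 1.0)

M_sk = sinkhorn_knopp(rng.normal(size=(n, n)))
# only the axis normalized last is exact; the other retains a gap
assert np.allclose(M_sk.sum(axis=0), 1.0)
```

The key point of the reparameterization is visible here: `mhc_lite` needs only a softmax and a weighted sum of permutation matrices, both native tensor operations, whereas SK needs an iterative loop whose fixed point is never reached exactly.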
Related papers
- JPmHC: Dynamical Isometry via Orthogonal Hyper-Connections [2.4311915994390403]
JPmHC is a framework that replaces identity skips with a trainable linear mixer acting on n parallel streams. It prevents gradient pathologies and enhances stability. It achieves faster convergence, higher accuracy, and lower computational cost compared to bistochastic baselines.
arXiv Detail & Related papers (2026-02-20T16:06:01Z) - KromHC: Manifold-Constrained Hyper-Connections with Kronecker-Product Residual Matrices [6.968486021891596]
We propose KromHC, which uses Kronecker products of smaller residual matrices to parametrize the residual matrix in mHC. Experiments demonstrate that KromHC matches or even outperforms state-of-the-art mHC variants, while requiring significantly fewer trainable parameters.
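A hypothetical sketch of the parameter saving behind this idea: a large residual matrix is parametrized as the Kronecker product of two small factors, cutting the parameter count from (ab)² to a² + b². Conveniently, the Kronecker product of two doubly stochastic matrices is itself doubly stochastic, so the structure composes with manifold constraints.

```python
import numpy as np

# Two small doubly stochastic factors (2x2 and 3x3) parametrize
# a 6x6 residual matrix with 4 + 9 parameters instead of 36.
A = np.array([[0.7, 0.3],
              [0.3, 0.7]])          # 2x2 doubly stochastic
B = np.full((3, 3), 1.0 / 3.0)      # 3x3 doubly stochastic
M = np.kron(A, B)                   # 6x6, still doubly stochastic

assert M.shape == (6, 6)
assert np.allclose(M.sum(axis=0), 1.0)
assert np.allclose(M.sum(axis=1), 1.0)
```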
arXiv Detail & Related papers (2026-01-29T11:43:05Z) - Concatenated Matrix SVD: Compression Bounds, Incremental Approximation, and Error-Constrained Clustering [0.0]
We propose three clustering algorithms that merge matrices only when their predicted joint SVD compression error remains below a user-specified threshold. The algorithms span a trade-off between speed, provable accuracy, and scalability, enabling compression-aware clustering with explicit error control.
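The error-constrained merge test described above can be illustrated with a short NumPy sketch; the function names and the specific merge criterion are assumptions for illustration, not the paper's algorithms.

```python
import numpy as np

def rank_k_error(M, k):
    """Relative Frobenius error of the best rank-k approximation,
    read off from the discarded singular values (Eckart--Young)."""
    s = np.linalg.svd(M, compute_uv=False)
    return np.sqrt(np.sum(s[k:] ** 2)) / np.linalg.norm(M)

def should_merge(X, Y, k, tol):
    """Merge two matrices into one jointly compressed SVD only if
    the joint rank-k compression error stays below tol."""
    return rank_k_error(np.concatenate([X, Y], axis=1), k) <= tol

u = np.array([[1.0], [2.0], [3.0], [4.0]])
X = u @ np.array([[1.0, 0.0, 1.0]])   # rank-1, column space span(u)
Y = u @ np.array([[2.0, 2.0]])        # same column space as X
Z = np.array([[1.0], [-1.0], [1.0], [-1.0]]) @ np.array([[1.0, 1.0]])

assert should_merge(X, Y, k=1, tol=1e-8)       # joint rank is still 1
assert not should_merge(X, Z, k=1, tol=1e-8)   # joint rank is 2
```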
arXiv Detail & Related papers (2026-01-12T18:15:53Z) - Quantum Simulation of Non-unitary Dynamics via Contour-based Matrix Decomposition [6.538464633253838]
We introduce contour-based matrix decomposition (CBMD), a framework for scalable simulation of non-unitary dynamics. CBMD generalizes Cauchy's residue theorem to matrix-valued functions and directly decomposes a non-Hermitian function into a linear combination of Hermitian ones.
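The elementary special case of such a decomposition is worth recalling: any square matrix splits exactly into a linear combination of two Hermitian matrices. CBMD generalizes this kind of splitting to matrix-valued functions via contour integrals; the sketch below shows only the basic algebraic fact, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3)) + 1j * rng.normal(size=(3, 3))

# A = H + iK with both H and K Hermitian:
H = (A + A.conj().T) / 2
K = (A - A.conj().T) / (2j)

assert np.allclose(H, H.conj().T)   # H is Hermitian
assert np.allclose(K, K.conj().T)   # K is Hermitian
assert np.allclose(A, H + 1j * K)   # exact recombination
```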
arXiv Detail & Related papers (2025-11-13T12:52:52Z) - Graph-based Clustering Revisited: A Relaxation of Kernel $k$-Means Perspective [73.18641268511318]
We propose a graph-based clustering algorithm that relaxes only the orthonormal constraint to derive clustering results. To keep the doubly stochastic constraint amenable to gradient-based optimization, we transform the non-negative constraint into a class probability parameter.
arXiv Detail & Related papers (2025-09-23T09:14:39Z) - MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation [74.34220141721231]
We present MPQ-DMv2, an improved Mixed Precision Quantization framework for extremely low-bit Diffusion Models.
arXiv Detail & Related papers (2025-07-06T08:16:50Z) - BOLT: Block-Orthonormal Lanczos for Trace estimation of matrix functions [2.4578723416255754]
In many large-scale applications, the matrices involved are too large to store or access in full, making a single mat-vec product infeasible. We introduce Subblock SLQ, a variant of BOLT that operates only on small principal submatrices. We provide theoretical guarantees and demonstrate strong empirical performance across a range of high-dimensional settings.
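The matrix-free setting these trace estimators target can be illustrated with the plain Hutchinson estimator, which SLQ-family methods (including BOLT) refine with block Lanczos quadrature to handle tr(f(A)); the sketch below is only the base estimator at f = identity, with illustrative names.

```python
import numpy as np

def hutchinson_trace(matvec, n, n_probes=100, seed=0):
    """Matrix-free trace estimate: E[z^T A z] = tr(A) for
    Rademacher probes z. Only mat-vec products are needed,
    never the full matrix."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=n)
        total += z @ matvec(z)
    return total / n_probes

A = np.diag(np.arange(1.0, 6.0))               # trace = 15
est = hutchinson_trace(lambda v: A @ v, 5)
# for a diagonal A the estimate is exact, since z_i^2 == 1
assert abs(est - 15.0) < 1e-9
```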
arXiv Detail & Related papers (2025-05-18T08:04:05Z) - Randomized semi-quantum matrix processing [0.0]
We present a hybrid quantum-classical framework for simulating generic matrix functions.
The method is based on randomization over the Chebyshev approximation of the target function.
We prove advantages on average depths, including quadratic speed-ups on costly parameters.
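The Chebyshev machinery underlying this randomized scheme can be sketched classically: f(A)b is applied through the three-term Chebyshev recurrence using only mat-vec products. The semi-quantum method samples terms of such an expansion at random; the code below is the plain deterministic evaluation, with illustrative names, assuming the spectrum of A lies in [-1, 1].

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def cheb_matfunc_apply(A, b, f, degree=30):
    """Apply f(A) to b via a Chebyshev interpolant of f on [-1, 1],
    using the recurrence T_{k+1}(A)b = 2A T_k(A)b - T_{k-1}(A)b."""
    coeffs = C.chebinterpolate(f, degree)
    t_prev, t_curr = b, A @ b                 # T0(A)b, T1(A)b
    out = coeffs[0] * t_prev + coeffs[1] * t_curr
    for c in coeffs[2:]:
        t_prev, t_curr = t_curr, 2 * (A @ t_curr) - t_prev
        out += c * t_curr
    return out

A = np.diag([0.5, -0.3, 0.1])                 # spectrum inside [-1, 1]
b = np.ones(3)
y = cheb_matfunc_apply(A, b, np.exp)
assert np.allclose(y, np.exp(np.array([0.5, -0.3, 0.1])))
```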
arXiv Detail & Related papers (2023-07-21T18:00:28Z) - Reconstructing Kernel-based Machine Learning Force Fields with Super-linear Convergence [0.18416014644193063]
We consider the broad class of Nyström-type methods to construct preconditioners.
All considered methods aim to identify a representative subset of inducing (kernel) columns to approximate the dominant kernel spectrum.
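The column-subset approximation at the core of these preconditioners is the Nyström formula K ≈ C W⁺ Cᵀ, where C holds the selected columns and W the corresponding principal submatrix. A minimal sketch (the column-selection strategy, which is what the paper studies, is fixed by hand here):

```python
import numpy as np

def nystrom_approx(K, idx):
    """Nystrom approximation K ~ C W^+ C^T from the column subset
    `idx`. Exact whenever the selected columns span the range of K."""
    C = K[:, idx]
    W = K[np.ix_(idx, idx)]
    return C @ np.linalg.pinv(W) @ C.T

# Rank-2 PSD kernel matrix; the first two columns span its range,
# so the Nystrom approximation from idx = [0, 1] is exact.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, -1.0]])
K = X @ X.T
K_hat = nystrom_approx(K, [0, 1])
assert np.allclose(K_hat, K)
```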
arXiv Detail & Related papers (2022-12-24T13:45:50Z) - Semi-Supervised Subspace Clustering via Tensor Low-Rank Representation [64.49871502193477]
We propose a novel semi-supervised subspace clustering method, which is able to simultaneously augment the initial supervisory information and construct a discriminative affinity matrix.
Comprehensive experimental results on six commonly-used benchmark datasets demonstrate the superiority of our method over state-of-the-art methods.
arXiv Detail & Related papers (2022-05-21T01:47:17Z) - Optimal policy evaluation using kernel-based temporal difference methods [78.83926562536791]
We use reproducing kernel Hilbert spaces for estimating the value function of an infinite-horizon discounted Markov reward process.
We derive a non-asymptotic upper bound on the error with explicit dependence on the eigenvalues of the associated kernel operator.
We prove minimax lower bounds over sub-classes of MRPs.
arXiv Detail & Related papers (2021-09-24T14:48:20Z) - Self-supervised Symmetric Nonnegative Matrix Factorization [82.59905231819685]
Symmetric nonnegative matrix factorization (SNMF) has been demonstrated to be a powerful method for data clustering.
Inspired by ensemble clustering, which aims to seek better clustering results, we propose self-supervised SNMF (S³NMF).
We take advantage of the sensitivity-to-initialization characteristic of SNMF, without relying on any additional information.
arXiv Detail & Related papers (2021-03-02T12:47:40Z) - Multi-Objective Matrix Normalization for Fine-grained Visual Recognition [153.49014114484424]
Bilinear pooling achieves great success in fine-grained visual recognition (FGVC).
Recent methods have shown that the matrix power normalization can stabilize the second-order information in bilinear features.
We propose an efficient Multi-Objective Matrix Normalization (MOMN) method that can simultaneously normalize a bilinear representation.
arXiv Detail & Related papers (2020-03-30T08:40:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.