Hierarchical LoRA MoE for Efficient CTR Model Scaling
- URL: http://arxiv.org/abs/2510.10432v1
- Date: Sun, 12 Oct 2025 03:54:11 GMT
- Title: Hierarchical LoRA MoE for Efficient CTR Model Scaling
- Authors: Zhichen Zeng, Mengyue Hang, Xiaolong Liu, Xiaoyi Liu, Xiao Lin, Ruizhong Qiu, Tianxin Wei, Zhining Liu, Siyang Yuan, Chaofei Yang, Yiqun Liu, Hang Yin, Jiyan Yang, Hanghang Tong,
- Abstract summary: HiLoMoE is a hierarchical LoRA MoE framework that enables holistic scaling in a parameter-efficient manner.<n>Unlike conventional stacking, HiLoMoE routes based on prior layer scores rather than outputs, allowing all layers to execute in parallel.
- Score: 56.608809143548946
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep models have driven significant advances in click-through rate (CTR) prediction. While vertical scaling via layer stacking improves model expressiveness, the layer-by-layer sequential computation poses challenges to efficient scaling. Conversely, horizontal scaling through Mixture of Experts (MoE) achieves efficient scaling by activating a small subset of experts in parallel, but flat MoE layers may struggle to capture the hierarchical structure inherent in recommendation tasks. To push the Return-On-Investment (ROI) boundary, we explore the complementary strengths of both directions and propose HiLoMoE, a hierarchical LoRA MoE framework that enables holistic scaling in a parameter-efficient manner. Specifically, HiLoMoE employs lightweight rank-1 experts for parameter-efficient horizontal scaling, and stacks multiple MoE layers with hierarchical routing to enable combinatorially diverse expert compositions. Unlike conventional stacking, HiLoMoE routes based on prior layer scores rather than outputs, allowing all layers to execute in parallel. A principled three-stage training framework ensures stable optimization and expert diversity. Experiments on four public datasets show that HiLoMoE achieving better performance-efficiency tradeoff, achieving an average AUC improvement of 0.20\% in AUC and 18.5\% reduction in FLOPs compared to the non-MoE baseline.
Related papers
- Mixture of Universal Experts: Scaling Virtual Width via Depth-Width Transformation [49.44855760291454]
Mixture-of-Experts (MoE) decouples model capacity from per-token computation.<n>MoE generalization introduces a novel scaling dimension: Virtual Width.<n>MoE consistently outperforms matched MoE baselines by up to 1.3% across scaling regimes.
arXiv Detail & Related papers (2026-03-05T09:07:45Z) - DynaMoE: Dynamic Token-Level Expert Activation with Layer-Wise Adaptive Capacity for Mixture-of-Experts Neural Networks [0.0]
Mixture-of-Experts (MoE) architectures have emerged as a powerful paradigm for scaling neural networks while maintaining computational efficiency.<n>This paper introduces DynaMoE, a novel MoE framework that relaxes both constraints through dynamic token-level expert activation and layer-wise adaptive capacity allocation.
arXiv Detail & Related papers (2026-03-02T10:25:56Z) - ReXMoE: Reusing Experts with Minimal Overhead in Mixture-of-Experts [25.46805026086543]
We describe ReXMoE, a novel MoE architecture that improves routing beyond the existing layer-local approaches.<n>ReXMoE decouples expert dimensionality from per-layer budgets, enabling richer expert combinations without sacrificing individual expert capacity.
arXiv Detail & Related papers (2025-10-20T12:27:55Z) - LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts [24.0422448103907]
We propose LD-MoLE, a Learnable Dynamic routing mechanism for Mixture of LoRA Experts.<n>Our design allows the model to adaptively determine the number of experts to activate for each token at different layers.<n>Our method achieves superior performance, but also demonstrates the ability to learn token-dependent and layer-wise expert allocation.
arXiv Detail & Related papers (2025-09-30T02:38:10Z) - Breaking the MoE LLM Trilemma: Dynamic Expert Clustering with Structured Compression [14.086434595924716]
Mixture-of-Experts (MoE) Large Language Models (LLMs) face a trilemma of load imbalance, parameter redundancy, and communication overhead.<n>We introduce a unified framework based on dynamic expert clustering and structured compression to address these issues cohesively.
arXiv Detail & Related papers (2025-09-27T10:45:58Z) - A Sliding Layer Merging Method for Efficient Depth-Wise Pruning in LLMs [13.000188564679998]
This paper reveals the Patch-like'' feature relationship between layers in large language models by analyzing the correlation of the outputs of different layers in the reproducing kernel Hilbert space.<n>We propose a sliding layer merging method that dynamically selects and fuses consecutive layers from top to bottom according to a pre-defined similarity threshold.<n>Our method outperforms existing pruning techniques in both zero-shot inference performance and retraining recovery quality after pruning.
arXiv Detail & Related papers (2025-02-26T14:15:24Z) - LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.<n>Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.<n>We propose textbfLESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z) - Searching for Efficient Linear Layers over a Continuous Space of Structured Matrices [88.33936714942996]
We present a unifying framework that enables searching among all linear operators expressible via an Einstein summation.
We show that differences in the compute-optimal scaling laws are mostly governed by a small number of variables.
We find that Mixture-of-Experts (MoE) learns an MoE in every single linear layer of the model, including the projection in the attention blocks.
arXiv Detail & Related papers (2024-10-03T00:44:50Z) - DLO: Dynamic Layer Operation for Efficient Vertical Scaling of LLMs [46.443316184807145]
We introduce Dynamic Layer Operations (DLO), a novel approach for vertically scaling transformer-based Large Language Models (LLMs)
Unlike traditional Mixture-of-Experts (MoE) methods that focus on extending the model width, our approach targets model depth, addressing the redundancy observed across layer representations for various input samples.
Experimental results demonstrate that DLO not only outperforms the original unscaled models but also achieves comparable results to densely expanded models with significantly improved efficiency.
arXiv Detail & Related papers (2024-07-03T18:34:08Z) - Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training [73.90260246781435]
We present Lory, the first approach that scales such architectures to autoregressive language model pre-training.
We show significant performance gains over parameter-matched dense models on both perplexity and a variety of downstream tasks.
Despite segment-level routing, Lory models achieve competitive performance compared to state-of-the-art MoE models with token-level routing.
arXiv Detail & Related papers (2024-05-06T03:06:33Z) - Higher Layers Need More LoRA Experts [23.72297945365351]
We introduce a novel parameter-efficient MoE method, textittextbfMoE-LtextbfoRA with textbfLayer-wise Expert textbfAllocation (MoLA) for Transformer-based models.
Experiments on six well-known NLP and commonsense QA benchmarks demonstrate that MoLA achieves equal or superior performance compared to all baselines.
arXiv Detail & Related papers (2024-02-13T16:04:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.