Balancing Computation Load and Representation Expressivity in Parallel Hybrid Neural Networks
- URL: http://arxiv.org/abs/2505.19472v2
- Date: Wed, 28 May 2025 11:06:17 GMT
- Title: Balancing Computation Load and Representation Expressivity in Parallel Hybrid Neural Networks
- Authors: Mohammad Mahdi Moradi, Walid Ahmed, Shuangyue Wen, Sudhir Mudur, Weiwei Zhang, Yang Liu,
- Abstract summary: FlowHN is a novel parallel hybrid network architecture that accommodates various strategies for load balancing.<n>Two innovative differentiating factors in FlowHN include a FLOP aware dynamic token split between the attention and SSM branches.
- Score: 5.877451898618022
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention and State-Space Models (SSMs) when combined in a hybrid network in sequence or in parallel provide complementary strengths. In a hybrid sequential pipeline they alternate between applying a transformer to the input and then feeding its output into a SSM. This results in idle periods in the individual components increasing end-to-end latency and lowering throughput caps. In the parallel hybrid architecture, the transformer operates independently in parallel with the SSM, and these pairs are cascaded, with output from one pair forming the input to the next. Two issues are (i) creating an expressive knowledge representation with the inherently divergent outputs from these separate branches, and (ii) load balancing the computation between these parallel branches, while maintaining representation fidelity. In this work we present FlowHN, a novel parallel hybrid network architecture that accommodates various strategies for load balancing, achieved through appropriate distribution of input tokens between the two branches. Two innovative differentiating factors in FlowHN include a FLOP aware dynamic token split between the attention and SSM branches yielding efficient balance in compute load, and secondly, a method to fuse the highly divergent outputs from individual branches for enhancing representation expressivity. Together they enable much better token processing speeds, avoid bottlenecks, and at the same time yield significantly improved accuracy as compared to other competing works. We conduct comprehensive experiments on autoregressive language modeling for models with 135M, 350M, and 1B parameters. FlowHN outperforms sequential hybrid models and its parallel counterpart, achieving up to 4* higher Tokens per Second (TPS) and 2* better Model FLOPs Utilization (MFU).
Related papers
- Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation [20.117825519637357]
We introduce Multiverse, a new generative model that enables natively parallel generation.<n>Next, we build a real-world Multiverse reasoning model with co-design curation of data, algorithm, and system.<n>For data creation, we develop Multiverse Curator, an automated LLM-assisted pipeline.<n>We also implement Multiverse Engine to support parallel inference.
arXiv Detail & Related papers (2025-06-11T17:59:23Z) - Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs)<n>It addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs.<n>It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
arXiv Detail & Related papers (2025-05-28T14:03:02Z) - Systems and Algorithms for Convolutional Multi-Hybrid Language Models at Scale [68.6602625868888]
We introduce convolutional multi-hybrid architectures, with a design grounded on two simple observations.<n>Operators in hybrid models can be tailored to token manipulation tasks such as in-context recall, multi-token recall, and compression.<n>We train end-to-end 1.2 to 2.9 times faster than optimized Transformers, and 1.1 to 1.4 times faster than previous generation hybrids.
arXiv Detail & Related papers (2025-02-25T19:47:20Z) - Recurrent Stochastic Configuration Networks with Incremental Blocks [0.0]
Recurrent configuration networks (RSCNs) have shown promise in modelling nonlinear dynamic systems with order uncertainty.
This paper develops the original RSCNs with block increments, termed block RSCNs (BRSCNs)
BRSCNs can simultaneously add multiple reservoir nodes (subreservoirs) during the construction.
arXiv Detail & Related papers (2024-11-18T05:58:47Z) - EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses the existing parallelism schemes.<n>Our results demonstrate at most 52.4% improvement in prefill throughput compared to existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z) - Asynchronous Multi-Model Dynamic Federated Learning over Wireless
Networks: Theory, Modeling, and Optimization [20.741776617129208]
Federated learning (FL) has emerged as a key technique for distributed machine learning (ML)
We first formulate rectangular scheduling steps and functions to capture the impact of system parameters on learning performance.
Our analysis sheds light on the joint impact of device training variables and asynchronous scheduling decisions.
arXiv Detail & Related papers (2023-05-22T21:39:38Z) - AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation [80.33846577924363]
We present All-Pairs Multi-Field Transforms (AMT), a new network architecture for video framegithub.
It is based on two essential designs. First, we build bidirectional volumes for all pairs of pixels, and use the predicted bilateral flows to retrieve correlations.
Second, we derive multiple groups of fine-grained flow fields from one pair of updated coarse flows for performing backward warping on the input frames separately.
arXiv Detail & Related papers (2023-04-19T16:18:47Z) - Paraformer: Fast and Accurate Parallel Transformer for
Non-autoregressive End-to-End Speech Recognition [62.83832841523525]
We propose a fast and accurate parallel transformer, termed Paraformer.
It accurately predicts the number of output tokens and extract hidden variables.
It can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
arXiv Detail & Related papers (2022-06-16T17:24:14Z) - Pathways: Asynchronous Distributed Dataflow for ML [24.940220376358457]
We present the design of a new large scale orchestration layer for accelerators.
Our system, Pathways, is explicitly designed to enable exploration of new systems and ML research ideas.
arXiv Detail & Related papers (2022-03-23T16:50:53Z) - CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the advantages of leveraging detailed spatial information from CNN and the global context provided by transformer for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.