Related papers: Balancing Computation Load and Representation Expressivity in Parallel Hybrid Neural Networks

Balancing Computation Load and Representation Expressivity in Parallel Hybrid Neural Networks

URL: http://arxiv.org/abs/2505.19472v2
Date: Wed, 28 May 2025 11:06:17 GMT
Title: Balancing Computation Load and Representation Expressivity in Parallel Hybrid Neural Networks
Authors: Mohammad Mahdi Moradi, Walid Ahmed, Shuangyue Wen, Sudhir Mudur, Weiwei Zhang, Yang Liu,
Abstract summary: FlowHN is a novel parallel hybrid network architecture that accommodates various strategies for load balancing.<n>Two innovative differentiating factors in FlowHN include a FLOP aware dynamic token split between the attention and SSM branches.
Score: 5.877451898618022
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Attention and State-Space Models (SSMs) when combined in a hybrid network in sequence or in parallel provide complementary strengths. In a hybrid sequential pipeline they alternate between applying a transformer to the input and then feeding its output into a SSM. This results in idle periods in the individual components increasing end-to-end latency and lowering throughput caps. In the parallel hybrid architecture, the transformer operates independently in parallel with the SSM, and these pairs are cascaded, with output from one pair forming the input to the next. Two issues are (i) creating an expressive knowledge representation with the inherently divergent outputs from these separate branches, and (ii) load balancing the computation between these parallel branches, while maintaining representation fidelity. In this work we present FlowHN, a novel parallel hybrid network architecture that accommodates various strategies for load balancing, achieved through appropriate distribution of input tokens between the two branches. Two innovative differentiating factors in FlowHN include a FLOP aware dynamic token split between the attention and SSM branches yielding efficient balance in compute load, and secondly, a method to fuse the highly divergent outputs from individual branches for enhancing representation expressivity. Together they enable much better token processing speeds, avoid bottlenecks, and at the same time yield significantly improved accuracy as compared to other competing works. We conduct comprehensive experiments on autoregressive language modeling for models with 135M, 350M, and 1B parameters. FlowHN outperforms sequential hybrid models and its parallel counterpart, achieving up to 4* higher Tokens per Second (TPS) and 2* better Model FLOPs Utilization (MFU).

Related papers

Auto-Regressive Masked Diffusion Models [9.239507801466322]
Masked diffusion models (MDMs) have emerged as a promising approach for language modeling.<n>They face a performance gap compared to autoregressive models (ARMs) and require more training iterations.<n>We present the Auto-Regressive Masked Diffusion model, which unifies the training efficiency of autoregressive models with the parallel generation capabilities of diffusion-based models.
arXiv Detail & Related papers (2026-01-23T18:42:30Z)
WAM-Flow: Parallel Coarse-to-Fine Motion Planning via Discrete Flow Matching for Autonomous Driving [9.719456684859606]
We introduce WAM-Flow, a vision--action (VLA) model that casts ego-trajectory planning as discrete flow matching over a structured token space.<n>WAM-Flow performs fully parallel, bidirectional denoising, enabling coarse-to-fine refinement with a tunable compute-accuracy trade-off.<n>These results establish discrete flow matching as a new promising paradigm for end-to-end autonomous driving.
arXiv Detail & Related papers (2025-12-05T19:36:46Z)
DRL: Discriminative Representation Learning with Parallel Adapters for Class Incremental Learning [63.65467569295623]
We propose the Discriminative Representation Learning (DRL) framework to specifically address these challenges.<n>To conduct incremental learning effectively and yet efficiently, the DRL's network is built upon a PTM.<n>Our DRL consistently outperforms other state-of-the-art methods throughout the entire CIL period.
arXiv Detail & Related papers (2025-10-14T03:19:15Z)
CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks [57.95170323315603]
We introduce CollaPipe, a distributed learning framework that integrates collaborative pipeline parallelism with federated aggregation to support self-evolving networks.<n>In CollaPipe, the encoder part is adaptively partitioned into variable-sized segments and deployed across mobile devices for pipeline-parallel training, while the decoder is deployed on edge servers to handle generative tasks.<n>To enhance training efficiency, we formulate a joint optimization problem that adaptively allocates model segments, micro-batches, bandwidth, and transmission power.
arXiv Detail & Related papers (2025-09-24T07:54:01Z)
ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs [34.477777651648914]
Large language models (LLMs) pose significant inference latency challenges due to their autoregressive decoding paradigm.<n>We propose an Adaptive Serial-Parallel Decoding (ASPD) which addresses two core challenges: automated construction of parallelizable data and efficient parallel decoding mechanism.<n>Our framework sets a groundbreaking benchmark for efficient LLM parallel inference, paving the way for its deployment in latency-sensitive applications such as AI-powered customer service bots and answer retrieval engines.
arXiv Detail & Related papers (2025-08-12T12:35:55Z)
Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation [20.117825519637357]
We introduce Multiverse, a new generative model that enables natively parallel generation.<n>Next, we build a real-world Multiverse reasoning model with co-design curation of data, algorithm, and system.<n>For data creation, we develop Multiverse Curator, an automated LLM-assisted pipeline.<n>We also implement Multiverse Engine to support parallel inference.
arXiv Detail & Related papers (2025-06-11T17:59:23Z)
Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs)<n>It addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs.<n>It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
arXiv Detail & Related papers (2025-05-28T14:03:02Z)
Systems and Algorithms for Convolutional Multi-Hybrid Language Models at Scale [68.6602625868888]
We introduce convolutional multi-hybrid architectures, with a design grounded on two simple observations.<n>Operators in hybrid models can be tailored to token manipulation tasks such as in-context recall, multi-token recall, and compression.<n>We train end-to-end 1.2 to 2.9 times faster than optimized Transformers, and 1.1 to 1.4 times faster than previous generation hybrids.
arXiv Detail & Related papers (2025-02-25T19:47:20Z)
Recurrent Stochastic Configuration Networks with Incremental Blocks [0.0]
Recurrent configuration networks (RSCNs) have shown promise in modelling nonlinear dynamic systems with order uncertainty. This paper develops the original RSCNs with block increments, termed block RSCNs (BRSCNs) BRSCNs can simultaneously add multiple reservoir nodes (subreservoirs) during the construction.
arXiv Detail & Related papers (2024-11-18T05:58:47Z)
EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses the existing parallelism schemes.<n>Our results demonstrate at most 52.4% improvement in prefill throughput compared to existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
Asynchronous Multi-Model Dynamic Federated Learning over Wireless Networks: Theory, Modeling, and Optimization [20.741776617129208]
Federated learning (FL) has emerged as a key technique for distributed machine learning (ML) We first formulate rectangular scheduling steps and functions to capture the impact of system parameters on learning performance. Our analysis sheds light on the joint impact of device training variables and asynchronous scheduling decisions.
arXiv Detail & Related papers (2023-05-22T21:39:38Z)
AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation [80.33846577924363]
We present All-Pairs Multi-Field Transforms (AMT), a new network architecture for video framegithub. It is based on two essential designs. First, we build bidirectional volumes for all pairs of pixels, and use the predicted bilateral flows to retrieve correlations. Second, we derive multiple groups of fine-grained flow fields from one pair of updated coarse flows for performing backward warping on the input frames separately.
arXiv Detail & Related papers (2023-04-19T16:18:47Z)
Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition [62.83832841523525]
We propose a fast and accurate parallel transformer, termed Paraformer. It accurately predicts the number of output tokens and extract hidden variables. It can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
arXiv Detail & Related papers (2022-06-16T17:24:14Z)
Pathways: Asynchronous Distributed Dataflow for ML [24.940220376358457]
We present the design of a new large scale orchestration layer for accelerators. Our system, Pathways, is explicitly designed to enable exploration of new systems and ML research ideas.
arXiv Detail & Related papers (2022-03-23T16:50:53Z)
CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the advantages of leveraging detailed spatial information from CNN and the global context provided by transformer for enhanced representation learning. The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery. The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.