Related papers: Parallel Track Transformers: Enabling Fast GPU Inference with Reduced Synchronization

URL: http://arxiv.org/abs/2602.07306v1
Date: Sat, 07 Feb 2026 01:42:20 GMT
Title: Parallel Track Transformers: Enabling Fast GPU Inference with Reduced Synchronization
Authors: Chong Wang, Nan Du, Tom Gunter, Tao Lei, Kulin Seth, Senyu Tong, Jianyu Wang, Guoli Yin, Xiyou Zhou, Kelvin Zou, Ruoming Pang
Abstract summary: Parallel Track (PT) Transformer is a novel architectural paradigm that restructures computation to minimize cross-device dependencies. We report consistent improvements in serving efficiency, including up to 15-30% reduced time to first token, 2-12% reduced time per output token, and up to 31.90% increased throughput in both settings.
Score: 19.97521786735984
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Efficient large-scale inference of transformer-based large language models (LLMs) remains a fundamental systems challenge, frequently requiring multi-GPU parallelism to meet stringent latency and throughput targets. Conventional tensor parallelism decomposes matrix operations across devices but introduces substantial inter-GPU synchronization, leading to communication bottlenecks and degraded scalability. We propose the Parallel Track (PT) Transformer, a novel architectural paradigm that restructures computation to minimize cross-device dependencies. PT achieves up to a 16x reduction in synchronization operations relative to standard tensor parallelism, while maintaining competitive model quality in our experiments. We integrate PT into two widely adopted LLM serving stacks, TensorRT-LLM and vLLM, and report consistent improvements in serving efficiency, including up to 15-30% reduced time to first token, 2-12% reduced time per output token, and up to 31.90% increased throughput in both settings.

Related papers

Scaling State-Space Models on Multiple GPUs with Tensor Parallelism [0.24148976266903474]
Selective state space models (SSMs) have rapidly become a compelling backbone for large language models. But in deployment, their inference performance is often bounded by the memory capacity, bandwidth, and latency limits of a single GPU. This paper presents a communication-efficient TP design for selective SSM inference that addresses three practical engineering challenges.
arXiv Detail & Related papers (2026-02-24T17:47:54Z)
AsyncMesh: Fully Asynchronous Optimization for Data and Pipeline Parallelism [54.8494905524997]
We introduce asynchronous updates across both parallelism axes, relaxing the co-location requirement. We provide convergence guarantees for both sparse averaging and asynchronous updates. Experiments on large-scale language models demonstrate that our approach matches the performance of the fully synchronous baseline.
arXiv Detail & Related papers (2026-01-30T01:24:47Z)
DeepCoT: Deep Continual Transformers for Real-Time Inference on Data Streams [63.27233749591346]
Transformer-based models have dramatically increased their size and parameter count to tackle increasingly complex tasks. Stream data inference is typically performed over a sliding temporal window, leading to highly redundant computations. We propose the Deep Continual Transformer (DeepCoT), a redundancy-free encoder-only model that can be applied over existing deep encoder architectures with minimal changes.
arXiv Detail & Related papers (2025-11-21T16:15:43Z)
Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution. We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z)
PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. We propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
arXiv Detail & Related papers (2025-09-27T03:01:48Z)
CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks [57.95170323315603]
We introduce CollaPipe, a distributed learning framework that integrates collaborative pipeline parallelism with federated aggregation to support self-evolving networks. In CollaPipe, the encoder part is adaptively partitioned into variable-sized segments and deployed across mobile devices for pipeline-parallel training, while the decoder is deployed on edge servers to handle generative tasks. To enhance training efficiency, we formulate a joint optimization problem that adaptively allocates model segments, micro-batches, bandwidth, and transmission power.
arXiv Detail & Related papers (2025-09-24T07:54:01Z)
ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs [34.477777651648914]
Large language models (LLMs) pose significant inference latency challenges due to their autoregressive decoding paradigm. We propose Adaptive Serial-Parallel Decoding (ASPD), which addresses two core challenges: automated construction of parallelizable data and an efficient parallel decoding mechanism. Our framework sets a groundbreaking benchmark for efficient LLM parallel inference, paving the way for its deployment in latency-sensitive applications such as AI-powered customer service bots and answer retrieval engines.
arXiv Detail & Related papers (2025-08-12T12:35:55Z)
Communication-Efficient Multi-Device Inference Acceleration for Transformer Models [19.938589623698338]
Transformer models power many AI applications but suffer from high inference latency, limiting their use in real-time settings. We propose ASTRA, a communication-efficient framework that accelerates Transformer inference through a novel integration of sequence parallelism and a Mixed-Precision Attention mechanism designed to minimize inter-device communication. ASTRA achieves up to 2.64X speedups over single-device inference and up to 15.25X speedups over state-of-the-art multi-device inferences, while operating under bandwidths as low as 10 Mbps.
arXiv Detail & Related papers (2025-05-25T22:16:59Z)
Ladder-residual: parallelism-aware architecture for accelerating large model inference with communication overlapping [36.71999572939612]
We introduce Ladder Residual, a simple architectural modification applicable to all residual-based models. Applying Ladder Residual to all its layers can achieve 29% end-to-end wall clock speed up at inference time with TP sharding over 8 devices. We train a 1B and 3B Ladder Transformer from scratch and observe comparable performance to a standard dense transformer baseline.
arXiv Detail & Related papers (2025-01-11T17:06:30Z)
Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference [8.527031391688283]
Kraken is an evolution of the standard Transformer architecture for efficient inference on multi-device systems. When trained on OpenWebText, Kraken models reach a similar perplexity as standard Transformers. When tested on the SuperGLUE benchmark, Kraken speeds up Time To First Token by a mean of 35.6% across a range of model sizes.
arXiv Detail & Related papers (2024-08-14T20:24:03Z)
Gated Linear Attention Transformers with Hardware-Efficient Training [60.670102007737476]
This work describes a hardware-efficient algorithm for linear attention that trades off memory movement against parallelizability. We then generalize this algorithm to a more expressive variant of linear attention with data-dependent gates. When used as a replacement for the standard attention layer in Transformers, the resulting gated linear attention Transformer is found to perform competitively.
arXiv Detail & Related papers (2023-12-11T18:51:59Z)
EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference [82.1584439276834]
Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks. We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization for multi-task NLP.
arXiv Detail & Related papers (2020-11-28T19:21:47Z)

This list is automatically generated from the titles and abstracts of the papers on this site.