InfiniteHBD: Building Datacenter-Scale High-Bandwidth Domain for LLM with Optical Circuit Switching Transceivers
- URL: http://arxiv.org/abs/2502.03885v6
- Date: Mon, 04 Aug 2025 02:36:49 GMT
- Title: InfiniteHBD: Building Datacenter-Scale High-Bandwidth Domain for LLM with Optical Circuit Switching Transceivers
- Authors: Chenchen Shou, Guyue Liu, Hao Nie, Huaiyu Meng, Yu Zhou, Yimin Jiang, Wenqing Lv, Yelong Xu, Yuanwei Lu, Zhang Chen, Yanbo Yu, Yichen Shen, Yibo Zhu, Daxin Jiang
- Abstract summary: High-Bandwidth Domains (HBDs) are critical for communication-intensive parallelism like Tensor Parallelism. Switch-centric HBDs incur prohibitive scaling costs, while GPU-centric HBDs suffer from severe fault propagation. We propose InfiniteHBD, a transceiver-centric HBD architecture that integrates connectivity and dynamic switching at the transceiver level.
- Score: 37.89954553921228
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scaling Large Language Model (LLM) training relies on multi-dimensional parallelism, where High-Bandwidth Domains (HBDs) are critical for communication-intensive parallelism like Tensor Parallelism. However, existing HBD architectures face fundamental limitations in scalability, cost, and fault resiliency: switch-centric HBDs (e.g., NVL-72) incur prohibitive scaling costs, while GPU-centric HBDs (e.g., TPUv3/Dojo) suffer from severe fault propagation. Switch-GPU hybrid HBDs (e.g., TPUv4) take a middle-ground approach, but the fault explosion radius remains large. We propose InfiniteHBD, a transceiver-centric HBD architecture that integrates connectivity and dynamic switching at the transceiver level by embedding Optical Circuit Switching (OCS) within each transceiver. It enables reconfigurable point-to-multipoint communication and scalable variable-size ring topologies. InfiniteHBD achieves datacenter-scale scalability without cost explosion, fault isolation at the node level, and full bandwidth utilization for healthy GPUs. Key innovations include a Silicon Photonic-based OCS transceiver (OCSTrx), a reconfigurable k-hop ring topology, and an HBD-DCN orchestration algorithm. The evaluation demonstrates that InfiniteHBD reduces cost to 31% of NVL-72, achieves a near-zero GPU waste ratio (over 10x lower than NVL-72 and TPUv4), maintains near-zero cross-ToR traffic under 7% node fault ratio, and improves Model FLOPs Utilization by 3.37x compared to NVIDIA DGX (8 GPUs/node).
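To make the k-hop ring idea concrete: each node's OCS transceivers can be reconfigured to reach neighbors up to k hops away along the physical layout, so a ring can re-form around failed nodes instead of breaking. Below is a minimal Python sketch of that healing property; the function names and the simplified fault model are ours, not the paper's orchestration algorithm.

```python
# Sketch: re-forming a ring over healthy nodes when each node's
# OCS transceivers can reach neighbors at most k physical hops away.
# Simplified illustration of InfiniteHBD's k-hop ring idea; the
# actual HBD-DCN orchestration algorithm is more involved.

def build_ring(num_nodes: int, faulty: set, k: int):
    """Return a ring (node order) over healthy nodes, or None if any
    gap between consecutive healthy nodes exceeds the k-hop reach."""
    healthy = [n for n in range(num_nodes) if n not in faulty]
    if len(healthy) < 2:
        return None
    for a, b in zip(healthy, healthy[1:] + healthy[:1]):
        # physical distance on the circular node layout
        gap = min((b - a) % num_nodes, (a - b) % num_nodes)
        if gap > k:
            return None  # fault cluster too wide to bypass
    return healthy

# 64 nodes, 3-hop reach: nodes 10 and 11 fail, the ring heals around them.
ring = build_ring(64, faulty={10, 11}, k=3)
assert ring is not None and 10 not in ring
print(f"ring size after faults: {len(ring)}")  # 62 healthy GPUs keep full use
```

In this toy model a reach of k hops lets the ring bypass clusters of up to k-1 adjacent failures while every healthy GPU keeps its full ring bandwidth, which is the node-level fault isolation the abstract claims.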
Related papers
- Fully-analog array signal processor using 3D aperture engineering [13.863335862091423]
We present a fully-analog array signal processor (FASP) using a 3D aperture engineering framework. FASP performs super-resolution direction-of-arrival estimation, source number estimation, and multi-channel source separation. Experiments further validate source number estimation and independent channel separation for 10 targets, suppressing radar jamming signals by 20 dB.
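For orientation, the class of computation such a processor performs can be written down digitally; the sketch below is a conventional MUSIC direction-of-arrival estimator on a uniform linear array, offered as a reference point rather than the paper's actual analog method.

```python
# Digital reference for super-resolution DOA estimation (MUSIC) on a
# uniform linear array; a stand-in for what the analog FASP computes.
import numpy as np

def music_spectrum(X, num_sources, angles_deg, d=0.5):
    """X: (num_antennas, num_snapshots) samples; d: spacing in wavelengths."""
    m = X.shape[0]
    R = X @ X.conj().T / X.shape[1]            # sample covariance
    _, vecs = np.linalg.eigh(R)                # eigenvalues ascending
    En = vecs[:, : m - num_sources]            # noise subspace
    theta = np.deg2rad(angles_deg)
    # steering matrix: one column per candidate angle
    A = np.exp(2j * np.pi * d * np.arange(m)[:, None] * np.sin(theta)[None, :])
    return 1.0 / np.linalg.norm(En.conj().T @ A, axis=0) ** 2

# Two sources at -20 and 35 degrees, 8 antennas, light noise.
rng = np.random.default_rng(0)
m, snaps = 8, 200
true = np.deg2rad([-20.0, 35.0])
A = np.exp(2j * np.pi * 0.5 * np.arange(m)[:, None] * np.sin(true)[None, :])
S = rng.standard_normal((2, snaps)) + 1j * rng.standard_normal((2, snaps))
N = rng.standard_normal((m, snaps)) + 1j * rng.standard_normal((m, snaps))
X = A @ S + 0.1 * N

grid = np.arange(-90.0, 90.0, 0.5)
spec = music_spectrum(X, 2, grid)
# pick local maxima, then the two strongest
locs = [i for i in range(1, len(grid) - 1) if spec[i - 1] < spec[i] > spec[i + 1]]
peaks = sorted(grid[i] for i in sorted(locs, key=lambda i: spec[i])[-2:])
print(peaks)  # ≈ [-20.0, 35.0]
```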
arXiv Detail & Related papers (2026-03-01T08:50:10Z)
- Accelerating Frontier MoE Training with 3D Integrated Optics [0.0]
3D-stacked optics and logic offer a transformative, power-efficient scale-up solution for connecting hundreds of GPU packages. We show that the substantial increases in bandwidth and radix enabled by 3D co-packaged optics (CPO) allow for an 8x increase in scale-up capability.
arXiv Detail & Related papers (2025-09-09T00:41:42Z)
- Protocol Models: Scaling Decentralized Training with Communication-Efficient Model Parallelism [59.79227116582264]
Scaling models has led to significant advancements in deep learning, but training these models in decentralized settings remains challenging. We propose a novel compression algorithm that compresses both forward and backward passes, enabling up to 99% compression with no convergence degradation.
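The summary does not spell out the algorithm, but a common way to reach ~99% compression without convergence loss is top-k sparsification with error feedback; here is a generic sketch of that family, not necessarily the paper's exact scheme.

```python
# Generic top-k sparsification with error feedback: transmit only the
# largest 1% of entries and carry the dropped residual into the next step.
import numpy as np

class TopKCompressor:
    def __init__(self, shape, ratio=0.01):
        self.residual = np.zeros(shape)    # error-feedback buffer
        self.k = max(1, int(ratio * np.prod(shape)))

    def compress(self, grad):
        g = grad + self.residual                        # add carried error
        idx = np.argpartition(np.abs(g).ravel(), -self.k)[-self.k:]
        sparse = np.zeros_like(g).ravel()
        sparse[idx] = g.ravel()[idx]                    # keep top-k by magnitude
        sparse = sparse.reshape(g.shape)
        self.residual = g - sparse                      # remember what was dropped
        return sparse                                   # ~99% zeros on the wire

comp = TopKCompressor((1000,), ratio=0.01)
g = np.random.default_rng(1).standard_normal(1000)
sent = comp.compress(g)
print((sent != 0).sum(), "of", g.size, "values transmitted")  # 10 of 1000
```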
arXiv Detail & Related papers (2025-06-02T02:19:22Z)
- PointODE: Lightweight Point Cloud Learning with Neural Ordinary Differential Equations on Edge [0.8403582577557918]
We introduce a parameter-efficient architecture for point cloud feature extraction based on a continuous stack of blocks with residual connections. PointODE achieves accuracy competitive with state-of-the-art models on both synthetic and real-world datasets.
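The residual-stack/ODE correspondence behind such architectures is standard: a residual block x ← x + f(x) is one forward-Euler step of dx/dt = f(x, t), so depth becomes integration steps that reuse the same parameters. A toy illustration (ours, not PointODE's actual architecture):

```python
# A residual stack read as forward-Euler integration of dx/dt = f(x, t):
# more "layers" = more integration steps with the SAME parameters,
# which is where the parameter efficiency comes from.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((64, 65)) * 0.1
W2 = rng.standard_normal((64, 64)) * 0.1

def f(x, t):
    """Shared dynamics: a small MLP over the state, conditioned on time t."""
    h = np.tanh(W1 @ np.append(x, t))
    return W2 @ h

def ode_block(x, steps=8, T=1.0):
    """Integrate dx/dt = f(x, t) from 0 to T with forward Euler."""
    h = T / steps
    for i in range(steps):
        x = x + h * f(x, i * h)   # one 'residual block' per step
    return x

features = ode_block(rng.standard_normal(64))
print(features.shape)  # (64,) — depth increased without new parameters
```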
arXiv Detail & Related papers (2025-05-31T07:34:54Z)
- Beyond Terabit/s Integrated Neuromorphic Photonic Processor for DSP-Free Optical Interconnects [1.9685853627153866]
Multi-scale AI training and inference demand uniform, ultra-low-latency, and energy-efficient links.
We present an integrated neuromorphic optical signal processor (OSP) that achieves DSP-free, all-optical, real-time processing.
This research provides a highly scalable, energy-efficient, and high-speed solution, paving the way for next-generation AI infrastructure.
arXiv Detail & Related papers (2025-04-21T11:56:36Z)
- Scalable Low-overhead Superconducting Non-local Coupler with Exponentially Enhanced Connectivity [9.54190299683856]
Quantum error correction codes with non-local connections incur lower overhead and outperform surface codes on large-scale devices.
We experimentally demonstrate a convenient centimeters-long on-chip coupler and propose an extra coupler layer to map the qubit array to a binary-tree connectivity graph.
With the scalable binary tree structure and high-fidelity non-local entanglement, novel quantum algorithms can be implemented on the superconducting qubit system.
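The scaling argument behind the binary-tree mapping is simple graph distance: a 1D qubit chain has worst-case distance N-1, while a balanced binary tree over the same array needs only O(log N) hops. A toy comparison (our illustration):

```python
# Worst-case hop count: linear chain vs. balanced binary tree over N qubits.
# The tree's O(log N) distance is what makes non-local coupling cheap.
import math

def chain_distance(n: int) -> int:
    return n - 1                          # end-to-end hops on a 1D chain

def tree_distance(n: int) -> int:
    # two leaves in different halves route via the root: 2 * depth
    return 2 * math.ceil(math.log2(n))

for n in (64, 1024, 16384):
    print(n, chain_distance(n), tree_distance(n))
# 64     63     12
# 1024   1023   20
# 16384  16383  28
```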
arXiv Detail & Related papers (2025-02-26T07:29:59Z)
- An Optical Interconnect for Modular Quantum Computers [0.44624755182670844]
Scaling up quantum computers requires an optical interconnect.
We propose a multi-group structure where the group switch routes photons emitted by computational end nodes.
We implement a prototype three-node switched interconnect and create two-hop entanglement with fidelities of at least 0.6.
arXiv Detail & Related papers (2024-12-12T14:16:50Z)
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Hyperdimensional Computing Empowered Federated Foundation Model over Wireless Networks for Metaverse [56.384390765357004]
We propose an integrated federated split learning and hyperdimensional computing framework for emerging foundation models.
This novel approach reduces communication costs, computation load, and privacy risks, making it suitable for resource-constrained edge devices in the Metaverse.
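The hyperdimensional-computing core is compact enough to sketch: features are bound to random high-dimensional bipolar vectors, samples are bundled into class prototypes, and classification is nearest-prototype by dot-product similarity. This minimal sketch shows only the HDC primitive, not the paper's federated split-learning framework:

```python
# Core hyperdimensional-computing operations: random bipolar hypervectors,
# binding (elementwise product), bundling (sign of sum), and
# nearest-prototype classification by dot-product similarity.
import numpy as np

D = 10_000                                                    # hypervector dim
rng = np.random.default_rng(0)
item = {f: rng.choice([-1, 1], D) for f in ("x", "y", "z")}   # feature keys
levels = {v: rng.choice([-1, 1], D) for v in range(10)}       # quantized values

def encode(sample: dict) -> np.ndarray:
    """Bind each feature key to its value vector, then bundle into one HV."""
    bound = [item[f] * levels[v] for f, v in sample.items()]
    return np.sign(np.sum(bound, axis=0))

# Build class prototypes by bundling encoded training samples.
train = {"A": [{"x": 1, "y": 2, "z": 3}], "B": [{"x": 7, "y": 8, "z": 9}]}
protos = {c: np.sign(sum(encode(s) for s in ss)) for c, ss in train.items()}

def classify(sample: dict) -> str:
    q = encode(sample)
    return max(protos, key=lambda c: protos[c] @ q)   # highest similarity wins

print(classify({"x": 1, "y": 2, "z": 4}))  # "A" — nearest prototype
```

Because training is just vector addition and inference is a dot product, the encoding is cheap enough for the resource-constrained edge devices the abstract targets.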
arXiv Detail & Related papers (2024-08-26T17:03:14Z)
- fVDB: A Deep-Learning Framework for Sparse, Large-Scale, and High-Performance Spatial Intelligence [50.417261057533786]
fVDB is a novel framework for deep learning on large-scale 3D data.
Our framework is fully integrated with PyTorch, enabling interoperability with existing pipelines.
arXiv Detail & Related papers (2024-07-01T20:20:33Z)
- BDC-Occ: Binarized Deep Convolution Unit For Binarized Occupancy Network [55.21288428359509]
Existing 3D occupancy networks demand significant hardware resources, hindering deployment on edge devices.
We propose a novel binarized deep convolution (BDC) unit that effectively enhances performance while increasing the number of binarized convolutional layers.
Our BDC-Occ model is created by applying the proposed BDC unit to binarize the existing 3D occupancy networks.
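Binarized convolution in this family typically means sign-quantized weights and activations trained with a straight-through estimator (STE); the sketch below shows that generic unit, not the BDC unit's specific design:

```python
# Generic binarized convolution with a straight-through estimator (STE):
# forward uses sign (+/-1) weights and activations, backward passes the
# gradient through as if sign were the identity. Not BDC's exact design.
import torch
import torch.nn as nn

def binarize(t: torch.Tensor) -> torch.Tensor:
    # (sign(t) - t).detach() + t  ==  sign(t) in forward, identity in backward
    return (torch.sign(t) - t).detach() + t

class BinConv2d(nn.Module):
    def __init__(self, cin, cout, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(cout, cin, k, k) * 0.1)
        self.bn = nn.BatchNorm2d(cout)  # BN restores the scale lost by sign()

    def forward(self, x):
        xb, wb = binarize(x), binarize(self.weight)
        return self.bn(nn.functional.conv2d(xb, wb, padding=1))

layer = BinConv2d(16, 32)
y = layer(torch.randn(2, 16, 8, 8))
y.sum().backward()                  # STE lets gradients reach self.weight
print(y.shape, layer.weight.grad is not None)  # torch.Size([2, 32, 8, 8]) True
```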
arXiv Detail & Related papers (2024-05-27T10:44:05Z)
- Marsellus: A Heterogeneous RISC-V AI-IoT End-Node SoC with 2-to-8b DNN Acceleration and 30%-Boost Adaptive Body Biasing [11.27712965055613]
Marsellus is an all-digital heterogeneous system-on-chip for AI-IoT end-nodes fabricated in GlobalFoundries 22nm FDX.
It achieves up to 180 Gop/s or 3.32 Top/s/W on 2-bit precision arithmetic in software, and up to 637 Gop/s or 12.4 Top/s/W on hardware-accelerated DNN layers.
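Those two operating points also pin down the power envelope, since power equals throughput divided by efficiency; a quick check of the quoted numbers (our arithmetic):

```python
# Power = throughput / efficiency, from the two quoted operating points.
sw_power = 180e9 / 3.32e12    # 180 Gop/s at 3.32 Top/s/W
hw_power = 637e9 / 12.4e12    # 637 Gop/s at 12.4 Top/s/W
print(f"{sw_power * 1e3:.1f} mW, {hw_power * 1e3:.1f} mW")  # 54.2 mW, 51.4 mW
```

Both points land near 50 mW, consistent with an IoT end-node power budget.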
arXiv Detail & Related papers (2023-05-15T07:48:50Z)
- Non-Coherent Over-the-Air Decentralized Gradient Descent [0.0]
Implementing Decentralized Gradient Descent (DGD) in wireless systems is challenging due to noise, fading, and limited bandwidth.
This paper introduces a scalable DGD algorithm that eliminates the need for scheduling, topology information, or CSI.
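As a reference for what the algorithm approximates, classical DGD mixes each node's iterate with its neighbors' and takes a local gradient step; the paper's contribution is realizing the mixing over the air without scheduling, topology information, or CSI. A noise-free sketch of the ideal iteration (ours):

```python
# Noise-free reference DGD: x_i <- sum_j W[i,j] x_j - lr * grad_i(x_i).
# The paper realizes the mixing step over the air without scheduling,
# topology knowledge, or CSI; this is the ideal it approximates.
import numpy as np

n, d, lr = 5, 3, 0.1
rng = np.random.default_rng(0)
targets = rng.standard_normal((n, d))       # node i minimizes ||x - t_i||^2

W = np.zeros((n, n))                        # doubly stochastic ring mixing
for i in range(n):
    W[i, i], W[i, (i - 1) % n], W[i, (i + 1) % n] = 0.5, 0.25, 0.25

x = np.zeros((n, d))
for _ in range(200):
    grads = 2 * (x - targets)               # local gradients
    x = W @ x - lr * grads                  # mix with neighbors, then step

print(np.allclose(x.mean(axis=0), targets.mean(axis=0), atol=1e-3))  # True:
# iterates hover around the network-average optimum; a diminishing step
# size would drive every node to it exactly.
```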
arXiv Detail & Related papers (2022-11-19T19:15:34Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially on Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete actions (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
Based on a latency- and accuracy-aware reward design, such a scheme adapts well to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC services.
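The reward design is the concrete piece one can sketch: the agent earns accuracy and pays for latency, so earlier exits and coarser compression win whenever the accuracy cost is small. A toy version of such a reward (our formulation, not the paper's exact one):

```python
# Toy latency/accuracy-aware reward for picking an exit point and a
# compression level in device-edge co-inference. Our formulation only.
def reward(accuracy: float, latency_ms: float,
           deadline_ms: float = 50.0, lam: float = 0.02) -> float:
    """Pay for accuracy, charge for latency; hard penalty past the deadline."""
    r = accuracy - lam * latency_ms
    return r if latency_ms <= deadline_ms else r - 1.0

# Earlier exit: a bit less accurate but much faster -> higher reward.
print(reward(accuracy=0.90, latency_ms=20.0))  # 0.50
print(reward(accuracy=0.95, latency_ms=60.0))  # -1.25 (misses the deadline)
```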
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference [82.1584439276834]
Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks.
We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization for multi-task NLP.
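One standard lever in this design space is sentence-level early exit: if an intermediate classifier is already confident (low softmax entropy), the remaining transformer layers are skipped and their energy saved. A schematic of that mechanism with dummy layers (the entropy-threshold formulation is the common one; all names here are ours):

```python
# Entropy-based early exit: after each transformer layer, a small
# classifier votes; if its softmax entropy is below a threshold, stop
# and save the energy of the remaining layers. Schematic only.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

def early_exit_inference(layers, classifiers, x, threshold=0.3):
    """layers/classifiers: per-layer callables; returns (probs, layers_used)."""
    for i, (layer, clf) in enumerate(zip(layers, classifiers), start=1):
        x = layer(x)
        p = softmax(clf(x))
        if entropy(p) < threshold:      # confident enough: exit now
            return p, i
    return p, len(layers)

# Dummy 12-layer stack: identity layers, heads that grow more confident.
rng = np.random.default_rng(0)
layers = [lambda h: h] * 12
classifiers = [(lambda i: (lambda h: np.array([2.0 + i, 0.0])))(i)
               for i in range(12)]
probs, used = early_exit_inference(layers, classifiers, rng.standard_normal(8))
print(used, "of 12 layers used")  # 2 of 12 layers used
```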
arXiv Detail & Related papers (2020-11-28T19:21:47Z)
- Adaptive Subcarrier, Parameter, and Power Allocation for Partitioned Edge Learning Over Broadband Channels [69.18343801164741]
Partitioned edge learning (PARTEL) implements parameter-server training, a well-known distributed learning method, in wireless networks.
We consider the case of deep neural network (DNN) models which can be trained using PARTEL by introducing some auxiliary variables.
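The PARTEL training loop itself is easy to state: the parameter vector is partitioned across devices, and each round every device updates only its own block against the current global model; the paper's contribution is the wireless subcarrier, parameter, and power allocation around this loop. A stripped-down sketch (ours):

```python
# Stripped-down PARTEL loop: the parameter vector is partitioned across
# devices; each round every device refreshes only its own block using
# the full current model broadcast by the server.
import numpy as np

d, num_devices, lr = 12, 3, 0.2
rng = np.random.default_rng(0)
target = rng.standard_normal(d)                  # global loss ||w - target||^2
blocks = np.array_split(np.arange(d), num_devices)

w = np.zeros(d)                                  # server's global model
for _ in range(100):                             # communication rounds
    for idx in blocks:                           # devices work in parallel
        grad_block = 2 * (w[idx] - target[idx])  # gradient w.r.t. own block
        w[idx] -= lr * grad_block                # upload: block update only

print(np.allclose(w, target, atol=1e-6))         # True — converges blockwise
```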
arXiv Detail & Related papers (2020-10-08T15:27:50Z)
- LCP: A Low-Communication Parallelization Method for Fast Neural Network Inference in Image Recognition [33.581285906182075]
We propose a low-communication parallelization (LCP) method in which models consist of several almost-independent and narrow branches.
We deploy LCP models on three distributed systems: AWS instances, Raspberry Pis, and PYNQ boards.
LCP models achieve maximum and average speedups of 56x and 7x, respectively, compared to the original models, and the average speedup can be pushed as high as 33x.
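Structurally, LCP replaces one wide model with several narrow, almost-independent branches that communicate only when their outputs are merged, which is why it maps onto heterogeneous devices so easily. A shape-level sketch (ours):

```python
# Shape-level sketch of low-communication parallelization: several
# narrow, independent branches, one per device, merged only at the output.
import numpy as np

rng = np.random.default_rng(0)

def make_branch(in_dim=32, hidden=8, out_dim=10):
    """A narrow branch: tiny hidden width keeps per-device compute low."""
    W1 = rng.standard_normal((hidden, in_dim))
    W2 = rng.standard_normal((out_dim, hidden))
    return lambda x: W2 @ np.tanh(W1 @ x)

branches = [make_branch() for _ in range(4)]    # one per device

def predict(x):
    # Each branch runs on its own device with zero cross-talk; the only
    # communication is this final merge of four small output vectors.
    return np.mean([b(x) for b in branches], axis=0)

print(predict(rng.standard_normal(32)).shape)   # (10,)
```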
arXiv Detail & Related papers (2020-03-13T19:52:44Z)