Fugu-MT paper translation (abstract): Morphlux: Programmable chip-to-chip photonic fabrics in multi-accelerator servers for ML

Paper overview: Morphlux: Programmable chip-to-chip photonic fabrics in multi-accelerator servers for ML

arxiv url: http://arxiv.org/abs/2508.03674v1
Date: Sun, 20 Jul 2025 12:40:21 GMT
Status: Translation complete
Last updated in system: 2025-08-10 09:30:49.321596
Title: Morphlux: Programmable chip-to-chip photonic fabrics in multi-accelerator servers for ML
Title (reference translation): Morphlux: Programmable chip-to-chip photonic fabrics in multi-accelerator servers for ML
Authors: Abhishek Vijaya Kumar, Eric Ding, Arjun Devraj, Rachee Singh
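Abstract summary: We develop Morphlux, a server-scale programmable photonic fabric for interconnecting accelerators within servers. Morphlux can improve the bandwidth of tenant compute allocations by up to 66% and reduce compute fragmentation by up to 70%. By rapidly programming the server-scale fabric in a hardware testbed, Morphlux can replace a failed accelerator chip in 1.2 seconds.
Reference score (independently computed attention score): 2.281165524297844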
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We optically interconnect accelerator chips (e.g., GPUs, TPUs) within compute servers using newly viable programmable chip-to-chip photonic fabrics. In contrast, today, commercial multi-accelerator compute servers that are workhorses of ML, use electrical interconnects to network accelerator chips in the server. However, recent trends have shown an interconnect bandwidth wall caused by accelerator FLOPS scaling at a faster rate than the bandwidth of the interconnect between accelerators in the same server. This has led to under-utilization and idling of GPU resources in cloud datacenters. We develop Morphlux, a server-scale programmable photonic fabric, to interconnect accelerators within servers. We show that augmenting state-of-the-art photonic ML-centric datacenters with Morphlux can improve the bandwidth of tenant compute allocations by up to 66% and reduce compute fragmentation by up to 70%. We develop a novel end-to-end hardware prototype of Morphlux to demonstrate these performance benefits, which translate to 1.72x improvement in training throughput of ML models. By rapidly programming the server-scale fabric in our hardware testbed, Morphlux can logically replace a failed accelerator chip in 1.2 seconds.
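Abstract (reference translation): We optically interconnect accelerator chips (e.g., GPUs, TPUs) within compute servers using newly viable programmable chip-to-chip photonic fabrics. By contrast, the commercial multi-accelerator compute servers that are the workhorses of ML today use electrical interconnects to network the accelerator chips in a server. Recent trends, however, show an interconnect bandwidth wall: accelerator FLOPS are scaling faster than the bandwidth of the interconnect between accelerators in the same server. This has led to under-utilization and idling of GPU resources in cloud datacenters. We develop Morphlux, a server-scale programmable photonic fabric that interconnects accelerators within servers. We show that augmenting state-of-the-art photonic ML-centric datacenters with Morphlux can improve the bandwidth of tenant compute allocations by up to 66% and reduce compute fragmentation by up to 70%. We develop a novel end-to-end hardware prototype of Morphlux to demonstrate these benefits, which translate to a 1.72x improvement in the training throughput of ML models. By rapidly programming the server-scale fabric in our hardware testbed, Morphlux can logically replace a failed accelerator chip in 1.2 seconds.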

Related papers