Fugu-MT Paper Translation (Abstract): TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training

Paper abstract: TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training

arxiv url: http://arxiv.org/abs/2604.09107v1
Date: Fri, 10 Apr 2026 08:40:56 GMT
Status: Translation complete
Last updated in system: 2026-04-13 17:57:53.780938
Title: TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training
Authors: Chenhao Ye, Huaizheng Zhang, Mingcong Han, Baoquan Zhong, Xiang Li, Qixiang Chen, Xinyi Zhang, Weidong Zhang, Kaihua Jiang, Wang Zhang, He Sun, Wencong Xiao, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
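Abstract summary: We propose Reference-Oriented Storage (ROS), a new storage abstraction for RL weight transfer. ROS presents the illusion that certain versions of the model weights are stored and can be fetched on demand. We build TensorHub, a production-quality system with strong consistency and fault tolerance.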
Reference score (independently computed attention score): 15.115281691940721
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Modern LLM reinforcement learning (RL) workloads require a highly efficient weight transfer system to scale training across heterogeneous computational resources. However, existing weight transfer approaches either fail to provide flexibility for dynamically scaling clusters or incur fundamental data movement overhead, resulting in poor performance. We introduce Reference-Oriented Storage (ROS), a new storage abstraction for RL weight transfer that exploits the highly replicated model weights in place. ROS presents the illusion that certain versions of the model weights are stored and can be fetched on demand. Underneath, ROS does not physically store any copies of the weights; instead, it tracks the workers that hold these weights on GPUs for inference. Upon request, ROS directly uses them to serve reads. We build TensorHub, a production-quality system that extends the ROS idea with topology-optimized transfer, strong consistency, and fault tolerance. Evaluation shows that TensorHub fully saturates RDMA bandwidth and adapts to three distinct rollout workloads with minimal engineering effort. Specifically, TensorHub reduces total GPU stall time by up to 6.7x for standalone rollouts, accelerates weight update for elastic rollout by 4.8x, and cuts cross-datacenter rollout stall time by 19x. TensorHub has been deployed in production to support cutting-edge RL training.
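
The ROS idea described in the abstract (track which workers already hold a weight version, then serve reads from them instead of from a stored copy) can be illustrated with a toy in-memory sketch. This is not TensorHub's API: `Worker`, `ReferenceStore`, `publish`, `retire`, and `fetch` are hypothetical names chosen for illustration, plain Python lists stand in for GPU tensors, and the topology-aware source selection, consistency, and fault tolerance mentioned in the abstract are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical inference worker that keeps one weight version in memory."""
    name: str
    version: int
    weights: dict[str, list[float]]  # stand-in for tensors resident on a GPU


@dataclass
class ReferenceStore:
    """Toy reference-oriented store: maps a version to the workers holding it."""
    holders: dict[int, list[Worker]] = field(default_factory=dict)

    def publish(self, worker: Worker) -> None:
        # Record a reference to the worker; no weight data is copied here.
        self.holders.setdefault(worker.version, []).append(worker)

    def retire(self, worker: Worker) -> None:
        # Drop the reference, e.g. when the worker leaves or changes version.
        remaining = [w for w in self.holders.get(worker.version, []) if w is not worker]
        if remaining:
            self.holders[worker.version] = remaining
        else:
            self.holders.pop(worker.version, None)

    def fetch(self, version: int) -> dict[str, list[float]]:
        # Serve the read directly from a worker that currently holds the version.
        workers = self.holders.get(version)
        if not workers:
            raise KeyError(f"no live holder for version {version}")
        source = workers[0]  # assumption: a real system would choose by topology and load
        return {name: list(values) for name, values in source.weights.items()}


# Usage: a newly added rollout worker pulls version 7 from an existing holder.
store = ReferenceStore()
existing = Worker("rollout-0", version=7, weights={"layer0": [0.1, 0.2]})
store.publish(existing)
pulled = store.fetch(7)
```

In this sketch the store itself never owns weight data, so a version stays readable only while at least one worker that holds it remains registered.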

Related papers