Fugu-MT Paper Translation (Abstract): AP-DRL: A Synergistic Algorithm-Hardware Framework for Automatic Task Partitioning of Deep Reinforcement Learning on Versal ACAP

Paper overview: AP-DRL: A Synergistic Algorithm-Hardware Framework for Automatic Task Partitioning of Deep Reinforcement Learning on Versal ACAP

arxiv url: http://arxiv.org/abs/2603.29369v1
Date: Tue, 31 Mar 2026 07:41:23 GMT
Status: Translation complete
Last updated in system: 2026-04-01 15:25:03.259534
Title: AP-DRL: A Synergistic Algorithm-Hardware Framework for Automatic Task Partitioning of Deep Reinforcement Learning on Versal ACAP
Title (reference translation): AP-DRL: A Synergistic Algorithm-Hardware Framework for Automatic Task Partitioning of Deep Reinforcement Learning on Versal ACAP
Authors: Enlai Li, Zhe Lin, Sharad Sinha, Wei Zhang,
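Abstract summary: AP-DRL is an automatic task partitioning framework that harnesses the heterogeneous architecture of AMD Versal ACAP. For the quantization challenge, AP-DRL employs a hardware-aware algorithm that coordinates FP32 (CPU), FP16 (FPGA/DSP), and BF16 (AI Engine) operations by leveraging Versal ACAP's native support for these precision formats. Comprehensive experiments indicate that AP-DRL can achieve speedups of up to 4.17$\times$ over programmable logic and up to 3.82$\times$ over AI Engine baselines.
Reference score (independently computed attention score): 12.174718779457828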
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deep reinforcement learning has demonstrated remarkable success across various domains. However, the tight coupling between training and inference processes makes accelerating DRL training an essential challenge for DRL optimization. Two key issues hinder efficient DRL training: (1) the significant variation in computational intensity across different DRL algorithms and even among operations within the same algorithm complicates hardware platform selection, while (2) DRL's wide dynamic range could lead to substantial reward errors with conventional FP16+FP32 mixed-precision quantization. While existing work has primarily focused on accelerating DRL for specific computing units or optimizing inference-stage quantization, we propose AP-DRL to address the above challenges. AP-DRL is an automatic task partitioning framework that harnesses the heterogeneous architecture of AMD Versal ACAP (integrating CPUs, FPGAs, and AI Engines) to accelerate DRL training through intelligent hardware-aware optimization. Our approach begins with bottleneck analysis of CPU, FPGA, and AIE performance across diverse DRL workloads, informing the design principles for AP-DRL's inter-component task partitioning and quantization optimization. The framework then addresses the challenge of platform selection through design space exploration-based profiling and ILP-based partitioning models that match operations to optimal computing units based on their computational characteristics. For the quantization challenge, AP-DRL employs a hardware-aware algorithm coordinating FP32 (CPU), FP16 (FPGA/DSP), and BF16 (AI Engine) operations by leveraging Versal ACAP's native support for these precision formats. Comprehensive experiments indicate that AP-DRL can achieve speedup of up to 4.17$\times$ over programmable logic and up to 3.82$\times$ over AI Engine baselines while maintaining training convergence.
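The partitioning step described in the abstract matches each operation to a computing unit based on profiled characteristics. The abstract does not give the ILP model itself, so the following is only a minimal sketch of the underlying assignment problem, solved by exhaustive search as a stand-in for an ILP solver. The operation names, latency figures, transfer cost, and the function name partition are all hypothetical and chosen purely for illustration; they are not measurements from the paper.

```python
from itertools import product

# Hypothetical per-operation latencies in ms on each unit (illustrative only,
# not taken from the paper). PL = programmable logic, AIE = AI Engine.
LATENCY = {
    "env_step":   {"CPU": 1.0,  "PL": 4.0, "AIE": 5.0},
    "actor_fwd":  {"CPU": 6.0,  "PL": 2.0, "AIE": 1.5},
    "critic_fwd": {"CPU": 6.0,  "PL": 2.5, "AIE": 1.5},
    "backward":   {"CPU": 12.0, "PL": 5.0, "AIE": 3.0},
    "optimizer":  {"CPU": 2.0,  "PL": 1.5, "AIE": 3.0},
}
TRANSFER = 0.5  # hypothetical cost in ms when consecutive ops run on different units


def partition(latency, transfer):
    """Return (cost, assignment) minimizing compute plus transfer cost.

    Each operation is assigned to exactly one unit, which is the same
    constraint an ILP would express with binary variables x[op, unit].
    Operations are treated as a simple sequential chain.
    """
    ops = list(latency)
    units = ["CPU", "PL", "AIE"]
    best = None
    for assign in product(units, repeat=len(ops)):
        cost = sum(latency[op][unit] for op, unit in zip(ops, assign))
        # Charge a transfer whenever two neighbouring ops change unit.
        cost += transfer * sum(a != b for a, b in zip(assign, assign[1:]))
        if best is None or cost < best[0]:
            best = (cost, dict(zip(ops, assign)))
    return best


if __name__ == "__main__":
    cost, assignment = partition(LATENCY, TRANSFER)
    print(cost, assignment)
```

Exhaustive search is only practical for a handful of operations; a real ILP solver would be needed for larger graphs, and a realistic cost model would also account for parallel execution across units rather than a sequential chain.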
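Abstract (reference translation): Deep reinforcement learning has achieved remarkable success across various domains. However, the tight coupling between training and inference makes accelerating DRL training a key challenge for DRL optimization. Two issues hinder efficient DRL training: (1) computational intensity varies significantly across DRL algorithms, and even among operations within the same algorithm, which complicates hardware platform selection; and (2) DRL's wide dynamic range can lead to substantial reward errors under conventional FP16+FP32 mixed-precision quantization. Existing work has focused mainly on accelerating DRL for specific computing units or on optimizing inference-stage quantization; AP-DRL is proposed to address the challenges above. AP-DRL is an automatic task partitioning framework that harnesses the heterogeneous architecture of AMD Versal ACAP (integrating CPUs, FPGAs, and AI Engines) to accelerate DRL training through intelligent hardware-aware optimization. The approach begins with a bottleneck analysis of CPU, FPGA, and AIE performance across diverse DRL workloads, which informs the design principles for AP-DRL's task partitioning and quantization optimization. The framework then addresses platform selection through design space exploration-based profiling and ILP-based partitioning models that match operations to the most suitable computing units. For the quantization challenge, AP-DRL employs a hardware-aware algorithm that coordinates FP32 (CPU), FP16 (FPGA/DSP), and BF16 (AI Engine) operations by leveraging Versal ACAP's native support for these precision formats. Comprehensive experiments indicate that AP-DRL can achieve speedups of up to 4.17$\times$ over programmable logic and up to 3.82$\times$ over AI Engine baselines while maintaining training convergence.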
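The dynamic-range concern behind the quantization challenge can be illustrated with a small sketch. FP16 has a 5-bit exponent and 10-bit mantissa (largest finite value 65504), while BF16 keeps the 8-bit exponent of FP32 with only a 7-bit mantissa. The helper names to_fp16 and to_bf16 below are hypothetical, and the BF16 conversion is emulated by truncating the FP32 bit pattern, which is a simplification: real hardware typically rounds rather than truncates. This is not the paper's quantization algorithm, only a demonstration of the range versus precision trade-off between the two formats.

```python
import struct


def to_fp16(x):
    """Round x to IEEE 754 half precision; values beyond the range become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values too large for half precision.
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x):
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding.

    Simplifying assumption: truncation instead of round-to-nearest.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


if __name__ == "__main__":
    for value in (1000.5, 100000.0):
        print(value, to_fp16(value), to_bf16(value))
```

A large value such as 100000.0 overflows in FP16 but survives in BF16 with reduced precision, whereas a moderate value such as 1000.5 is exact in FP16 and loses its fractional part in BF16. This is one plausible reason for assigning formats per operation rather than globally.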

Related papers