Fugu-MT Paper Translation (Summary): Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action Spaces

Paper summary: Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action Spaces

arxiv url: http://arxiv.org/abs/2509.22963v2
Date: Wed, 01 Oct 2025 00:48:42 GMT
Status: Translation complete
Last updated in system: 2025-10-02 12:11:26.77378
Title: Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action Spaces
Title (reference translation): Reinforcement Learning for Combinatorial Action Spaces Using Discrete Diffusion Policies
Authors: Haitong Ma, Ofir Nabati, Aviv Rosenberg, Bo Dai, Oran Lang, Idan Szpektor, Craig Boutilier, Na Li, Shie Mannor, Lior Shani, Guy Tennenholtz
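Abstract summary: Reinforcement learning (RL) struggles to scale to the large action spaces common in many real-world problems. This paper proposes a novel framework for training discrete diffusion models as highly effective policies in complex environments.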
Reference score (independently computed attention measure): 57.466101098183884
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning (RL) struggles to scale to large, combinatorial action spaces common in many real-world problems. This paper introduces a novel framework for training discrete diffusion models as highly effective policies in these complex settings. Our key innovation is an efficient online training process that ensures stable and effective policy improvement. By leveraging policy mirror descent (PMD) to define an ideal, regularized target policy distribution, we frame the policy update as a distributional matching problem, training the expressive diffusion model to replicate this stable target. This decoupled approach stabilizes learning and significantly enhances training performance. Our method achieves state-of-the-art results and superior sample efficiency across a diverse set of challenging combinatorial benchmarks, including DNA sequence generation, RL with macro-actions, and multi-agent systems. Experiments demonstrate that our diffusion policies attain superior performance compared to other baselines.
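Abstract (reference translation): Reinforcement learning (RL) struggles to scale to the large, combinatorial action spaces common in many real-world problems. This paper proposes a novel framework for training discrete diffusion models as highly effective policies in these complex settings. Our key innovation is an efficient online training process that ensures stable and effective policy improvement. By using policy mirror descent (PMD) to define an ideal, regularized target policy distribution, we treat the policy update as a distribution matching problem and train the expressive diffusion model to replicate this stable target. This decoupled approach stabilizes learning and substantially improves training performance. The proposed method achieves state-of-the-art results and superior sample efficiency on a diverse set of combinatorial benchmarks, including DNA sequence generation, RL with macro-actions, and multi-agent systems. Experiments show that the diffusion policies outperform other baselines.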
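The abstract mentions using policy mirror descent (PMD) to define a regularized target policy that the diffusion model is then trained to match. As a rough illustration only, the sketch below computes the well-known closed form of a KL-regularized mirror descent step over a small, fully enumerated action set. The abstract does not give the paper's exact objective, regularizer, or matching loss, so the function names (`pmd_target`, `forward_kl`), the temperature `lam`, and the toy values are all assumptions made for this example.

```python
import numpy as np

def pmd_target(pi_old, q_values, lam):
    """Target of a KL-regularized policy mirror descent step (generic form).

    Closed-form maximizer of  <pi, q> - lam * KL(pi || pi_old):
        pi_target(a) is proportional to pi_old(a) * exp(q(a) / lam).
    Assumes pi_old is strictly positive on every action (hypothetical setup).
    """
    logits = np.log(pi_old) + q_values / lam
    logits -= logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def forward_kl(p, q):
    """KL(p || q): one possible distribution-matching loss (an assumption here)."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Toy example: four actions, uniform previous policy, made-up action values.
pi_old = np.array([0.25, 0.25, 0.25, 0.25])
q_values = np.array([1.0, 0.0, 2.0, -1.0])
target = pmd_target(pi_old, q_values, lam=0.5)

# The gap a learned policy would have to close, starting from pi_old.
gap = forward_kl(target, pi_old)
```

A smaller `lam` makes the target concentrate on high-value actions, while a larger `lam` keeps it close to `pi_old`. In a combinatorial action space the target cannot be enumerated like this, which is presumably why the paper fits an expressive generative model to it instead.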

Related papers