Fugu-MT Paper Translation (Abstract): Learning Rollout from Sampling: An R1-Style Tokenized Traffic Simulation Model

Paper overview: Learning Rollout from Sampling: An R1-Style Tokenized Traffic Simulation Model

arxiv url: http://arxiv.org/abs/2603.24989v1
Date: Thu, 26 Mar 2026 03:29:46 GMT
Status: Translation complete
Last updated in system: 2026-03-27 20:52:48.079313
Title: Learning Rollout from Sampling: An R1-Style Tokenized Traffic Simulation Model
Title (reference translation): Rollout Learning via Sampling: An R1-Style Tokenized Traffic Simulation Model
Authors: Ziyan Wang, Peng Chen, Ding Li, Chiwei Li, Qichao Zhang, Zhongpu Xia, Guizhen Yu
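Abstract summary: R1Sim is an initial attempt at reinforcement learning based on motion token entropy patterns. It introduces an entropy-guided adaptive sampling mechanism that focuses on previously overlooked motion tokens with high uncertainty yet high potential. Overall, these components enable a balanced exploration-exploitation trade-off through diverse high-uncertainty sampling and group-wise comparative estimation.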
Reference score (independently computed attention score): 21.835465637680798
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Learning diverse and high-fidelity traffic simulations from human driving demonstrations is crucial for autonomous driving evaluation. The recent next-token prediction (NTP) paradigm, widely adopted in large language models (LLMs), has been applied to traffic simulation and achieves iterative improvements via supervised fine-tuning (SFT). However, such methods limit active exploration of potentially valuable motion tokens, particularly in suboptimal regions. Entropy patterns provide a promising perspective for enabling exploration driven by motion token uncertainty. Motivated by this insight, we propose a novel tokenized traffic simulation policy, R1Sim, which represents an initial attempt to explore reinforcement learning based on motion token entropy patterns, and systematically analyzes the impact of different motion tokens on simulation outcomes. Specifically, we introduce an entropy-guided adaptive sampling mechanism that focuses on previously overlooked motion tokens with high uncertainty yet high potential. We further optimize motion behaviors using Group Relative Policy Optimization (GRPO), guided by a safety-aware reward design. Overall, these components enable a balanced exploration-exploitation trade-off through diverse high-uncertainty sampling and group-wise comparative estimation, resulting in realistic, safe, and diverse multi-agent behaviors. Extensive experiments on the Waymo Sim Agent benchmark demonstrate that R1Sim achieves competitive performance compared to state-of-the-art methods.
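Abstract (reference translation): Learning diverse, high-fidelity traffic simulations from human driving demonstrations is essential for evaluating autonomous driving. The next-token prediction (NTP) paradigm, widely adopted in large language models (LLMs), has recently been applied to traffic simulation, where it achieves iterative improvements through supervised fine-tuning (SFT). Such methods, however, limit active exploration of potentially valuable motion tokens, particularly in suboptimal regions. Entropy patterns offer a promising perspective for enabling exploration driven by motion token uncertainty. This work therefore proposes R1Sim, a new tokenized traffic simulation policy that makes an initial attempt at reinforcement learning based on motion token entropy patterns and systematically analyzes how different motion tokens affect simulation outcomes. Specifically, it introduces an entropy-guided adaptive sampling mechanism that focuses on previously overlooked motion tokens with high uncertainty yet high potential. Motion behaviors are further optimized with Group Relative Policy Optimization (GRPO), guided by a safety-aware reward design. Overall, these components enable a balanced exploration-exploitation trade-off through diverse high-uncertainty sampling and group-wise comparative estimation, yielding realistic, safe, and diverse multi-agent behaviors. Extensive experiments on the Waymo Sim Agent benchmark show that R1Sim achieves competitive performance compared with state-of-the-art methods.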
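
The abstract names two ingredients without giving formulas: the entropy of the motion-token distribution and GRPO's group-wise comparison of rollouts. As a rough illustration only, the sketch below computes Shannon entropy per rollout step, flags steps above a threshold, and standardizes rewards within a group in the usual GRPO manner. The function names, the fixed threshold rule, and the use of the population standard deviation are assumptions made for this example; they are not taken from the R1Sim paper.

```python
import math
from statistics import mean, pstdev


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_steps(step_probs, threshold):
    """Indices of steps whose token distribution has entropy above `threshold`.

    Hypothetical selection rule: the paper's adaptive mechanism is not
    specified in the abstract, so a fixed threshold stands in for it here.
    """
    return [i for i, probs in enumerate(step_probs)
            if token_entropy(probs) > threshold]


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward within its group.

    Assumes the population standard deviation; implementations differ on this.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Three rollout steps over a two-token vocabulary (made-up numbers).
    steps = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1]]
    print(high_entropy_steps(steps, threshold=0.5))
    # Four rollouts in one group, scored by some reward function.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

In this toy setup only the uniform middle step exceeds the threshold, and rollouts with above-average reward receive positive advantages while the rest receive negative ones, which is the group-wise comparison the abstract refers to.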

Related papers