Advancing Multi-agent Traffic Simulation via R1-Style Reinforcement Fine-Tuning
- URL: http://arxiv.org/abs/2509.23993v1
- Date: Sun, 28 Sep 2025 17:36:13 GMT
- Title: Advancing Multi-agent Traffic Simulation via R1-Style Reinforcement Fine-Tuning
- Authors: Muleilan Pei, Shaoshuai Shi, Shaojie Shen
- Abstract summary: We propose a novel R1-style reinforcement fine-tuning paradigm tailored for next-token prediction models to better align agent behavior with human preferences and evaluation metrics. Our approach introduces a metric-oriented policy optimization algorithm to improve distribution alignment and an iterative "SFT-RFT-SFT" training strategy that alternates between Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT). The results on the Open Sim Agents Challenge showcase that SMART-R1 achieves state-of-the-art performance with an overall realism meta score of 0.7858, ranking first on
- Score: 35.83999932977034
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scalable and realistic simulation of multi-agent traffic behavior is critical for advancing autonomous driving technologies. Although existing data-driven simulators have made significant strides in this domain, they predominantly rely on supervised learning to align simulated distributions with real-world driving scenarios. A persistent challenge, however, lies in the distributional shift that arises between training and testing, which often undermines model generalization in unseen environments. To address this limitation, we propose SMART-R1, a novel R1-style reinforcement fine-tuning paradigm tailored for next-token prediction models to better align agent behavior with human preferences and evaluation metrics. Our approach introduces a metric-oriented policy optimization algorithm to improve distribution alignment and an iterative "SFT-RFT-SFT" training strategy that alternates between Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) to maximize performance gains. Extensive experiments on the large-scale Waymo Open Motion Dataset (WOMD) validate the effectiveness of this simple yet powerful R1-style training framework in enhancing foundation models. The results on the Waymo Open Sim Agents Challenge (WOSAC) showcase that SMART-R1 achieves state-of-the-art performance with an overall realism meta score of 0.7858, ranking first on the leaderboard at the time of submission.
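The abstract names an iterative "SFT-RFT-SFT" schedule and a metric-oriented policy optimization step but gives no algorithmic detail. The following is only a toy sketch of how such a schedule could be wired up, under loud assumptions: the policy is a single categorical distribution over a tiny hypothetical token vocabulary, the RFT step is plain REINFORCE with a mean-reward baseline (not necessarily the paper's algorithm), and `reward_fn` is a hypothetical stand-in for a realism metric. All function names here are illustrative and do not come from SMART-R1.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sft_step(logits, target, lr):
    """One cross-entropy gradient step toward a demonstrated token."""
    probs = softmax(logits)
    return [
        l + lr * ((1.0 if i == target else 0.0) - p)
        for i, (l, p) in enumerate(zip(logits, probs))
    ]


def rft_step(logits, reward_fn, rng, lr, num_samples=16):
    """One REINFORCE step with a mean-reward baseline (assumed, not the paper's rule)."""
    probs = softmax(logits)
    tokens = rng.choices(range(len(logits)), weights=probs, k=num_samples)
    rewards = [reward_fn(t) for t in tokens]
    baseline = sum(rewards) / num_samples
    grad = [0.0] * len(logits)
    for t, r in zip(tokens, rewards):
        advantage = r - baseline
        for i, p in enumerate(probs):
            # Gradient of log pi(t) with respect to logit i, scaled by the advantage.
            grad[i] += advantage * ((1.0 if i == t else 0.0) - p) / num_samples
    return [l + lr * g for l, g in zip(logits, grad)]


def train_sft_rft_sft(demos, reward_fn, vocab_size, steps=200, seed=0):
    """Run the three phases in order and return the final token probabilities."""
    rng = random.Random(seed)
    logits = [0.0] * vocab_size
    for phase in ("sft", "rft", "sft"):
        for _ in range(steps):
            if phase == "sft":
                logits = sft_step(logits, rng.choice(demos), lr=0.1)
            else:
                logits = rft_step(logits, reward_fn, rng, lr=0.5)
    return softmax(logits)


if __name__ == "__main__":
    # Hypothetical data: token 0 is both common in demonstrations and rewarded.
    demos = [0, 0, 1, 2]
    print(train_sft_rft_sft(demos, lambda t: 1.0 if t == 0 else 0.0, vocab_size=4))
```

In this toy setting the middle phase shifts probability mass toward tokens that the metric rewards, and the final SFT phase pulls the policy back toward the demonstration distribution. A real next-token traffic model would condition on map and agent context and score whole rollouts rather than single tokens.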
Related papers
- Model-Based Policy Adaptation for Closed-Loop End-to-End Autonomous Driving [54.46325690390831]
We propose Model-based Policy Adaptation (MPA), a general framework that enhances the robustness and safety of pretrained E2E driving agents during deployment. MPA first generates diverse counterfactual trajectories using a geometry-consistent simulation engine. MPA trains a diffusion-based policy adapter to refine the base policy's predictions and a multi-step Q value model to evaluate long-term outcomes.
arXiv Detail & Related papers (2025-11-26T17:01:41Z)
- SPACeR: Self-Play Anchoring with Centralized Reference Models [50.55045557371374]
Sim agent policies are realistic, human-like, fast, and scalable in multi-agent settings. Recent progress in imitation learning with large diffusion-based or tokenized models has shown that behaviors can be captured directly from human driving data. We propose SPACeR, a framework that leverages a pretrained tokenized autoregressive motion model as a central reference policy.
arXiv Detail & Related papers (2025-10-20T19:53:02Z)
- First Order Model-Based RL through Decoupled Backpropagation [10.963895023346879]
We propose an approach that decouples trajectory generation from gradient computation. Our method achieves the sample efficiency and speed of specialized locomotions such as SHAC. We empirically validate our gradient algorithm on benchmark control tasks and demonstrate its effectiveness on a real Go2 quadruped robot.
arXiv Detail & Related papers (2025-08-29T19:55:25Z)
- Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling [43.835234728790795]
Prefix-RFT is a hybrid approach that synergizes learning from both demonstration and exploration. It not only surpasses the performance of standalone SFT and RFT but also outperforms parallel mixed-policy RFT methods.
arXiv Detail & Related papers (2025-07-02T13:04:09Z)
- RIFT: Group-Relative RL Fine-Tuning for Realistic and Controllable Traffic Simulation [13.319344167881383]
We introduce a dual-stage AV-centric simulation framework that conducts imitation learning pre-training in a data-driven simulator. We then learn fine-tuning in a physics-based simulator to enhance style-level controllability. In the fine-tuning stage, we propose RIFT, a novel group-relative RL fine-tuning strategy.
arXiv Detail & Related papers (2025-05-06T09:12:37Z)
- From Imitation to Exploration: End-to-end Autonomous Driving based on World Model [24.578178308010912]
RAMBLE is an end-to-end world model-based RL method for driving decision-making. It can handle complex and dynamic traffic scenarios. It achieves state-of-the-art performance in route completion rate on the CARLA Leaderboard 1.0 and completes all 38 scenarios on the CARLA Leaderboard 2.0.
arXiv Detail & Related papers (2024-10-03T06:45:59Z)
- Autonomous Vehicle Controllers From End-to-End Differentiable Simulation [57.278726604424556]
We propose a differentiable simulator and design an analytic policy gradients (APG) approach to training AV controllers. Our proposed framework brings the differentiable simulator into an end-to-end training loop, where gradients of environment dynamics serve as a useful prior to help the agent learn a more grounded policy. We find significant improvements in performance and robustness to noise in the dynamics, as well as overall more intuitive human-like handling.
arXiv Detail & Related papers (2024-09-12T11:50:06Z)
- SMART: Scalable Multi-agent Real-time Motion Generation via Next-token Prediction [4.318757942343036]
We introduce a novel autonomous driving motion generation paradigm that models vectorized map and agent trajectory data into discrete sequence tokens.
These tokens are then processed through a decoder-only transformer architecture to train for the next token prediction task.
We have collected over 1 billion motion tokens from multiple datasets, validating the model's scalability.
arXiv Detail & Related papers (2024-05-24T16:17:35Z)
- SAFE-SIM: Safety-Critical Closed-Loop Traffic Simulation with Diffusion-Controllable Adversaries [94.84458417662407]
We introduce SAFE-SIM, a controllable closed-loop safety-critical simulation framework.
Our approach yields two distinct advantages: 1) generating realistic long-tail safety-critical scenarios that closely reflect real-world conditions, and 2) providing controllable adversarial behavior for more comprehensive and interactive evaluations.
We validate our framework empirically using the nuScenes and nuPlan datasets across multiple planners, demonstrating improvements in both realism and controllability.
arXiv Detail & Related papers (2023-12-31T04:14:43Z)
- Waymax: An Accelerated, Data-Driven Simulator for Large-Scale Autonomous Driving Research [76.93956925360638]
Waymax is a new data-driven simulator for autonomous driving in multi-agent scenes.
It runs entirely on hardware accelerators such as TPUs/GPUs and supports in-graph simulation for training.
We benchmark a suite of popular imitation and reinforcement learning algorithms with ablation studies on different design decisions.
arXiv Detail & Related papers (2023-10-12T20:49:15Z)
- Reinforcement Learning with Human Feedback for Realistic Traffic Simulation [53.85002640149283]
A key element of effective simulation is the incorporation of realistic traffic models that align with human knowledge.
This study identifies two main challenges: capturing the nuances of human preferences on realism and the unification of diverse traffic simulation models.
arXiv Detail & Related papers (2023-09-01T19:29:53Z)
- When to Update Your Model: Constrained Model-based Reinforcement Learning [50.74369835934703]
We propose a novel and general theoretical scheme for a non-decreasing performance guarantee of model-based RL (MBRL).
Our follow-up derived bounds reveal the relationship between model shifts and performance improvement.
A further example demonstrates that learning models from a dynamically varying number of explorations benefits the eventual returns.
arXiv Detail & Related papers (2022-10-15T17:57:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.