Post-Training and Test-Time Scaling of Generative Agent Behavior Models for Interactive Autonomous Driving
- URL: http://arxiv.org/abs/2512.13262v1
- Date: Mon, 15 Dec 2025 12:18:50 GMT
- Title: Post-Training and Test-Time Scaling of Generative Agent Behavior Models for Interactive Autonomous Driving
- Authors: Hyunki Seong, Jeong-Kyun Lee, Heesoo Myeong, Yongho Shin, Hyun-Mook Cho, Duck Hoon Kim, Pranav Desai, Monu Surana,
- Abstract summary: Group Relative Behavior Optimization improves safety performance by over 40% while preserving behavioral realism. Warm-K is a warm-started Top-K sampling strategy that balances consistency and diversity in motion selection.
- Score: 3.8612647047433217
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning interactive motion behaviors among multiple agents is a core challenge in autonomous driving. While imitation learning models generate realistic trajectories, they often inherit biases from datasets dominated by safe demonstrations, limiting robustness in safety-critical cases. Moreover, most studies rely on open-loop evaluation, overlooking compounding errors in closed-loop execution. We address these limitations with two complementary strategies. First, we propose Group Relative Behavior Optimization (GRBO), a reinforcement learning post-training method that fine-tunes pretrained behavior models via group relative advantage maximization with human regularization. Using only 10% of the training dataset, GRBO improves safety performance by over 40% while preserving behavioral realism. Second, we introduce Warm-K, a warm-started Top-K sampling strategy that balances consistency and diversity in motion selection. Warm-K-based test-time scaling enhances behavioral consistency and reactivity at inference without retraining, mitigating covariate shift and reducing performance discrepancies. Demo videos are available in the supplementary material.
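The two contributions described in the abstract can be sketched in code. This is a minimal illustration, not the authors' implementation: it assumes GRBO's "group relative advantage" follows the GRPO-style recipe of normalizing each rollout's reward against its group, and it assumes Warm-K blends candidate scores with a consistency bonus toward the previously executed motion before sampling among the Top-K. The function names, the `warm_weight` parameter, and the distance-based consistency term are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def group_relative_advantage(rewards):
    """GRPO-style advantage: normalize each rollout's reward
    against the mean and std of its own group (an assumption
    about what GRBO's 'group relative' term means)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def warm_k_select(candidates, scores, prev_choice, k=3, warm_weight=0.5):
    """Hypothetical warm-started Top-K sampling: bias candidate
    scores toward the previously executed motion (consistency),
    then sample uniformly among the Top-K (diversity)."""
    candidates = np.asarray(candidates, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # Consistency bonus: penalize distance from the previous selection.
    consistency = -np.linalg.norm(candidates - prev_choice, axis=-1)
    blended = scores + warm_weight * consistency
    top_k = np.argsort(blended)[-k:]      # indices of the K best candidates
    return candidates[rng.choice(top_k)]  # sample one of them for diversity

# Toy usage: 8 candidate 2-D motion endpoints with scalar scores.
cands = rng.normal(size=(8, 2))
scrs = rng.normal(size=8)
prev = np.zeros(2)
adv = group_relative_advantage(scrs)
choice = warm_k_select(cands, scrs, prev, k=3)
```

The normalized advantages feed a standard policy-gradient update on the pretrained behavior model; the Warm-K step runs only at inference, which is what lets the paper report closed-loop gains without retraining.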
Related papers
- Human-in-the-loop Online Rejection Sampling for Robotic Manipulation [55.99788088622936]
Hi-ORS stabilizes value estimation by filtering out negatively rewarded samples during online fine-tuning. Hi-ORS fine-tunes a pi-base policy to master contact-rich manipulation in just 1.5 hours of real-world training.
arXiv Detail & Related papers (2025-10-30T11:53:08Z) - First Order Model-Based RL through Decoupled Backpropagation [10.963895023346879]
We propose an approach that decouples trajectory generation from gradient computation. Our method achieves the sample efficiency and speed of specialized locomotion algorithms such as SHAC. We empirically validate our gradient algorithm on benchmark control tasks and demonstrate its effectiveness on a real Go2 quadruped robot.
arXiv Detail & Related papers (2025-08-29T19:55:25Z) - Fast Adaptation with Behavioral Foundation Models [82.34700481726951]
Unsupervised zero-shot reinforcement learning has emerged as a powerful paradigm for pretraining behavioral foundation models. Despite promising results, zero-shot policies are often suboptimal due to errors induced by the unsupervised training process. We propose fast adaptation strategies that search in the low-dimensional task-embedding space of the pre-trained BFM to rapidly improve the performance of its zero-shot policies.
arXiv Detail & Related papers (2025-04-10T16:14:17Z) - HAD-Gen: Human-like and Diverse Driving Behavior Modeling for Controllable Scenario Generation [13.299893784290733]
HAD-Gen is a framework for realistic traffic scenario generation that simulates diverse human-like driving behaviors. The proposed framework achieves a 90.96% goal-reaching rate, an off-road rate of 2.08%, and a collision rate of 6.91% in the generalization test.
arXiv Detail & Related papers (2025-03-19T09:38:45Z) - Bidirectional Decoding: Improving Action Chunking via Guided Test-Time Sampling [51.38330727868982]
We show how action chunking impacts the divergence between a learner and a demonstrator. We propose Bidirectional Decoding (BID), a test-time inference algorithm that bridges action chunking with closed-loop adaptation. Our method boosts the performance of two state-of-the-art generative policies across seven simulation benchmarks and two real-world tasks.
arXiv Detail & Related papers (2024-08-30T15:39:34Z) - On Learning the Tail Quantiles of Driving Behavior Distributions via Quantile Regression and Flows [13.540998552232006]
We consider the problem of learning models that accurately capture the diversity and tail quantiles of human driver behavior probability distributions.
We adapt two flexible quantile learning frameworks for this setting that avoid strong distributional assumptions.
We evaluate our approach in a one-step acceleration prediction task, and in multi-step driver simulation rollouts.
arXiv Detail & Related papers (2023-05-22T15:09:04Z) - EUCLID: Towards Efficient Unsupervised Reinforcement Learning with Multi-choice Dynamics Model [46.99510778097286]
Unsupervised reinforcement learning (URL) poses a promising paradigm to learn useful behaviors in a task-agnostic environment.
We introduce a novel model-fused paradigm to jointly pre-train the dynamics model and unsupervised exploration policy in the pre-training phase.
We show that EUCLID achieves state-of-the-art performance with high sample efficiency.
arXiv Detail & Related papers (2022-10-02T12:11:44Z) - Training and Evaluation of Deep Policies using Reinforcement Learning and Generative Models [67.78935378952146]
GenRL is a framework for solving sequential decision-making problems.
It exploits the combination of reinforcement learning and latent variable generative models.
We experimentally determine the characteristics of generative models that have most influence on the performance of the final policy training.
arXiv Detail & Related papers (2022-04-18T22:02:32Z) - Imitating, Fast and Slow: Robust learning from demonstrations via decision-time planning [96.72185761508668]
Imitation with Planning at Test-time (IMPLANT) is a new meta-algorithm for imitation learning.
We demonstrate that IMPLANT significantly outperforms benchmark imitation learning approaches on standard control environments.
arXiv Detail & Related papers (2022-04-07T17:16:52Z) - Social NCE: Contrastive Learning of Socially-aware Motion Representations [87.82126838588279]
Experimental results show that the proposed method dramatically reduces the collision rates of recent trajectory forecasting, behavioral cloning and reinforcement learning algorithms.
Our method makes few assumptions about neural architecture designs, and hence can be used as a generic way to promote the robustness of neural motion models.
arXiv Detail & Related papers (2020-12-21T22:25:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.