Paying Less Generalization Tax: A Cross-Domain Generalization Study of RL Training for LLM Agents
- URL: http://arxiv.org/abs/2601.18217v1
- Date: Mon, 26 Jan 2026 07:07:03 GMT
- Title: Paying Less Generalization Tax: A Cross-Domain Generalization Study of RL Training for LLM Agents
- Authors: Zhihan Liu, Lin Guan, Yixin Nie, Kai Zhang, Zhuoqun Hao, Lin Chen, Asli Celikyilmaz, Zhaoran Wang, Na Zhang
- Abstract summary: Generalist LLM agents are often post-trained on a narrow set of environments but deployed across far broader, unseen domains. In this work, we investigate the challenge of agentic post-training when the eventual test domains are unknown.
- Score: 39.70183477067068
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generalist LLM agents are often post-trained on a narrow set of environments but deployed across far broader, unseen domains. In this work, we investigate the challenge of agentic post-training when the eventual test domains are unknown. Specifically, we analyze which properties of reinforcement learning (RL) environments and modeling choices have the greatest influence on out-of-domain performance. First, we identify two environment axes that strongly correlate with cross-domain generalization: (i) state information richness, i.e., the amount of information for the agent to process from the state, and (ii) planning complexity, estimated via goal reachability and trajectory length under a base policy. Notably, domain realism and text-level similarity are not the primary factors; for instance, the simple grid-world domain Sokoban leads to even stronger generalization in SciWorld than the more realistic ALFWorld. Motivated by these findings, we further show that increasing state information richness alone can already effectively improve cross-domain robustness. We propose a randomization technique, which is low-overhead and broadly applicable: add small amounts of distractive goal-irrelevant features to the state to make it richer without altering the task. Beyond environment-side properties, we also examine several modeling choices: (a) SFT warmup or mid-training helps prevent catastrophic forgetting during RL but undermines generalization to domains that are not included in the mid-training datamix; and (b) turning on step-by-step thinking during RL, while not always improving in-domain performance, plays a crucial role in preserving generalization.
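The randomization technique described in the abstract, adding small amounts of goal-irrelevant features to the state without altering the task, can be sketched as follows. This is a minimal illustration only: the distractor pool, function name, and parameters are hypothetical, as the paper's concrete distractor design is not given in this listing.

```python
import random

# Hypothetical pool of goal-irrelevant distractor features; the paper's actual
# distractor construction is not specified here, so these are placeholders.
DISTRACTOR_POOL = [
    "A clock on the wall reads 3:15.",
    "There is a faint humming sound in the background.",
    "A small potted plant sits in the corner.",
    "The floor tiles alternate between gray and white.",
]


def randomize_state(observation: str, n_distractors: int = 2, seed=None) -> str:
    """Append goal-irrelevant distractor sentences to a textual state.

    The underlying task is unchanged; only extra irrelevant detail is added,
    which increases the state's information richness.
    """
    rng = random.Random(seed)
    k = min(n_distractors, len(DISTRACTOR_POOL))
    distractors = rng.sample(DISTRACTOR_POOL, k=k)
    return observation + " " + " ".join(distractors)


# Example: enrich a simple grid-world observation before passing it to the agent.
obs = "You are in a room. A box sits on the square north of you."
rich_obs = randomize_state(obs, n_distractors=2, seed=0)
```

Because the distractors carry no goal-relevant information, the optimal policy is unchanged; only the amount of text the agent must filter increases.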
Related papers
- Sample-Efficient Neurosymbolic Deep Reinforcement Learning [49.60927398960061]
We propose a neuro-symbolic Deep RL approach that integrates background symbolic knowledge to improve sample efficiency. Online reasoning is performed to guide the training process through two mechanisms. We show improved performance over a state-of-the-art reward machine baseline.
arXiv Detail & Related papers (2026-01-06T09:28:53Z) - Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models [71.9060068259379]
We propose cascaded domain-wise reinforcement learning to build general-purpose reasoning models. Our 14B model, after RL, outperforms its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6 Pro and achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI).
arXiv Detail & Related papers (2025-12-15T18:02:35Z) - HFedATM: Hierarchical Federated Domain Generalization via Optimal Transport and Regularized Mean Aggregation [12.655334562608314]
Federated Learning (FL) is a decentralized approach where multiple clients collaboratively train a shared global model without sharing their raw data. This paper introduces Hierarchical Federated Domain Generalization (HFedDG), a novel scenario designed to investigate domain shift within hierarchical architectures.
arXiv Detail & Related papers (2025-08-07T08:14:52Z) - General-Reasoner: Advancing LLM Reasoning Across All Domains [64.70599911897595]
Reinforcement learning (RL) has recently demonstrated strong potential in enhancing the reasoning capabilities of large language models (LLMs). We propose General-Reasoner, a novel training paradigm designed to enhance LLM reasoning capabilities across diverse domains. We train a series of models and evaluate them on a wide range of datasets covering diverse domains such as physics, chemistry, finance, and electronics.
arXiv Detail & Related papers (2025-05-20T17:41:33Z) - Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains [92.36624674516553]
Reinforcement learning with verifiable rewards (RLVR) has demonstrated significant success in enhancing mathematical reasoning and coding performance of large language models (LLMs). We investigate the effectiveness and scalability of RLVR across diverse real-world domains including medicine, chemistry, psychology, economics, and education. We utilize a generative scoring technique that yields soft, model-based reward signals to overcome limitations posed by binary verifications.
arXiv Detail & Related papers (2025-03-31T08:22:49Z) - FedDAG: Federated Domain Adversarial Generation Towards Generalizable Medical Image Analysis [13.028776283830686]
Federated Domain Adversarial Generation (FedDAG) aims to simulate the domain shift and improve the model generalization. It generates novel-style images by maximizing the instance-level feature discrepancy between original and generated images. Experiments across four medical benchmarks demonstrate FedDAG's ability to enhance generalization in federated medical scenarios.
arXiv Detail & Related papers (2025-01-22T07:08:45Z) - Exploring Language Model Generalization in Low-Resource Extractive QA [57.14068405860034]
We investigate Extractive Question Answering (EQA) with Large Language Models (LLMs) under domain drift. We devise a series of experiments to explain the performance gap empirically.
arXiv Detail & Related papers (2024-09-27T05:06:43Z) - Beyond Finite Data: Towards Data-free Out-of-distribution Generalization via Extrapolation [19.944946262284123]
Humans can easily extrapolate novel domains, thus, an intriguing question arises: How can neural networks extrapolate like humans and achieve OOD generalization?
We introduce a novel approach to domain extrapolation that leverages reasoning ability and the extensive knowledge encapsulated within large language models (LLMs) to synthesize entirely new domains.
Our methods exhibit commendable performance in this setting, even surpassing the supervised setting by approximately 1-2% on datasets such as VLCS.
arXiv Detail & Related papers (2024-03-08T18:44:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.