Composite Flow Matching for Reinforcement Learning with Shifted-Dynamics Data
- URL: http://arxiv.org/abs/2505.23062v2
- Date: Wed, 25 Jun 2025 21:09:46 GMT
- Title: Composite Flow Matching for Reinforcement Learning with Shifted-Dynamics Data
- Authors: Lingkai Kong, Haichuan Wang, Tonghan Wang, Guojun Xiong, Milind Tambe
- Abstract summary: CompFlow is a method grounded in the theoretical connection between flow matching and optimal transport. We model the target dynamics as a conditional flow built upon the output distribution of the source-domain flow. We show that CompFlow outperforms strong baselines across several RL benchmarks with shifted dynamics.
- Score: 33.9944806028575
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Incorporating pre-collected offline data from a source environment can significantly improve the sample efficiency of reinforcement learning (RL), but this benefit is often challenged by discrepancies between the transition dynamics of the source and target environments. Existing methods typically address this issue by penalizing or filtering out source transitions in high dynamics-gap regions. However, their estimation of the dynamics gap often relies on KL divergence or mutual information, which can be ill-defined when the source and target dynamics have disjoint support. To overcome these limitations, we propose CompFlow, a method grounded in the theoretical connection between flow matching and optimal transport. Specifically, we model the target dynamics as a conditional flow built upon the output distribution of the source-domain flow, rather than learning it directly from a Gaussian prior. This composite structure offers two key advantages: (1) improved generalization for learning target dynamics, and (2) a principled estimation of the dynamics gap via the Wasserstein distance between source and target transitions. Leveraging our principled estimation of the dynamics gap, we further introduce an optimistic active data collection strategy that prioritizes exploration in regions of high dynamics gap, and theoretically prove that it reduces the performance disparity with the optimal policy. Empirically, CompFlow outperforms strong baselines across several RL benchmarks with shifted dynamics.
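The composite structure described in the abstract can be illustrated with a toy sketch: a source flow maps Gaussian noise to source next-states, the target flow starts from the source flow's output rather than from a Gaussian prior, and the dynamics gap is estimated as an empirical Wasserstein distance between the two transition distributions. This is a minimal 1-D illustration, not the authors' implementation; the dynamics maps, function names, and the fixed residual shift are all assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_source_flow(s, a, n=1000):
    # Stand-in for a learned source-domain flow: transports Gaussian
    # prior samples to next-states of the toy source dynamics
    # s' = s + a + eps.
    z = rng.normal(size=n)            # Gaussian prior
    return s + a + 0.1 * z            # source-flow output samples

def sample_composite_flow(s, a, n=1000):
    # Composite (target) flow: starts from the source flow's output
    # distribution and applies a correction, here a fixed shift that
    # stands in for the learned conditional flow under shifted dynamics.
    x0 = sample_source_flow(s, a, n)
    return x0 + 0.5

def w2_empirical(x, y):
    # Empirical 1-D Wasserstein-2 distance via sorted samples
    # (quantile coupling).
    x, y = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((x - y) ** 2))

src = sample_source_flow(s=0.0, a=1.0)
tgt = sample_composite_flow(s=0.0, a=1.0)
gap = w2_empirical(src, tgt)          # dynamics-gap estimate; close to
                                      # the true shift of 0.5 here
```

In this toy setting the Wasserstein gap stays well defined even when the two transition distributions have disjoint support, which is the motivation the abstract gives for preferring it over KL divergence or mutual information.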
Related papers
- Fast State-Augmented Learning for Wireless Resource Allocation with Dual Variable Regression [83.27791109672927]
We show how a state-augmented graph neural network (GNN) parametrization for the resource allocation policy circumvents the drawbacks of the ubiquitous dual subgradient methods. Lagrangian-maximizing state-augmented policies are learned during the offline training phase. We prove a convergence result and an exponential probability bound on the excursions of the dual function (iterate) optimality gaps.
arXiv Detail & Related papers (2025-06-23T15:20:58Z)
- MOBODY: Model Based Off-Dynamics Offline Reinforcement Learning
We study the off-dynamics offline reinforcement learning problem, where the goal is to learn a policy from offline datasets with mismatched transition dynamics. We propose MOBODY, a Model-Based Off-Dynamics offline RL algorithm that enables exploration of the target domain via learned dynamics. We evaluate MOBODY on MuJoCo benchmarks and show that it significantly outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2025-06-10T05:36:54Z)
- Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining [55.262510814326035]
Existing reweighting strategies primarily focus on group-level data importance. We introduce novel algorithms for dynamic, instance-level data reweighting. Our framework allows us to devise reweighting strategies deprioritizing redundant or uninformative data.
arXiv Detail & Related papers (2025-02-10T17:57:15Z)
- Off-dynamics Conditional Diffusion Planners [15.321049697197447]
This work explores the use of more readily available, albeit off-dynamics datasets, to address the challenge of data scarcity in Offline RL.
We propose a novel approach using conditional Diffusion Probabilistic Models (DPMs) to learn the joint distribution of the large-scale off-dynamics dataset and the limited target dataset.
arXiv Detail & Related papers (2024-10-16T04:56:43Z)
- Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC. We increase the consistency and informativeness of the pairwise preference signals through targeted modifications. We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)
- Online Boosting Adaptive Learning under Concept Drift for Multistream Classification [34.64751041290346]
Multistream classification poses significant challenges due to the necessity for rapid adaptation in dynamic streaming processes with concept drift.
We propose a novel Online Boosting Adaptive Learning (OBAL) method that adaptively learns the dynamic correlation among different streams.
arXiv Detail & Related papers (2023-12-17T23:10:39Z)
- Learning to Continuously Optimize Wireless Resource in a Dynamic Environment: A Bilevel Optimization Perspective [52.497514255040514]
This work develops a new approach that enables data-driven methods to continuously learn and optimize resource allocation strategies in a dynamic environment.
We propose to build the notion of continual learning into wireless system design, so that the learning model can incrementally adapt to the new episodes.
Our design is based on a novel bilevel optimization formulation which ensures a certain "fairness" across different data samples.
arXiv Detail & Related papers (2021-05-03T07:23:39Z)
- Latent-Optimized Adversarial Neural Transfer for Sarcasm Detection [50.29565896287595]
We apply transfer learning to exploit common datasets for sarcasm detection.
We propose a generalized latent optimization strategy that allows different losses to accommodate each other.
In particular, we achieve 10.02% absolute performance gain over the previous state of the art on the iSarcasm dataset.
arXiv Detail & Related papers (2021-04-19T13:07:52Z)
- An Ode to an ODE [78.97367880223254]
We present a new paradigm for Neural ODE algorithms, called ODEtoODE, where time-dependent parameters of the main flow evolve according to a matrix flow on the orthogonal group O(d).
This nested system of two flows provides stability and effectiveness of training and provably solves the gradient vanishing-explosion problem.
arXiv Detail & Related papers (2020-06-19T22:05:19Z)
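The nested-flow idea in the ODEtoODE summary can be sketched as follows: the hidden state follows one ODE while its weight matrix follows a second flow constrained to the orthogonal group, which is what keeps the spectrum of the weights controlled. This is a minimal sketch under stated assumptions (a tanh main flow, a constant skew-symmetric generator, Euler integration), not the paper's actual parametrization; all names are illustrative.

```python
import numpy as np

def cayley(S):
    # Cayley transform: maps a skew-symmetric matrix S to an orthogonal
    # matrix Q = (I - S)^{-1} (I + S).
    I = np.eye(S.shape[0])
    return np.linalg.solve(I - S, I + S)

def odetoode_step(x, W, B, dt=0.01):
    # One Euler step of the nested system: the state x follows
    # dx/dt = tanh(W x), while the weight matrix W flows on O(d),
    # driven by a fixed skew-symmetric generator B.  Because the
    # Cayley transform of a skew matrix is exactly orthogonal, W
    # remains orthogonal at every step.
    x = x + dt * np.tanh(W @ x)
    W = W @ cayley(dt * B)
    return x, W

rng = np.random.default_rng(0)
d = 4
A = rng.normal(size=(d, d))
B = A - A.T                       # skew-symmetric generator for the weight flow
x, W = np.ones(d), np.eye(d)
for _ in range(100):
    x, W = odetoode_step(x, W, B)
```

Keeping W on O(d) means its singular values are all 1, which is one way to see why such a parametrization mitigates vanishing and exploding gradients through the flow.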