Walking the Tightrope: Disentangling Beneficial and Detrimental Drifts in Non-Stationary Custom-Tuning
- URL: http://arxiv.org/abs/2505.13081v1
- Date: Mon, 19 May 2025 13:13:38 GMT
- Title: Walking the Tightrope: Disentangling Beneficial and Detrimental Drifts in Non-Stationary Custom-Tuning
- Authors: Xiaoyu Yang, Jie Lu, En Yu,
- Abstract summary: This paper uncovers a critical yet overlooked phenomenon in multi-modal large language models (MLLMs)<n>We are pioneers in establishing the theoretical bridge between concept drift theory and RFT processes.<n>We propose a novel counterfact-aware RFT that systematically decouples beneficial distribution adaptation from harmful concept drift.
- Score: 16.97188816362991
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper uncovers a critical yet overlooked phenomenon in multi-modal large language models (MLLMs): detrimental concept drift within chain-of-thought (CoT) reasoning during non-stationary reinforcement fine-tuning (RFT), where reasoning token distributions evolve unpredictably, thereby introducing significant biases in final predictions. To address this, we are pioneers in establishing the theoretical bridge between concept drift theory and RFT processes by formalizing CoT's autoregressive token streams as non-stationary distributions undergoing arbitrary temporal shifts. Leveraging this framework, we propose a novel counterfact-aware RFT that systematically decouples beneficial distribution adaptation from harmful concept drift through concept graph-empowered LLM experts generating counterfactual reasoning trajectories. Our solution, Counterfactual Preference Optimization (CPO), enables stable RFT in non-stationary environments, particularly within the medical domain, through custom-tuning of counterfactual-aware preference alignment. Extensive experiments demonstrate our superior performance of robustness, generalization and coordination within RFT. Besides, we also contributed a large-scale dataset CXR-CounterFact (CCF), comprising 320,416 meticulously curated counterfactual reasoning trajectories derived from MIMIC-CXR. Our code and data are public.
Related papers
- Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections [65.36449542323277]
We present a unified theoretical framework bridgingSupervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training.<n>We propose a simple yet effective learning rate reduction approach that yields significant performance improvements.
arXiv Detail & Related papers (2025-06-15T05:42:29Z) - Beyond Patterns: Harnessing Causal Logic for Autonomous Driving Trajectory Prediction [10.21659221112514]
We introduce a novel trajectory prediction framework that leverages causal inference to enhance predictive robustness, generalization, and accuracy.<n>Our findings highlight the potential of causal reasoning to transform trajectory prediction, paving the way for robust autonomous driving systems.
arXiv Detail & Related papers (2025-05-11T05:56:07Z) - Rolling with the Punches: Resilient Contrastive Pre-training under Non-Stationary Drift [16.97188816362991]
A critical emerging challenge is the effective pre-training of models on dynamic data streams.<n>We first reveal that conventional contrastive pre-training methods are notably vulnerable to concept drift.<n>We propose Resilient Contrastive Pre-training (RCP), a novel method incorporating causal intervention.
arXiv Detail & Related papers (2025-02-11T15:09:05Z) - Rectified Diffusion Guidance for Conditional Generation [62.00207951161297]
We revisit the theory behind CFG and rigorously confirm that the improper configuration of the combination coefficients (i.e., the widely used summing-to-one version) brings about expectation shift of the generative distribution.
We propose ReCFG with a relaxation on the guidance coefficients such that denoising with ReCFG strictly aligns with the diffusion theory.
That way the rectified coefficients can be readily pre-computed via traversing the observed data, leaving the sampling speed barely affected.
arXiv Detail & Related papers (2024-10-24T13:41:32Z) - Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint [56.74058752955209]
This paper studies the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF)
We first identify the primary challenges of existing popular methods like offline PPO and offline DPO as lacking in strategical exploration of the environment.
We propose efficient algorithms with finite-sample theoretical guarantees.
arXiv Detail & Related papers (2023-12-18T18:58:42Z) - Boosted Control Functions: Distribution generalization and invariance in confounded models [10.503777692702952]
We introduce a strong notion of invariance that allows for distribution generalization even in the presence of nonlinear, non-identifiable structural functions.<n>We propose the ControlTwicing algorithm to estimate the Boosted Control Function (BCF) using flexible machine-learning techniques.
arXiv Detail & Related papers (2023-10-09T15:43:46Z) - SOFARI: High-Dimensional Manifold-Based Inference [8.860162863559163]
We introduce two SOFARI variants to handle strongly and weakly latent factors, where the latter covers a broader range of applications.
We show that SOFARI provides bias-corrected estimators for both latent left factor vectors and singular values, for which we show to enjoy the mean-zero normal distributions with sparse estimable variances.
We illustrate the effectiveness of SOFARI and justify our theoretical results through simulation examples and a real data application in economic forecasting.
arXiv Detail & Related papers (2023-09-26T16:01:54Z) - Advancing Counterfactual Inference through Nonlinear Quantile Regression [77.28323341329461]
We propose a framework for efficient and effective counterfactual inference implemented with neural networks.
The proposed approach enhances the capacity to generalize estimated counterfactual outcomes to unseen data.
Empirical results conducted on multiple datasets offer compelling support for our theoretical assertions.
arXiv Detail & Related papers (2023-06-09T08:30:51Z) - Federated Conformal Predictors for Distributed Uncertainty
Quantification [83.50609351513886]
Conformal prediction is emerging as a popular paradigm for providing rigorous uncertainty quantification in machine learning.
In this paper, we extend conformal prediction to the federated learning setting.
We propose a weaker notion of partial exchangeability, better suited to the FL setting, and use it to develop the Federated Conformal Prediction framework.
arXiv Detail & Related papers (2023-05-27T19:57:27Z) - Toward Certified Robustness Against Real-World Distribution Shifts [65.66374339500025]
We train a generative model to learn perturbations from data and define specifications with respect to the output of the learned model.
A unique challenge arising from this setting is that existing verifiers cannot tightly approximate sigmoid activations.
We propose a general meta-algorithm for handling sigmoid activations which leverages classical notions of counter-example-guided abstraction refinement.
arXiv Detail & Related papers (2022-06-08T04:09:13Z) - Efficient CDF Approximations for Normalizing Flows [64.60846767084877]
We build upon the diffeomorphic properties of normalizing flows to estimate the cumulative distribution function (CDF) over a closed region.
Our experiments on popular flow architectures and UCI datasets show a marked improvement in sample efficiency as compared to traditional estimators.
arXiv Detail & Related papers (2022-02-23T06:11:49Z) - Residual Pathway Priors for Soft Equivariance Constraints [44.19582621065543]
We introduce Residual Pathway Priors (RPPs) as a method for converting hard architectural constraints into soft priors.
RPPs are resilient to approximate or misspecified symmetries, and are as effective as fully constrained models even when symmetries are exact.
arXiv Detail & Related papers (2021-12-02T16:18:17Z) - False Correlation Reduction for Offline Reinforcement Learning [115.11954432080749]
We propose falSe COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm.
We empirically show that SCORE achieves the SoTA performance with 3.1x acceleration on various tasks in a standard benchmark (D4RL)
arXiv Detail & Related papers (2021-10-24T15:34:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.