ReSURE: Regularizing Supervision Unreliability for Multi-turn Dialogue Fine-tuning
- URL: http://arxiv.org/abs/2508.19996v1
- Date: Wed, 27 Aug 2025 15:54:01 GMT
- Title: ReSURE: Regularizing Supervision Unreliability for Multi-turn Dialogue Fine-tuning
- Authors: Yiming Du, Yifan Xiang, Bin Liang, Dahua Lin, Kam-Fai Wong, Fei Tan,
- Abstract summary: Multi-turn dialogue systems often suffer from degraded performance when exposed to low-quality data.<n>We propose ReSURE, an adaptive learning method that dynamically down-weights unreliable supervision without explicit filtering.<n>Experiments on both single-source and mixed-quality datasets show improved stability and response quality.
- Score: 72.05731026796335
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Fine-tuning multi-turn dialogue systems requires high-quality supervision but often suffers from degraded performance when exposed to low-quality data. Supervision errors in early turns can propagate across subsequent turns, undermining coherence and response quality. Existing methods typically address data quality via static prefiltering, which decouples quality control from training and fails to mitigate turn-level error propagation. In this context, we propose ReSURE (Regularizing Supervision UnREliability), an adaptive learning method that dynamically down-weights unreliable supervision without explicit filtering. ReSURE estimates per-turn loss distributions using Welford's online statistics and reweights sample losses on the fly accordingly. Experiments on both single-source and mixed-quality datasets show improved stability and response quality. Notably, ReSURE enjoys positive Spearman correlations (0.21 ~ 1.0 across multiple benchmarks) between response scores and number of samples regardless of data quality, which potentially paves the way for utilizing large-scale data effectively. Code is publicly available at https://github.com/Elvin-Yiming-Du/ReSURE_Multi_Turn_Training.
Related papers
- CARE What Fails: Contrastive Anchored-REflection for Verifiable Multimodal [84.71254539482369]
Group-relative reinforcement learning with verifiable rewards (RLVR) often wastes the most informative data it already has the failures.<n>We present CARE, a failure-centric post-training framework for multimodal reasoning that turns errors into supervision.<n> CARE improves accuracy and training smoothness while explicitly increasing the share of learning signal that comes from failures.
arXiv Detail & Related papers (2025-12-22T16:34:21Z) - Continual Action Quality Assessment via Adaptive Manifold-Aligned Graph Regularization [53.82400605816587]
Action Quality Assessment (AQA) quantifies human actions in videos, supporting applications in sports scoring, rehabilitation, and skill evaluation.<n>A major challenge lies in the non-stationary nature of quality distributions in real-world scenarios.<n>We introduce Continual AQA (CAQA), which equips with Continual Learning capabilities to handle evolving distributions.
arXiv Detail & Related papers (2025-10-08T10:09:47Z) - Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers [90.50039419576807]
Reinforcement Learning with Verifiable Rewards (RLVR) trains policies against automated verifiers to avoid costly human labeling.<n>To reduce vulnerability to verifier hacking, many RLVR systems collapse rewards to binary $0,1$ during training.<n>This choice carries a cost: it introduces textitfalse negatives (rejecting correct answers, FNs) and textitfalse positives (accepting incorrect ones, FPs)
arXiv Detail & Related papers (2025-10-01T13:56:44Z) - ReTrack: Data Unlearning in Diffusion Models through Redirecting the Denoising Trajectory [17.016094185289372]
We propose ReTrack, a fast and effective data unlearning method for diffusion models.<n>ReTrack employs importance sampling to construct a more efficient fine-tuning loss.<n>Experiments on MNIST T-Shirt, CelebA-HQ, CIFAR-10, and Stable Diffusion show that ReTrack achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-09-16T12:20:15Z) - Counterfactual Reward Model Training for Bias Mitigation in Multimodal Reinforcement Learning [0.5204229323525671]
We present a counterfactual reward model that introduces causal inference with multimodal representation learning to provide an unsupervised, bias-resilient reward signal.<n>We evaluated the framework on a multimodal fake versus true news dataset, which exhibits framing bias, class imbalance, and distributional drift.<n>The resulting system achieved an accuracy of 89.12% in fake news detection, outperforming the baseline reward models.
arXiv Detail & Related papers (2025-08-27T04:54:33Z) - SCIZOR: A Self-Supervised Approach to Data Curation for Large-Scale Imitation Learning [30.34323856102674]
Imitation learning advances robot capabilities by enabling the acquisition of diverse behaviors from human demonstrations.<n>Existing robotic curation approaches rely on costly manual annotations and perform curation at a coarse granularity.<n>We introduce SCIZOR, a self-supervised data curation framework that filters out low-quality state-action pairs to improve the performance of imitation learning policies.
arXiv Detail & Related papers (2025-05-28T17:45:05Z) - PEEL the Layers and Find Yourself: Revisiting Inference-time Data Leakage for Residual Neural Networks [64.90981115460937]
This paper explores inference-time data leakage risks of deep neural networks (NNs)<n>We propose a novel backward feature inversion method, textbfPEEL, which can effectively recover block-wise input features from the intermediate output of residual NNs.<n>Our results show that PEEL outperforms the state-of-the-art recovery methods by an order of magnitude when evaluated by mean squared error (MSE)
arXiv Detail & Related papers (2025-04-08T20:11:05Z) - Bridging SFT and DPO for Diffusion Model Alignment with Self-Sampling Preference Optimization [67.8738082040299]
Self-Sampling Preference Optimization (SSPO) is a new alignment method for post-training reinforcement learning.<n>SSPO eliminates the need for paired data and reward models while retaining the training stability of SFT.<n>SSPO surpasses all previous approaches on the text-to-image benchmarks and demonstrates outstanding performance on the text-to-video benchmarks.
arXiv Detail & Related papers (2024-10-07T17:56:53Z) - Leveraging Latent Diffusion Models for Training-Free In-Distribution Data Augmentation for Surface Defect Detection [9.784793380119806]
We introduce DIAG, a training-free Diffusion-based In-distribution Anomaly Generation pipeline for data augmentation.
Unlike conventional image generation techniques, we implement a human-in-the-loop pipeline, where domain experts provide multimodal guidance to the model.
We demonstrate the efficacy and versatility of DIAG with respect to state-of-the-art data augmentation approaches on the challenging KSDD2 dataset.
arXiv Detail & Related papers (2024-07-04T14:28:52Z) - Enhancing Consistency and Mitigating Bias: A Data Replay Approach for Incremental Learning [93.90047628101155]
Deep learning systems are prone to catastrophic forgetting when learning from a sequence of tasks.<n>To address this, some methods propose replaying data from previous tasks during new task learning.<n>However, it is not expected in practice due to memory constraints and data privacy issues.
arXiv Detail & Related papers (2024-01-12T12:51:12Z) - Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice.
We introduce a novel noisy correspondence learning framework, namely textbfSelf-textbfReinforcing textbfErrors textbfMitigation (SREM)
arXiv Detail & Related papers (2023-12-27T09:03:43Z) - Continual Learning for Fake Audio Detection [62.54860236190694]
This paper proposes detecting fake without forgetting, a continual-learning-based method, to make the model learn new spoofing attacks incrementally.
Experiments are conducted on the ASVspoof 2019 dataset.
arXiv Detail & Related papers (2021-04-15T07:57:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.