TREND: Tri-teaching for Robust Preference-based Reinforcement Learning with Demonstrations
- URL: http://arxiv.org/abs/2505.06079v1
- Date: Fri, 09 May 2025 14:22:43 GMT
- Title: TREND: Tri-teaching for Robust Preference-based Reinforcement Learning with Demonstrations
- Authors: Shuaiyi Huang, Mara Levy, Anubhav Gupta, Daniel Ekpo, Ruijie Zheng, Abhinav Shrivastava,
- Abstract summary: TREND is a novel framework that integrates few-shot expert demonstrations with a tri-teaching strategy for effective noise mitigation.<n>We evaluate TREND on various robotic manipulation tasks, achieving up to 90% success rates even with noise levels as high as 40%.
- Score: 35.403773147399185
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Preference feedback collected by human or VLM annotators is often noisy, presenting a significant challenge for preference-based reinforcement learning that relies on accurate preference labels. To address this challenge, we propose TREND, a novel framework that integrates few-shot expert demonstrations with a tri-teaching strategy for effective noise mitigation. Our method trains three reward models simultaneously, where each model views its small-loss preference pairs as useful knowledge and teaches such useful pairs to its peer network for updating the parameters. Remarkably, our approach requires as few as one to three expert demonstrations to achieve high performance. We evaluate TREND on various robotic manipulation tasks, achieving up to 90% success rates even with noise levels as high as 40%, highlighting its effective robustness in handling noisy preference feedback. Project page: https://shuaiyihuang.github.io/publications/TREND.
Related papers
- Robust Noisy Correspondence Learning via Self-Drop and Dual-Weight [11.523154025649758]
Crowd-sourcing or web crawling introduces mismatched pairs.<n>Current approaches leverage the effect of deep neural networks to distinguish noise and perform re-weighting.<n>We propose a novel self-drop and dual-weight approach, which achieves elaborate data processing by qua- Partitioning the data.
arXiv Detail & Related papers (2024-12-09T03:06:10Z) - Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs [73.74375912785689]
This paper proposes unified training strategies for speech recognition systems.
We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance.
We also introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples.
arXiv Detail & Related papers (2024-11-04T16:46:53Z) - Large Language Model Enhanced Hard Sample Identification for Denoising Recommendation [4.297249011611168]
Implicit feedback is often used to build recommender systems.
Previous studies have attempted to alleviate this by identifying noisy samples based on their diverged patterns.
We propose a Large Language Model Enhanced Hard Sample Denoising framework.
arXiv Detail & Related papers (2024-09-16T14:57:09Z) - Dual Test-time Training for Out-of-distribution Recommender System [91.15209066874694]
We propose a novel Dual Test-Time-Training framework for OOD Recommendation, termed DT3OR.<n>In DT3OR, we incorporate a model adaptation mechanism during the test-time phase to carefully update the recommendation model.<n>To the best of our knowledge, this paper is the first work to address OOD recommendation via a test-time-training strategy.
arXiv Detail & Related papers (2024-07-22T13:27:51Z) - ROPO: Robust Preference Optimization for Large Language Models [59.10763211091664]
We propose an iterative alignment approach that integrates noise-tolerance and filtering of noisy samples without the aid of external models.
Experiments on three widely-used datasets with Mistral-7B and Llama-2-7B demonstrate that ROPO significantly outperforms existing preference alignment methods.
arXiv Detail & Related papers (2024-04-05T13:58:51Z) - One-Shot Learning as Instruction Data Prospector for Large Language Models [108.81681547472138]
textscNuggets uses one-shot learning to select high-quality instruction data from extensive datasets.
We show that instruction tuning with the top 1% of examples curated by textscNuggets substantially outperforms conventional methods employing the entire dataset.
arXiv Detail & Related papers (2023-12-16T03:33:12Z) - Learning Robust Recommender from Noisy Implicit Feedback [140.7090392887355]
We propose a new training strategy named Adaptive Denoising Training (ADT)
ADT adaptively prunes the noisy interactions by two paradigms (i.e., Truncated Loss and Reweighted Loss)
We consider extra feedback (e.g., rating) as auxiliary signal and propose three strategies to incorporate extra feedback into ADT.
arXiv Detail & Related papers (2021-12-02T12:12:02Z) - Towards Sample-efficient Apprenticeship Learning from Suboptimal
Demonstration [1.6114012813668934]
We present Systematic Self-Supervised Reward Regression, S3RR, to investigate systematic alternatives for trajectory degradation.
We find S3RR can learn comparable or better reward correlation with ground-truth against a state-of-the-art learning from suboptimal demonstration framework.
arXiv Detail & Related papers (2021-10-08T19:15:32Z) - Noisy Self-Knowledge Distillation for Text Summarization [83.49809205891496]
We apply self-knowledge distillation to text summarization which we argue can alleviate problems with maximum-likelihood training.
Our student summarization model is trained with guidance from a teacher which generates smoothed labels to help regularize training.
We demonstrate experimentally on three benchmarks that our framework boosts the performance of both pretrained and non-pretrained summarizers.
arXiv Detail & Related papers (2020-09-15T12:53:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.