Related papers: Learning from Imperfect Human Feedback: a Tale from Corruption-Robust Dueling

Learning from Imperfect Human Feedback: a Tale from Corruption-Robust Dueling

URL: http://arxiv.org/abs/2405.11204v2
Date: Mon, 14 Oct 2024 21:55:11 GMT
Title: Learning from Imperfect Human Feedback: a Tale from Corruption-Robust Dueling
Authors: Yuwei Cheng, Fan Yao, Xuefeng Liu, Haifeng Xu,
Abstract summary: This paper studies Learning from Imperfect Human Feedback (LIHF), addressing the potential irrationality or imperfect perception when learning from comparative human feedback. Building on evidences that human's imperfection decays over time (i.e., humans learn to improve), we cast this problem as a concave-utility continuous-action dueling bandit but under a restricted form of corruption. We show how this framework can be easily applied to obtain corruption-robust guarantees for other gradient-based dueling bandit algorithms.
Score: 35.54611331654877
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper studies Learning from Imperfect Human Feedback (LIHF), addressing the potential irrationality or imperfect perception when learning from comparative human feedback. Building on evidences that human's imperfection decays over time (i.e., humans learn to improve), we cast this problem as a concave-utility continuous-action dueling bandit but under a restricted form of corruption: i.e., the corruption scale is decaying over time as $t^{\rho-1}$ for some "imperfection rate" $\rho \in [0, 1]$. With $T$ as the total number of iterations, we establish a regret lower bound of $ \Omega(\max\{\sqrt{T}, T^{\rho}\}) $ for LIHF, even when $\rho$ is known. For the same setting, we develop the Robustified Stochastic Mirror Descent for Imperfect Dueling (RoSMID) algorithm, which achieves nearly optimal regret $\tilde{\mathcal{O}}(\max\{\sqrt{T}, T^{\rho}\})$. Core to our analysis is a novel framework for analyzing gradient-based algorithms for dueling bandit under corruption, and we demonstrate its general applicability by showing how this framework can be easily applied to obtain corruption-robust guarantees for other popular gradient-based dueling bandit algorithms. Our theoretical results are validated by extensive experiments.

Related papers

Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback [58.66941279460248]
Learning from human feedback plays an important role in aligning generative models, such as large language models (LLM) We study a model within this problem domain--contextual dueling bandits with adversarial feedback, where the true preference label can be flipped by an adversary. We propose an algorithm namely robust contextual dueling bandit (algo), which is based on uncertainty-weighted maximum likelihood estimation.
arXiv Detail & Related papers (2024-04-16T17:59:55Z)
Corruption-Robust Offline Reinforcement Learning with General Function Approximation [60.91257031278004]
We investigate the problem of corruption in offline reinforcement learning (RL) with general function approximation. Our goal is to find a policy that is robust to such corruption and minimizes the suboptimality gap with respect to the optimal policy for the uncorrupted Markov decision processes (MDPs)
arXiv Detail & Related papers (2023-10-23T04:07:26Z)
Corruption-Robust Algorithms with Uncertainty Weighting for Nonlinear Contextual Bandits and Markov Decision Processes [59.61248760134937]
We propose an efficient algorithm to achieve a regret of $tildeO(sqrtT+zeta)$. The proposed algorithm relies on the recently developed uncertainty-weighted least-squares regression from linear contextual bandit. We generalize our algorithm to the episodic MDP setting and first achieve an additive dependence on the corruption level $zeta$.
arXiv Detail & Related papers (2022-12-12T15:04:56Z)
Minimax Rates for Robust Community Detection [19.229475414802213]
We study the problem of community detection in the block model with adversarial node corruptions. Our main result is an efficient algorithm that can tolerate an $epsilon$-fraction of corruptions and unbounded error $O(epsilon) + e-fracC2 (1 pm o(1))$ where $C = (sqrta - sqrtb)2$ is the signal-to-noise ratio. We show that our algorithms are doubly-robust in the sense that they work in an even more
arXiv Detail & Related papers (2022-07-25T04:45:16Z)
Nearly Optimal Algorithms for Linear Contextual Bandits with Adversarial Corruptions [98.75618795470524]
We study the linear contextual bandit problem in the presence of adversarial corruption. We propose a new algorithm based on the principle of optimism in the face of uncertainty.
arXiv Detail & Related papers (2022-05-13T17:58:58Z)
A Robust Phased Elimination Algorithm for Corruption-Tolerant Gaussian Process Bandits [118.22458816174144]
We propose a novel robust elimination-type algorithm that runs in epochs, combines exploration with infrequent switching to select a small subset of actions, and plays each action for multiple time instants. Our algorithm, GP Robust Phased Elimination (RGP-PE), successfully balances robustness to corruptions with exploration and exploitation. We perform the first empirical study of robustness in the corrupted GP bandit setting, and show that our algorithm is robust against a variety of adversarial attacks.
arXiv Detail & Related papers (2022-02-03T21:19:36Z)
Linear Contextual Bandits with Adversarial Corruptions [91.38793800392108]
We study the linear contextual bandit problem in the presence of adversarial corruption. We present a variance-aware algorithm that is adaptive to the level of adversarial contamination $C$.
arXiv Detail & Related papers (2021-10-25T02:53:24Z)
On Optimal Robustness to Adversarial Corruption in Online Decision Problems [27.68461396741871]
We show that optimal robustness can be expressed by a square-root dependency on the amount of corruption. For the multi-armed bandit problem, we also provide a nearly tight lower bound up to a logarithmic factor.
arXiv Detail & Related papers (2021-09-22T18:26:45Z)
Improved Corruption Robust Algorithms for Episodic Reinforcement Learning [43.279169081740726]
We study episodic reinforcement learning under unknown adversarial corruptions in both the rewards and the transition probabilities of the underlying system. We propose new algorithms which, compared to the existing results, achieve strictly better regret bounds in terms of total corruptions. Our results follow from a general algorithmic framework that combines corruption-robust policy elimination meta-algorithms, and plug-in reward-free exploration sub-algorithms.
arXiv Detail & Related papers (2021-02-13T07:04:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.