Semi-supervised Batch Learning From Logged Data
- URL: http://arxiv.org/abs/2209.07148v3
- Date: Sun, 18 Feb 2024 15:26:01 GMT
- Title: Semi-supervised Batch Learning From Logged Data
- Authors: Gholamali Aminian, Armin Behnamnia, Roberto Vega, Laura Toni,
Chengchun Shi, Hamid R. Rabiee, Omar Rivasplata, Miguel R. D. Rodrigues
- Abstract summary: We build on the counterfactual risk minimization framework, which also assumes access to propensity scores.
We propose learning methods for problems where feedback is missing for some samples, so the logged data contain both samples with feedback and samples with missing feedback.
- Score: 24.826544828460158
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy learning methods are intended to learn a policy from logged data,
which includes context, action, and feedback (cost or reward) for each sample
point. In this work, we build on the counterfactual risk minimization
framework, which also assumes access to propensity scores. We propose learning
methods for problems where feedback is missing for some samples, so the logged
data contain both samples with feedback and samples with missing feedback. We refer
to this type of learning as semi-supervised batch learning from logged data,
which arises in a wide range of application domains. We derive a novel upper
bound for the true risk under the inverse propensity score estimator to address
this kind of learning problem. Using this bound, we propose a regularized
semi-supervised batch learning method with logged data where the regularization
term is feedback-independent and, as a result, can be evaluated using the
logged missing-feedback data. Consequently, even though feedback is only
present for some samples, a policy can still be learned by leveraging the
missing-feedback samples. Experiments on benchmark datasets indicate that
these algorithms learn policies that outperform the logging policies.
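To make the setup concrete, here is a minimal sketch, not the paper's actual method: the inverse propensity score (IPS) term weights each observed cost by pi(a|x)/pi0(a|x) over the feedback samples, while a feedback-independent regularizer, illustrated here as a KL divergence to the logging policy, needs only contexts and so can be evaluated on the missing-feedback samples. The function names, the KL choice, and the trade-off weight `lam` are assumptions for illustration; the paper derives its regularizer from its upper bound on the true risk.

```python
import numpy as np

def ips_risk(pi, labeled):
    """IPS estimate of the risk of policy `pi` on samples with feedback.
    Each sample is (context x, logged action a, cost c, propensity p0)."""
    return np.mean([c * pi(x, a) / p0 for (x, a, c, p0) in labeled])

def kl_to_logging(pi, pi0, contexts, actions):
    """Feedback-independent regularizer: average KL(pi(.|x) || pi0(.|x)).
    Uses only contexts, so missing-feedback samples contribute."""
    kls = [sum(pi(x, a) * np.log(pi(x, a) / pi0(x, a)) for a in actions)
           for x in contexts]
    return np.mean(kls)

def semi_supervised_objective(pi, pi0, labeled, unlabeled_contexts, actions, lam=0.1):
    """Regularized objective: IPS risk on feedback samples plus a
    feedback-free penalty on missing-feedback samples (`lam` illustrative)."""
    return ips_risk(pi, labeled) + lam * kl_to_logging(pi, pi0, unlabeled_contexts, actions)
```

Here `pi` and `pi0` are callables returning action probabilities given a context; minimizing this objective over a policy class trades off estimated risk on the feedback samples against deviation from the logging policy on the rest.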
Related papers
- Downstream-Pretext Domain Knowledge Traceback for Active Learning [138.02530777915362]
We propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance.
DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator.
Experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-20T01:34:13Z)
- Enhancing Consistency and Mitigating Bias: A Data Replay Approach for Incremental Learning [100.7407460674153]
Deep learning systems are prone to catastrophic forgetting when learning from a sequence of tasks.
To mitigate the problem, a line of methods propose to replay the data of experienced tasks when learning new tasks.
However, storing such data is often infeasible in practice due to memory constraints or data privacy concerns.
As an alternative, data-free replay methods have been proposed that invert samples from the classification model.
arXiv Detail & Related papers (2024-01-12T12:51:12Z)
- Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF [10.43364672415871]
In practice, preference learning from human feedback depends on incomplete data with hidden context.
We show that standard applications of preference learning, including reinforcement learning from human feedback, implicitly aggregate over hidden contexts.
We introduce a class of methods called distributional preference learning (DPL) to better account for hidden context.
arXiv Detail & Related papers (2023-12-13T18:51:34Z)
- On-Policy Policy Gradient Reinforcement Learning Without On-Policy Sampling [3.5253513747455303]
We introduce an adaptive, off-policy sampling method to improve the data efficiency of on-policy policy gradient algorithms.
Our method, Proximal Robust On-Policy Sampling (PROPS), reduces sampling error by collecting data with a behavior policy.
arXiv Detail & Related papers (2023-11-14T16:37:28Z)
- Solving Inverse Problems with Score-Based Generative Priors learned from Noisy Data [1.7969777786551424]
SURE-Score is an approach for learning score-based generative models using training samples corrupted by additive Gaussian noise.
We demonstrate the generality of SURE-Score by learning priors and applying posterior sampling to ill-posed inverse problems in two practical applications.
arXiv Detail & Related papers (2023-05-02T02:51:01Z)
- Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experiment results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
- Unsupervised Few-shot Learning via Deep Laplacian Eigenmaps [13.6555672824229]
We present an unsupervised few-shot learning method via deep Laplacian eigenmaps.
Our method learns representations from unlabeled data by grouping similar samples together.
We analytically show how deep Laplacian eigenmaps avoid collapsed representations in unsupervised learning.
arXiv Detail & Related papers (2022-10-07T14:53:03Z)
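As background on the technique named in the entry above (this is the classical algorithm, not the paper's deep variant): Laplacian eigenmaps embed samples using the bottom nontrivial eigenvectors of a graph Laplacian built from pairwise similarities. A minimal NumPy sketch, where the kNN graph, heat-kernel bandwidth, and embedding dimension are illustrative choices:

```python
import numpy as np

def laplacian_eigenmaps(X, n_neighbors=10, dim=2):
    """Embed rows of X into `dim` dimensions via the bottom nontrivial
    eigenvectors of the unnormalized graph Laplacian L = D - W."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # Symmetric kNN adjacency with heat-kernel weights.
    W = np.zeros((n, n))
    sigma2 = np.median(d2)  # crude bandwidth choice for the sketch
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:n_neighbors + 1]  # skip self
        W[i, nbrs] = np.exp(-d2[i, nbrs] / sigma2)
    W = np.maximum(W, W.T)
    L = np.diag(W.sum(1)) - W
    # eigh returns eigenvalues in ascending order; drop the trivial
    # constant eigenvector at eigenvalue 0.
    vals, vecs = np.linalg.eigh(L)
    return vecs[:, 1:dim + 1]
```

Nearby samples in the graph get nearby embeddings, which is the "grouping similar samples together" effect the summary describes; sklearn.manifold.SpectralEmbedding implements the same idea with more care.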
- Simulating Bandit Learning from User Feedback for Extractive Question Answering [51.97943858898579]
We study learning from user feedback for extractive question answering by simulating feedback using supervised data.
We show that systems initially trained on a small number of examples can dramatically improve given feedback from users on model-predicted answers.
arXiv Detail & Related papers (2022-03-18T17:47:58Z)
- Variance-Optimal Augmentation Logging for Counterfactual Evaluation in Contextual Bandits [25.153656462604268]
Methods for offline A/B testing and counterfactual learning are seeing rapid adoption in search and recommender systems.
The counterfactual estimators that are commonly used in these methods can have large bias and large variance when the logging policy is very different from the target policy being evaluated.
This paper introduces Minimum Variance Augmentation Logging (MVAL), a method for constructing logging policies that minimize the variance of the downstream evaluation or learning problem.
arXiv Detail & Related papers (2022-02-03T17:37:11Z)
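For context on the variance claim in the entry above, the variance of the plain IPS estimator over n i.i.d. logged samples is driven by the importance weights (a standard identity, not specific to MVAL):

```latex
\mathrm{Var}\!\left[\hat{R}_{\mathrm{IPS}}(\pi)\right]
  = \frac{1}{n}\,\mathrm{Var}_{(x,a)\sim\pi_0}\!\left[c(x,a)\,\frac{\pi(a\mid x)}{\pi_0(a\mid x)}\right]
```

This grows with the mismatch between target policy pi and logging policy pi_0 through the squared importance weights, which is exactly the quantity a variance-minimizing logging policy aims to keep small.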
- DEALIO: Data-Efficient Adversarial Learning for Imitation from Observation [57.358212277226315]
In imitation learning from observation (IfO), a learning agent seeks to imitate a demonstrating agent using only observations of the demonstrated behavior, without access to the control signals generated by the demonstrator.
Recent methods based on adversarial imitation learning have led to state-of-the-art performance on IfO problems, but they typically suffer from high sample complexity due to a reliance on data-inefficient, model-free reinforcement learning algorithms.
This issue makes them impractical to deploy in real-world settings, where gathering samples can incur high costs in terms of time, energy, and risk.
We propose a more data-efficient IfO algorithm.
arXiv Detail & Related papers (2021-03-31T23:46:32Z)
- Proposal Learning for Semi-Supervised Object Detection [76.83284279733722]
It is non-trivial to train object detectors on unlabeled data due to the unavailability of ground truth labels.
We present a proposal learning approach to learn proposal features and predictions from both labeled and unlabeled data.
arXiv Detail & Related papers (2020-01-15T00:06:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.