Learning to Answer from Correct Demonstrations
- URL: http://arxiv.org/abs/2510.15464v1
- Date: Fri, 17 Oct 2025 09:20:17 GMT
- Title: Learning to Answer from Correct Demonstrations
- Authors: Nirmit Joshi, Gene Li, Siddharth Bhandari, Shiva Prasad Kasiviswanathan, Cong Ma, Nathan Srebro,
- Abstract summary: We study the problem of learning to generate an answer (or completion) to a question (or prompt), where there could be multiple correct answers.<n>We formalize the problem as offline imitation learning in contextual bandits, with demonstrations from some optimal estimation.<n>Our work motivates looking beyond likelihood when learning from correct demonstrations.
- Score: 26.367368795756903
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We study the problem of learning to generate an answer (or completion) to a question (or prompt), where there could be multiple correct answers, any one of which is acceptable at test time. Learning is based on demonstrations of some correct answer to each training question, as in Supervised Fine Tuning (SFT). We formalize the problem as offline imitation learning in contextual bandits, with demonstrations from some optimal policy, without explicitly observed rewards. Prior work assumes that the demonstrator belongs to a low-complexity policy class, which motivates maximum likelihood estimation (i.e., log-loss minimization). In contrast, we propose relying only on the reward model (specifying which answers are correct) being in a low-cardinality class, which we argue is a weaker assumption. We show that likelihood maximization methods can fail in this case, and instead devise an alternative novel approach that learns with sample complexity logarithmic in the cardinality of the reward class. Our work motivates looking beyond likelihood maximization when learning from correct demonstrations.
Related papers
- Answer-Consistent Chain-of-thought Reinforcement Learning For Multi-modal Large Langauge Models [33.398631680508814]
We propose Answer-Consistent Reinforcement Learning that modifies the GRPO algorithm with an auxiliary consistency check.<n>We design a consistency-verification reward that grants a high reward only if both the original and the post-shuffle answers agree and are correct.<n>We evaluate ACRE on challenging Video Reasoning benchmarks and multimodal math reasoning benchmarks, achieving an average 2.2% and 1.5% improvement.
arXiv Detail & Related papers (2025-10-11T08:32:52Z) - What Can You Do When You Have Zero Rewards During RL? [3.0795668932789515]
Reinforcement learning (RL) with outcome-based rewards has proven effective for improving large language models (LLMs) on complex reasoning tasks.<n>We study this scenario through the graph search task introduced in Bachmann et al. (2024) and evaluate recent methods that incorporate desirable components.<n>We find that a simple data-centric intervention of adding easier samples to the training set enables the model to eventually solve the original hard task despite starting from zero reward.
arXiv Detail & Related papers (2025-10-04T23:10:38Z) - The Majority is not always right: RL training for solution aggregation [53.1050856072799]
We train an aggregator model to review, reconcile, and synthesize a final, correct answer.<n>A key ingredient is careful balancing of easy and hard training examples.<n>We find our method, AggLM, outperforms both strong rule-based and reward-model baselines.
arXiv Detail & Related papers (2025-09-08T16:39:38Z) - Q-Probe: A Lightweight Approach to Reward Maximization for Language Models [16.801981347658625]
We present an approach called Q-probing to adapt a pre-trained language model to maximize a task-specific reward function.
At a high level, Q-probing sits between heavier approaches such as finetuning and lighter approaches such as few shot prompting.
arXiv Detail & Related papers (2024-02-22T16:43:16Z) - Using Foundation Models to Detect Policy Violations with Minimal
Supervision [15.599296461516982]
We seek to leverage foundation models' capabilities to detect policy violations.
We compose the hard-prompts with soft prompt tuning to produce a classifier that attains high accuracy with very little supervision.
We identify several unintuitive aspects of foundation models.
arXiv Detail & Related papers (2023-06-09T20:08:48Z) - Increasing Probability Mass on Answer Choices Does Not Always Improve
Accuracy [60.18632773935895]
Spreading probability mass across multiple surface forms with identical meaning is thought to cause an underestimation of a model's true performance.
We propose a mathematical formalism for SFC which allows us to quantify and bound its impact for the first time.
We identify a simple method for reducing it -- namely, increasing probability mass on the given answer choices by a) including them in the prompt and b) using in-context learning with even just one example.
arXiv Detail & Related papers (2023-05-24T00:27:00Z) - Towards Theoretical Understanding of Inverse Reinforcement Learning [45.3190496371625]
Inverse reinforcement learning (IRL) denotes a powerful family of algorithms for recovering a reward function justifying the behavior demonstrated by an expert agent.
In this paper, we make a step towards closing the theory gap of IRL in the case of finite-horizon problems with a generative model.
arXiv Detail & Related papers (2023-04-25T16:21:10Z) - Chaos is a Ladder: A New Theoretical Understanding of Contrastive
Learning via Augmentation Overlap [64.60460828425502]
We propose a new guarantee on the downstream performance of contrastive learning.
Our new theory hinges on the insight that the support of different intra-class samples will become more overlapped under aggressive data augmentations.
We propose an unsupervised model selection metric ARC that aligns well with downstream accuracy.
arXiv Detail & Related papers (2022-03-25T05:36:26Z) - Reinforcement Learning in Reward-Mixing MDPs [74.41782017817808]
episodic reinforcement learning in a reward-mixing Markov decision process (MDP)
cdot S2 A2)$ episodes, where $H$ is time-horizon and $S, A$ are the number of states and actions respectively.
epsilon$-optimal policy after exploring $tildeO(poly(H,epsilon-1) cdot S2 A2)$ episodes, where $H$ is time-horizon and $S, A$ are the number of states and actions respectively.
arXiv Detail & Related papers (2021-10-07T18:55:49Z) - A Mutual Information Maximization Approach for the Spurious Solution
Problem in Weakly Supervised Question Answering [60.768146126094955]
Weakly supervised question answering usually has only the final answers as supervision signals.
There may exist many spurious solutions that coincidentally derive the correct answer, but training on such solutions can hurt model performance.
We propose to explicitly exploit such semantic correlations by maximizing the mutual information between question-answer pairs and predicted solutions.
arXiv Detail & Related papers (2021-06-14T05:47:41Z) - Replacing Rewards with Examples: Example-Based Policy Search via
Recursive Classification [133.20816939521941]
In the standard Markov decision process formalism, users specify tasks by writing down a reward function.
In many scenarios, the user is unable to describe the task in words or numbers, but can readily provide examples of what the world would look like if the task were solved.
Motivated by this observation, we derive a control algorithm that aims to visit states that have a high probability of leading to successful outcomes, given only examples of successful outcome states.
arXiv Detail & Related papers (2021-03-23T16:19:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.