Learning to Expand: Reinforced Pseudo-relevance Feedback Selection for
Information-seeking Conversations
- URL: http://arxiv.org/abs/2011.12771v1
- Date: Wed, 25 Nov 2020 14:33:18 GMT
- Title: Learning to Expand: Reinforced Pseudo-relevance Feedback Selection for
Information-seeking Conversations
- Authors: Haojie Pan, Cen Chen, Minghui Qiu, Liu Yang, Feng Ji, Jun Huang,
Haiqing Chen
- Abstract summary: We treat PRF selection as a learning task and propose a reinforcement learning based method that can be trained end to end without any human annotations.
Our model not only selects meaningful PRF terms to expand response candidates but also achieves the best results among all baseline methods on a variety of evaluation metrics.
- Score: 47.43989857297574
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Intelligent personal assistant systems for information-seeking conversations
are increasingly popular in real-world applications, especially for e-commerce
companies. With the development of research in such conversation systems, the
pseudo-relevance feedback (PRF) has demonstrated its effectiveness in
incorporating relevance signals from external documents. However, the existing
studies are either based on heuristic rules or require heavy manual labeling.
In this work, we treat PRF selection as a learning task and propose a
reinforcement learning based method that can be trained end to end without any
human annotations. More specifically, we propose a reinforced selector that
extracts useful PRF terms to enhance response candidates and a BERT-based
response ranker that ranks the PRF-enhanced responses. The performance of the
ranker serves as the reward that guides the selector to extract useful PRF
terms, and thus boosts the task performance. Extensive experiments on both standard
benchmark and commercial datasets show the superiority of our reinforced PRF
term selector compared with other potential soft or hard selection methods.
Both qualitative case studies and quantitative analysis show that our model
not only selects meaningful PRF terms to expand response candidates but also
achieves the best results among all baseline methods on a variety of
evaluation metrics. We have also deployed our method in the online production
system of an e-commerce company, where it yields a significant improvement over
the existing online ranking system.
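To make the training loop described above concrete, the following is a minimal, hypothetical sketch of REINFORCE-style PRF term selection: a selector samples a hard subset of candidate PRF terms, a ranker scores the expanded candidate, and that score is fed back as the reward. The toy cosine-similarity ranker (standing in for the BERT ranker), the feature dimensions, and the reward shaping are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): a REINFORCE-style loop in which a
# term selector is rewarded by the score of a (toy) response ranker.
import torch
import torch.nn as nn


class TermSelector(nn.Module):
    """Scores each candidate PRF term; a Bernoulli sample decides inclusion."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, 1)

    def forward(self, term_feats: torch.Tensor) -> torch.Tensor:
        # term_feats: (num_terms, feat_dim) -> inclusion probabilities (num_terms,)
        return torch.sigmoid(self.scorer(term_feats)).squeeze(-1)


def toy_ranker_score(response_feat: torch.Tensor, selected_term_feats: torch.Tensor) -> float:
    """Stand-in for the BERT ranker: higher when selected terms align with the response."""
    if selected_term_feats.numel() == 0:
        return 0.0
    sim = torch.cosine_similarity(selected_term_feats, response_feat.unsqueeze(0), dim=-1)
    return sim.mean().item()


def reinforce_step(selector, optimizer, term_feats, response_feat, baseline: float = 0.0) -> float:
    probs = selector(term_feats)                        # (num_terms,)
    dist = torch.distributions.Bernoulli(probs)
    mask = dist.sample()                                # hard 0/1 selection of PRF terms
    reward = toy_ranker_score(response_feat, term_feats[mask.bool()])
    # REINFORCE: maximize E[reward] by weighting the log-prob of the sampled
    # selection with the (baseline-subtracted) ranker reward.
    loss = -(reward - baseline) * dist.log_prob(mask).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward


if __name__ == "__main__":
    torch.manual_seed(0)
    selector = TermSelector(feat_dim=16)
    opt = torch.optim.Adam(selector.parameters(), lr=1e-2)
    response = torch.randn(16)          # illustrative response-candidate features
    terms = torch.randn(30, 16)         # illustrative candidate PRF term features
    for _ in range(200):
        r = reinforce_step(selector, opt, terms, response)
    print(f"final reward: {r:.3f}")
```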
Related papers
- Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [56.24431208419858]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset.
We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset.
arXiv Detail & Related papers (2024-10-10T16:01:51Z)
- OPTune: Efficient Online Preference Tuning [107.44836901099]
We propose a more efficient data exploration strategy for online preference tuning (OPTune)
OPTune dynamically samples informative responses for on-policy preference alignment.
In our evaluations, OPTune'd LLMs enjoy 1.27-1.56x faster training speed due to the efficient data exploration strategy.
arXiv Detail & Related papers (2024-06-11T18:55:04Z)
- Online Self-Preferring Language Models [34.22412851864247]
Online Self-Preferring (OSP) language models learn from self-generated response pairs and self-judged preference strengths.
OSP achieves state-of-the-art alignment performance across various metrics in two widely used human preference datasets.
arXiv Detail & Related papers (2024-05-23T02:13:34Z)
- Aligning Large Language Models by On-Policy Self-Judgment [49.31895979525054]
Existing approaches for aligning large language models with human preferences face a trade-off: on-policy learning requires a separate reward model (RM).
We present a novel alignment framework, SELF-JUDGE, that does on-policy learning and is parameter efficient.
We show that rejection sampling by itself can further improve performance without an additional evaluator.
arXiv Detail & Related papers (2024-02-17T11:25:26Z)
- Reinforcement Replaces Supervision: Query focused Summarization using Deep Reinforcement Learning [43.123290672073814]
We deal with systems that generate summaries from document(s) based on a query.
Motivated by the insight that Reinforcement Learning (RL) provides a generalization to Supervised Learning (SL) for Natural Language Generation, we use an RL-based approach for this task.
We develop multiple Policy Gradient networks, trained on various reward signals: ROUGE, BLEU, and Semantic Similarity.
arXiv Detail & Related papers (2023-11-29T10:38:16Z)
- Reinforcement Learning from Statistical Feedback: the Journey from AB Testing to ANT Testing [1.1142354615369272]
Reinforcement Learning from Human Feedback (RLHF) has played a crucial role in the success of large models such as ChatGPT.
We attempt to fill this gap with statistical business feedback instead of human feedback, using AB testing.
Statistical inference methods are used to obtain preferences for training the reward network, which fine-tunes the pre-trained model.
arXiv Detail & Related papers (2023-11-24T07:50:52Z)
- SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning [168.89470249446023]
We present SURF, a semi-supervised reward learning framework that utilizes a large amount of unlabeled samples with data augmentation.
In order to leverage unlabeled samples for reward learning, we infer pseudo-labels of the unlabeled samples based on the confidence of the preference predictor (see the sketch after this list).
Our experiments demonstrate that our approach significantly improves the feedback-efficiency of the preference-based method on a variety of locomotion and robotic manipulation tasks.
arXiv Detail & Related papers (2022-03-18T16:50:38Z)
- Information Directed Reward Learning for Reinforcement Learning [64.33774245655401]
We learn a model of the reward function that allows standard RL algorithms to achieve high expected return with as few expert queries as possible.
In contrast to prior active reward learning methods designed for specific types of queries, IDRL naturally accommodates different query types.
We support our findings with extensive evaluations in multiple environments and with different types of queries.
arXiv Detail & Related papers (2021-02-24T18:46:42Z)
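As referenced in the SURF entry above, here is a minimal, hypothetical sketch of confidence-based pseudo-labeling for a preference predictor. The Bradley-Terry-style predictor, the segment feature vectors, and the 0.95 confidence threshold are illustrative assumptions; SURF's data augmentation step is omitted here.

```python
# Sketch (not SURF's released code): keep only unlabeled preference pairs that
# the current predictor is confident about, and treat its prediction as a label.
import torch
import torch.nn as nn


class PreferencePredictor(nn.Module):
    """Predicts P(segment_a preferred over segment_b) from segment features."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.reward = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, seg_a: torch.Tensor, seg_b: torch.Tensor) -> torch.Tensor:
        # Bradley-Terry style: preference probability from the reward difference.
        return torch.sigmoid(self.reward(seg_a) - self.reward(seg_b)).squeeze(-1)


@torch.no_grad()
def pseudo_label(predictor, unlabeled_pairs, threshold: float = 0.95):
    """Assign 0/1 pseudo-labels to pairs whose predicted preference is confident."""
    kept = []
    for seg_a, seg_b in unlabeled_pairs:
        p = predictor(seg_a, seg_b).item()
        if max(p, 1.0 - p) >= threshold:
            kept.append((seg_a, seg_b, 1.0 if p >= 0.5 else 0.0))
    return kept  # these pairs would be added to the labeled set for reward learning


if __name__ == "__main__":
    torch.manual_seed(0)
    predictor = PreferencePredictor(feat_dim=8)
    unlabeled = [(torch.randn(8), torch.randn(8)) for _ in range(100)]
    confident = pseudo_label(predictor, unlabeled)
    print(f"kept {len(confident)} of {len(unlabeled)} unlabeled pairs")
```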
This list is automatically generated from the titles and abstracts of the papers in this site.