Reinforcement learning for question answering in programming domain
using public community scoring as a human feedback
- URL: http://arxiv.org/abs/2401.10882v1
- Date: Fri, 19 Jan 2024 18:49:36 GMT
- Title: Reinforcement learning for question answering in programming domain
using public community scoring as a human feedback
- Authors: Alexey Gorbatovski and Sergey Kovalchuk
- Abstract summary: We investigate the enhancement of GPT Neo 125M performance in Community Question Answering (CQA) with a focus on programming.
Two distinct reward model training strategies are employed for fine-tuning with Proximal Policy Optimization (PPO).
An auxiliary scoring mechanism is introduced, which demonstrates the limitations of conventional linguistic metrics in evaluating responses in the programming domain.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this study, we investigate the enhancement of GPT Neo 125M performance
in Community Question Answering (CQA) with a focus on programming, through the
integration of Reinforcement Learning from Human Feedback (RLHF) and the
utilization of scores from Stack Overflow. Two distinct reward model training
strategies are employed for fine-tuning with Proximal Policy Optimization
(PPO). Notably, the improvements in performance achieved through this method
are comparable to those of the GPT Neo 2.7B parameter variant. Additionally, an
auxiliary scoring mechanism is introduced, which demonstrates the limitations
of conventional linguistic metrics in evaluating responses in the programming
domain. Through careful analysis, this paper examines the divergence between
traditional linguistic metrics and our human-preference-based reward model,
underscoring the imperative for domain-specific evaluation methods. By
elucidating the complexities involved in applying RLHF to programming CQA and
accentuating the significance of context-aware evaluation, this study
contributes to the ongoing efforts in refining Large Language Models through
focused human feedback.
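The abstract gives no implementation details, but the reward-modelling step it describes can be illustrated with a minimal, hedged sketch: answers to the same Stack Overflow question are paired by their public community score, a scalar reward model is trained with a pairwise ranking loss, and the trained model then supplies rewards for PPO fine-tuning of the GPT Neo policy. The model size comes from the abstract; the hub model ID, pairing heuristic, loss, and toy data below are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch (assumptions, not the authors' implementation):
# a scalar reward model trained on pairs of Stack Overflow answers to the
# same question, ordered by their public community score.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "EleutherAI/gpt-neo-125m"  # policy size named in the abstract
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

# One scalar output acts as the reward score for a question+answer pair.
reward_model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=1
)
reward_model.config.pad_token_id = tokenizer.pad_token_id

def score(question: str, answer: str) -> torch.Tensor:
    enc = tokenizer(question + "\n" + answer, return_tensors="pt",
                    truncation=True, max_length=512)
    return reward_model(**enc).logits.squeeze(-1)

def pairwise_loss(question: str, better: str, worse: str) -> torch.Tensor:
    """Bradley-Terry ranking loss: the answer with the higher community
    score should receive the higher scalar reward."""
    return -F.logsigmoid(score(question, better) - score(question, worse)).mean()

# Toy pair (illustrative text, not taken from the paper's dataset).
q = "How do I reverse a list in Python?"
loss = pairwise_loss(q, "Use lst[::-1] or list.reverse().", "Write a loop.")
loss.backward()
# The trained reward model would then provide scalar rewards for PPO
# fine-tuning of the GPT Neo policy (e.g. with a library such as trl).
```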
Related papers
- NLP and Education: using semantic similarity to evaluate filled gaps in a large-scale Cloze test in the classroom [0.0]
Using data from Cloze tests administered to students in Brazil, word embedding (WE) models for Brazilian Portuguese (PT-BR) were employed to measure semantic similarity.
A comparative analysis between the WE models' scores and the judges' evaluations revealed that GloVe was the most effective model.
arXiv Detail & Related papers (2024-11-02T15:22:26Z)
- RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs [49.386699863989335]
Training large language models (LLMs) to serve as effective assistants for humans requires careful consideration.
A promising approach is reinforcement learning from human feedback (RLHF), which leverages human feedback to update the model in accordance with human preferences.
In this paper, we analyze RLHF through the lens of reinforcement learning principles to develop an understanding of its fundamentals.
arXiv Detail & Related papers (2024-04-12T15:54:15Z)
- HGOT: Hierarchical Graph of Thoughts for Retrieval-Augmented In-Context Learning in Factuality Evaluation [20.178644251662316]
We introduce the hierarchical graph of thoughts (HGOT) to enhance the retrieval of pertinent passages during in-context learning.
The framework employs the divide-and-conquer strategy to break down complex queries into manageable sub-queries.
It refines self-consistency majority voting for answer selection, which incorporates the recently proposed citation recall and precision metrics.
arXiv Detail & Related papers (2024-02-14T18:41:19Z)
- Improving Machine Translation with Human Feedback: An Exploration of Quality Estimation as a Reward Model [75.66013048128302]
In this work, we investigate the potential of employing the QE model as the reward model to predict human preferences for feedback training.
We first identify the overoptimization problem during QE-based feedback training, manifested as an increase in reward while translation quality declines.
To address the problem, we adopt a simple yet effective method that uses rules to detect incorrect translations and assigns a penalty term to their reward scores.
arXiv Detail & Related papers (2024-01-23T16:07:43Z)
- Let's reward step by step: Step-Level reward model as the Navigators for Reasoning [64.27898739929734]
Process-Supervised Reward Model (PRM) furnishes LLMs with step-by-step feedback during the training phase.
We propose a greedy search algorithm that employs the step-level feedback from PRM to optimize the reasoning pathways explored by LLMs.
To explore the versatility of our approach, we develop a novel method to automatically generate a step-level reward dataset for coding tasks and observe similar performance improvements on code generation tasks.
arXiv Detail & Related papers (2023-10-16T05:21:50Z)
- Harnessing the Power of Large Language Models for Empathetic Response Generation: Empirical Investigations and Improvements [28.630542719519855]
This work empirically investigates the performance of large language models (LLMs) in generating empathetic responses.
Extensive experiments show that LLMs can significantly benefit from our proposed methods and are able to achieve state-of-the-art performance in both automatic and human evaluations.
arXiv Detail & Related papers (2023-10-08T12:21:24Z)
- Hierarchical Evaluation Framework: Best Practices for Human Evaluation [17.91641890651225]
The absence of widely accepted human evaluation metrics in NLP hampers fair comparisons among different systems and the establishment of universal assessment standards.
We develop our own hierarchical evaluation framework to provide a more comprehensive representation of the NLP system's performance.
In future work, we will investigate the potential time-saving benefits of our proposed framework for evaluators assessing NLP systems.
arXiv Detail & Related papers (2023-10-03T09:46:02Z)
- Secrets of RLHF in Large Language Models Part I: PPO [81.01936993929127]
Large language models (LLMs) have formulated a blueprint for the advancement of artificial general intelligence.
Reinforcement learning with human feedback (RLHF) emerges as the pivotal technological paradigm underpinning this pursuit.
In this report, we dissect the framework of RLHF, re-evaluate the inner workings of PPO, and explore how the parts comprising PPO algorithms impact policy agent training.
arXiv Detail & Related papers (2023-07-11T01:55:24Z)
- Mapping Language to Programs using Multiple Reward Components with Inverse Reinforcement Learning [12.107259467873092]
We pose program generation from language as Inverse Reinforcement Learning.
Fine-tuning with our approach achieves significantly better performance than competitive methods using Reinforcement Learning (RL).
Generated programs are also preferred by human evaluators over an RL-based approach, and rated higher on relevance, completeness, and human-likeness.
arXiv Detail & Related papers (2021-10-02T16:58:26Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
- Cross Learning in Deep Q-Networks [82.20059754270302]
We propose a novel cross Q-learning algorithm, aimed at alleviating the well-known overestimation problem in value-based reinforcement learning methods.
Our algorithm builds on double Q-learning by maintaining a set of parallel models and estimating the Q-value based on a randomly selected network; a minimal sketch of this target computation is shown after the list.
arXiv Detail & Related papers (2020-09-29T04:58:17Z)
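Since the last entry describes its target computation only in words, a minimal sketch is given below under stated assumptions: a small ensemble of Q-networks is kept in parallel, and the bootstrap target for the network being updated is taken from a randomly selected other member of the ensemble. Network sizes, the discount factor, and the toy transition batch are illustrative choices, not details from the paper.

```python
# Minimal sketch (assumptions, not the paper's implementation) of the cross
# Q-learning target described above: several Q-networks are kept in parallel,
# and each update bootstraps from a randomly chosen other network.
import random
import torch
import torch.nn as nn

def make_qnet(obs_dim: int, n_actions: int) -> nn.Module:
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                         nn.Linear(64, n_actions))

K, OBS_DIM, N_ACTIONS, GAMMA = 4, 8, 3, 0.99
qnets = [make_qnet(OBS_DIM, N_ACTIONS) for _ in range(K)]

def cross_q_target(reward, next_obs, done, updating_idx):
    """Bootstrap target for qnets[updating_idx] taken from a randomly
    selected other network, which dampens single-network overestimation."""
    other = random.choice([i for i in range(K) if i != updating_idx])
    with torch.no_grad():
        next_q = qnets[other](next_obs).max(dim=-1).values
    return reward + GAMMA * (1.0 - done) * next_q

# Illustrative batch of two random transitions.
reward = torch.tensor([1.0, 0.0])
next_obs = torch.randn(2, OBS_DIM)
done = torch.tensor([0.0, 1.0])
target = cross_q_target(reward, next_obs, done, updating_idx=0)
q_pred = qnets[0](torch.randn(2, OBS_DIM)).gather(
    1, torch.tensor([[0], [2]])).squeeze(1)
loss = nn.functional.mse_loss(q_pred, target)
loss.backward()  # one gradient step on qnets[0] would follow
```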