Towards Reliable and Fluent Large Language Models: Incorporating
Feedback Learning Loops in QA Systems
- URL: http://arxiv.org/abs/2309.06384v1
- Date: Fri, 8 Sep 2023 09:39:53 GMT
- Title: Towards Reliable and Fluent Large Language Models: Incorporating
Feedback Learning Loops in QA Systems
- Authors: Dongyub Lee, Taesun Whang, Chanhee Lee, Heuiseok Lim
- Abstract summary: We build a dataset to train a critic model capable of evaluating the citation, correctness, and fluency of responses generated by large language models.
We propose an automated feedback mechanism that leverages the critic model to offer real-time feedback on heterogeneous aspects of generated text.
Experimental results demonstrate the efficacy of our approach, including a 4% precision increase in citation and an approximately 8% enhancement in the MAUVE metric for fluency.
- Score: 10.58737969057445
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have emerged as versatile tools in various daily
applications. However, they are fraught with issues that undermine their
utility and trustworthiness. These include the incorporation of erroneous
references (citation), the generation of hallucinated information
(correctness), and the inclusion of superfluous details or omission of crucial ones
(fluency). To ameliorate these concerns, this study makes several key
contributions. First, we build a dataset to train a critic model capable of
evaluating the citation, correctness, and fluency of responses generated by
LLMs in QA systems. Second, we propose an automated feedback mechanism that
leverages the critic model to offer real-time feedback on heterogeneous aspects
of generated text. Third, we introduce a feedback learning loop that uses this
critic model to iteratively improve the performance of the LLM responsible for
response generation. Experimental results demonstrate the efficacy of our
approach, showing substantial improvements in citation and fluency metrics for
ChatGPT, including a 4% precision increase in citation and an approximately 8%
enhancement in the MAUVE metric for fluency, while maintaining high levels of
correctness.
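The critic-driven loop described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the abstract does not say how feedback is fed back into the generator, so the sketch assumes a simple regenerate-with-feedback scheme at inference time. The names `Feedback`, `refine`, `toy_generate`, `toy_critique`, the `threshold` of 0.8, and the `max_rounds` limit are all hypothetical, and the toy functions stand in for a real LLM and a trained critic model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# The three aspects named in the abstract.
ASPECTS = ("citation", "correctness", "fluency")


@dataclass
class Feedback:
    scores: Dict[str, float]   # aspect -> score in [0, 1] (assumed scale)
    comments: Dict[str, str]   # aspect -> textual feedback for the generator


def refine(
    question: str,
    generate: Callable[[str, str], str],
    critique: Callable[[str, str], Feedback],
    threshold: float = 0.8,    # hypothetical pass mark per aspect
    max_rounds: int = 3,       # hypothetical cap on critic calls
) -> Tuple[str, int]:
    """Regenerate until the critic passes every aspect or rounds run out.

    Returns the final response and the number of revisions made.
    """
    response = generate(question, "")
    revisions = 0
    for _ in range(max_rounds):
        fb = critique(question, response)
        failing = [a for a in ASPECTS if fb.scores[a] < threshold]
        if not failing:
            break
        # Pass only the feedback for failing aspects back to the generator.
        hints = " ".join(fb.comments[a] for a in failing)
        response = generate(question, hints)
        revisions += 1
    return response, revisions


# Toy stand-ins so the sketch runs without any model.
def toy_generate(question: str, hints: str) -> str:
    return "Paris [1]." if "cite" in hints else "Paris."


def toy_critique(question: str, response: str) -> Feedback:
    cited = "[1]" in response
    return Feedback(
        scores={"citation": 1.0 if cited else 0.0,
                "correctness": 1.0,
                "fluency": 1.0},
        comments={"citation": "Please cite a source.",
                  "correctness": "",
                  "fluency": ""},
    )


if __name__ == "__main__":
    print(refine("What is the capital of France?", toy_generate, toy_critique))
```

With the toy stand-ins, the first answer lacks a citation, the critic flags that aspect, and the second answer passes, so one revision is made. A real system would replace both callables with model calls and might instead use the critic's output as a training signal.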
Related papers
- Investigating Automatic Scoring and Feedback using Large Language Models [46.1232919707345]
This paper explores the efficacy of PEFT-based quantized models, employing a classification or regression head, to fine-tune language models for automatic grading and feedback generation.
The results show that predictions of grade scores via fine-tuned LLMs are highly accurate, achieving less than 3% error in grade percentage on average.
arXiv Detail & Related papers (2024-05-01T16:13:54Z)
- Understanding the Effects of Iterative Prompting on Truthfulness [36.022674676543126]
We investigate the impact of iterative prompting on the truthfulness of Large Language Models (LLMs).
We introduce several prompting variants designed to address the identified issues.
Our work provides a nuanced understanding of iterative prompting and introduces novel approaches to enhance the truthfulness of LLMs.
arXiv Detail & Related papers (2024-02-09T18:57:08Z)
- Enhancing Large Language Model Performance To Answer Questions and Extract Information More Accurately [2.1715455600756646]
Large Language Models (LLMs) generate responses to questions.
Their effectiveness is often hindered by sub-optimal quality of answers and occasional failures to provide accurate responses to questions.
To address these challenges, a fine-tuning process is employed, involving feedback and examples to refine models.
arXiv Detail & Related papers (2024-01-27T00:18:07Z)
- Silkie: Preference Distillation for Large Visual Language Models [56.10697821410489]
This paper explores preference distillation for large vision language models (LVLMs).
We first build a vision-language feedback dataset utilizing AI annotation.
We adopt GPT-4V to assess the generated outputs regarding helpfulness, visual faithfulness, and ethical considerations.
The resulting model, Silkie, achieves 6.9% and 9.5% relative improvements on the MME benchmark in perception and cognition capabilities, respectively.
arXiv Detail & Related papers (2023-12-17T09:44:27Z)
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights improves, for example, the absolute performance of the Llama 2 model by up to 15% points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- N-Critics: Self-Refinement of Large Language Models with Ensemble of Critics [5.516095889257118]
We propose a self-correction mechanism for Large Language Models (LLMs) to mitigate issues such as toxicity and fact hallucination.
This method involves refining model outputs through an ensemble of critics and the model's own feedback.
arXiv Detail & Related papers (2023-10-28T11:22:22Z)
- ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful for triggering hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
- The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
- CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing [139.77117915309023]
CRITIC allows large language models to validate and amend their own outputs in a manner similar to human interaction with tools.
Comprehensive evaluations involving free-form question answering, mathematical program synthesis, and toxicity reduction demonstrate that CRITIC consistently enhances the performance of LLMs.
arXiv Detail & Related papers (2023-05-19T15:19:44Z)