Towards Reliable and Fluent Large Language Models: Incorporating
Feedback Learning Loops in QA Systems
- URL: http://arxiv.org/abs/2309.06384v1
- Date: Fri, 8 Sep 2023 09:39:53 GMT
- Title: Towards Reliable and Fluent Large Language Models: Incorporating
Feedback Learning Loops in QA Systems
- Authors: Dongyub Lee, Taesun Whang, Chanhee Lee, Heuiseok Lim
- Abstract summary: We build a dataset to train a critic model capable of evaluating the citation, correctness, and fluency of responses generated by large language models.
We propose an automated feedback mechanism that leverages the critic model to offer real-time feedback on heterogeneous aspects of generated text.
Experimental results demonstrate the efficacy of our approach, including a 4% precision increase in citation and an approximately 8% enhancement in the MAUVE metric for fluency.
- Score: 10.58737969057445
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have emerged as versatile tools in various daily
applications. However, they are fraught with issues that undermine their
utility and trustworthiness. These include the incorporation of erroneous
references (citation), the generation of hallucinated information
(correctness), and the inclusion of superfluous details or omission of crucial ones
(fluency). To ameliorate these concerns, this study makes several key
contributions. First, we build a dataset to train a critic model capable of
evaluating the citation, correctness, and fluency of responses generated by
LLMs in QA systems. Second, we propose an automated feedback mechanism that
leverages the critic model to offer real-time feedback on heterogeneous aspects
of generated text. Third, we introduce a feedback learning loop that uses this
critic model to iteratively improve the performance of the LLM responsible for
response generation. Experimental results demonstrate the efficacy of our
approach, showing substantial improvements in citation and fluency metrics for
ChatGPT, including a 4% precision increase in citation and an approximately 8%
enhancement in the MAUVE metric for fluency, while maintaining high levels of
correctness.
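The feedback loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how critic feedback reaches the generator, so the sketch assumes inference-time regeneration. The names `Feedback`, `feedback_loop`, `toy_generate`, and `toy_critique` are hypothetical, and the toy functions stand in for a real LLM and a trained critic model.

```python
from dataclasses import dataclass
from typing import Callable, List

# The three aspects named in the abstract.
ASPECTS = ("citation", "correctness", "fluency")


@dataclass
class Feedback:
    aspect: str    # one of ASPECTS
    passed: bool   # whether the critic accepts the answer on this aspect
    comment: str   # free-text feedback passed back to the generator


# Hypothetical interfaces: a generator takes the question plus any
# outstanding feedback; a critic returns one Feedback per aspect.
Generator = Callable[[str, List[Feedback]], str]
Critic = Callable[[str, str], List[Feedback]]


def feedback_loop(question: str, generate: Generator, critique: Critic,
                  max_rounds: int = 3) -> str:
    """Regenerate the answer until the critic accepts every aspect
    or max_rounds critic passes have been used."""
    answer = generate(question, [])
    for _ in range(max_rounds):
        feedback = critique(question, answer)
        failed = [f for f in feedback if not f.passed]
        if not failed:
            break
        answer = generate(question, failed)
    return answer


def toy_generate(question: str, feedback: List[Feedback]) -> str:
    # Stand-in for an LLM: adds a citation marker only when asked to.
    answer = "Paris is the capital of France."
    if any(f.aspect == "citation" for f in feedback):
        answer += " [1]"
    return answer


def toy_critique(question: str, answer: str) -> List[Feedback]:
    # Stand-in for a critic model: only checks for a citation marker.
    cited = "[1]" in answer
    return [
        Feedback("citation", cited, "ok" if cited else "add a supporting citation"),
        Feedback("correctness", True, "ok"),
        Feedback("fluency", True, "ok"),
    ]


result = feedback_loop("What is the capital of France?", toy_generate, toy_critique)
print(result)
```

In this toy run the first answer has no citation, the critic flags the citation aspect, and the second answer passes all three checks. A real system would replace both toy functions with model calls.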
Related papers
- Training Language Models to Critique With Multi-agent Feedback [102.42751835338233]
The MultiCritique pipeline improves the critique ability of LLMs by utilizing multi-agent feedback.
The pipeline aggregates high-quality critiques from multiple agents instead of a single model.
Our fine-tuned 7B model significantly surpasses other advanced 7B-13B open-source models.
arXiv Detail & Related papers (2024-10-20T04:57:45Z)
- On the Capacity of Citation Generation by Large Language Models [38.47160164251295]
Retrieval-augmented generation (RAG) appears as a promising method to alleviate the "hallucination" problem in large language models (LLMs).
arXiv Detail & Related papers (2024-10-15T03:04:26Z)
- VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment [55.7956150385255]
We investigate the efficacy of AI feedback to scale supervision for aligning vision-language models.
We introduce VLFeedback, the first large-scale vision-language feedback dataset.
We train Silkie, an LVLM fine-tuned via direct preference optimization on VLFeedback.
arXiv Detail & Related papers (2024-10-12T07:56:47Z)
- Improving Retrieval Augmented Language Model with Self-Reasoning [20.715106330314605]
We propose a novel self-reasoning framework aimed at improving the reliability and traceability of RALMs.
The framework involves constructing self-reason trajectories with three processes: a relevance-aware process, an evidence-aware selective process, and a trajectory analysis process.
We have evaluated our framework across four public datasets to demonstrate the superiority of our method.
arXiv Detail & Related papers (2024-07-29T09:05:10Z)
- Investigating Automatic Scoring and Feedback using Large Language Models [46.1232919707345]
This paper explores the efficacy of PEFT-based quantized models, employing a classification or regression head, to fine-tune language models for automatic grading and feedback generation.
The results show that predictions of grade scores via fine-tuned LLMs are highly accurate, achieving less than 3% error in grade percentage on average.
arXiv Detail & Related papers (2024-05-01T16:13:54Z)
- Enhancing Large Language Model Performance To Answer Questions and Extract Information More Accurately [2.1715455600756646]
Large Language Models (LLMs) generate responses to questions.
Their effectiveness is often hindered by sub-optimal quality of answers and occasional failures to provide accurate responses to questions.
To address these challenges, a fine-tuning process is employed, involving feedback and examples to refine models.
arXiv Detail & Related papers (2024-01-27T00:18:07Z)
- Silkie: Preference Distillation for Large Visual Language Models [56.10697821410489]
This paper explores preference distillation for large vision language models (LVLMs).
We first build a vision-language feedback dataset utilizing AI annotation.
We adopt GPT-4V to assess the generated outputs regarding helpfulness, visual faithfulness, and ethical considerations.
The resulting model, Silkie, achieves 6.9% and 9.5% relative improvements on the MME benchmark in perception and cognition capabilities, respectively.
arXiv Detail & Related papers (2023-12-17T09:44:27Z)
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
- The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.