Self-Refinement of Language Models from External Proxy Metrics Feedback
- URL: http://arxiv.org/abs/2403.00827v1
- Date: Tue, 27 Feb 2024 19:13:01 GMT
- Title: Self-Refinement of Language Models from External Proxy Metrics Feedback
- Authors: Keshav Ramji, Young-Suk Lee, Ramón Fernandez Astudillo, Md Arafat
Sultan, Tahira Naseem, Asim Munawar, Radu Florian, Salim Roukos
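- Abstract summary: We introduce Proxy Metric-based Self-Refinement (ProMiSe), which enables an LLM to refine its own initial response guided by external metrics feedback.
ProMiSe iteratively refines its response one principle at a time.
We apply ProMiSe to the open-source language models Flan-T5-XXL and Llama-2-13B-Chat.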
- Score: 27.57840561708484
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: It is often desirable for Large Language Models (LLMs) to capture multiple
objectives when providing a response. In document-grounded response generation,
for example, agent responses are expected to be relevant to a user's query
while also being grounded in a given document. In this paper, we introduce
Proxy Metric-based Self-Refinement (ProMiSe), which enables an LLM to refine
its own initial response along key dimensions of quality guided by external
metrics feedback, yielding an overall better final response. ProMiSe leverages
feedback on response quality through principle-specific proxy metrics, and
iteratively refines its response one principle at a time. We apply ProMiSe to
open source language models Flan-T5-XXL and Llama-2-13B-Chat, to evaluate its
performance on document-grounded question answering datasets, MultiDoc2Dial and
QuAC, demonstrating that self-refinement improves response quality. We further
show that fine-tuning Llama-2-13B-Chat on the synthetic dialogue data generated
by ProMiSe yields significant performance improvements over the zero-shot
baseline as well as a supervised fine-tuned model on human annotated data.
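The refinement loop described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the token-overlap metrics, the 0.7 threshold, the `refine` function, and the `canned_proposer` stub (standing in for an LLM that proposes rewrites) are all hypothetical and chosen only to make the control flow concrete. The two principles, relevance to the query and groundedness in the document, come from the abstract.

```python
import re
from typing import Callable, List, Tuple

# (response, query, document) -> score in [0, 1]
Metric = Callable[[str, str, str], float]
# (response, principle, query, document) -> candidate rewrites
Proposer = Callable[[str, str, str, str], List[str]]


def _tokens(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


def relevance(response: str, query: str, document: str) -> float:
    """Toy proxy metric: fraction of query tokens found in the response."""
    q, r = _tokens(query), set(_tokens(response))
    return sum(t in r for t in q) / len(q) if q else 0.0


def groundedness(response: str, query: str, document: str) -> float:
    """Toy proxy metric: fraction of response tokens found in the document."""
    r, d = _tokens(response), set(_tokens(document))
    return sum(t in d for t in r) / len(r) if r else 0.0


def refine(query: str, document: str, initial: str,
           principles: List[Tuple[str, Metric]],
           propose: Proposer, threshold: float = 0.7) -> str:
    """Refine one principle at a time, keeping a candidate only if the
    proxy metric for the current principle improves (assumed rule)."""
    response = initial
    for name, metric in principles:
        score = metric(response, query, document)
        if score >= threshold:
            continue  # already good enough on this principle
        for candidate in propose(response, name, query, document):
            candidate_score = metric(candidate, query, document)
            if candidate_score > score:
                response, score = candidate, candidate_score
    return response


def canned_proposer(response: str, principle: str,
                    query: str, document: str) -> List[str]:
    """Hypothetical stand-in for an LLM that rewrites the response."""
    canned = {
        "relevance": ["The permit office is probably open most days."],
        "groundedness": [document],
    }
    return canned[principle]


document = "The permit office is open from 9 am to 5 pm on weekdays."
query = "When is the permit office open?"
initial = "It is probably open most days."

final = refine(query, document, initial,
               [("relevance", relevance), ("groundedness", groundedness)],
               canned_proposer)
print(final)
```

In this sketch each pass checks only the metric for the principle being refined, so a later pass can in principle lower an earlier score; a fuller version would need some policy for that case.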
Related papers
- Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation [8.975024781390077]
We present MIRAGE (Model Internals-based RAG Explanations), a plug-and-play approach using model internals for faithful answer attribution in question answering applications.
We evaluate our proposed approach on a multilingual QA dataset, finding high agreement with human answer attribution.
arXiv Detail & Related papers (2024-06-19T16:10:26Z) - Fine-Tuning or Fine-Failing? Debunking Performance Myths in Large Language Models [0.8399688944263842]
Large Language Models (LLMs) have the capability to understand and generate human-like text from input queries.
This study extends this concept to the integration of LLMs within Retrieval-Augmented Generation (RAG) pipelines.
We evaluate the impact of fine-tuning on the LLMs' capacity for data extraction and contextual understanding.
arXiv Detail & Related papers (2024-06-17T04:35:17Z) - CaLM: Contrasting Large and Small Language Models to Verify Grounded Generation [76.31621715032558]
Grounded generation aims to equip language models (LMs) with the ability to produce more credible and accountable responses.
We introduce CaLM, a novel verification framework.
Our framework empowers smaller LMs, which rely less on parametric memory, to validate the output of larger LMs.
arXiv Detail & Related papers (2024-06-08T06:04:55Z) - Re-ReST: Reflection-Reinforced Self-Training for Language Agents [101.22559705696885]
Self-training in language agents can generate supervision from the agent itself.
We present Reflection-Reinforced Self-Training (Re-ReST), which uses a reflector to refine low-quality generated samples.
arXiv Detail & Related papers (2024-06-03T16:21:38Z) - Enhancing Large Language Model Performance To Answer Questions and Extract Information More Accurately [2.1715455600756646]
Large Language Models (LLMs) generate responses to questions.
Their effectiveness is often hindered by sub-optimal quality of answers and occasional failures to provide accurate responses to questions.
To address these challenges, a fine-tuning process is employed, involving feedback and examples to refine models.
arXiv Detail & Related papers (2024-01-27T00:18:07Z) - PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is a framework dedicated to assessing long-form text generation.
It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers.
It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions.
arXiv Detail & Related papers (2024-01-26T18:12:25Z) - Effective Large Language Model Adaptation for Improved Grounding and Citation Generation [48.07830615309543]
This paper focuses on improving large language models (LLMs) by grounding their responses in retrieved passages and by providing citations.
We propose a new framework, AGREE, that improves the grounding from a holistic perspective.
Our framework tunes LLMs to self-ground the claims in their responses and provide accurate citations to retrieved documents.
arXiv Detail & Related papers (2023-11-16T03:22:25Z) - LLMRefine: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback [65.84061725174269]
Recent large language models (LLMs) are leveraging human feedback to improve their generation quality.
We propose LLMRefine, an inference-time optimization method to refine an LLM's output.
We conduct experiments on three text generation tasks, including machine translation, long-form question answering (QA), and topical summarization.
LLMRefine consistently outperforms all baseline approaches, achieving improvements of up to 1.7 MetricX points on translation tasks, 8.1 ROUGE-L on ASQA, and 2.2 ROUGE-L on topical summarization.
arXiv Detail & Related papers (2023-11-15T19:52:11Z) - Towards Reliable and Fluent Large Language Models: Incorporating Feedback Learning Loops in QA Systems [10.58737969057445]
We build a dataset to train a critic model capable of evaluating the citation, correctness, and fluency of responses generated by large language models.
We propose an automated feedback mechanism that leverages the critic model to offer real-time feedback on heterogeneous aspects of generated text.
Experimental results demonstrate the efficacy of our approach, including a 4% precision increase in citation and an approximately 8% enhancement in the MAUVE metric for fluency.
arXiv Detail & Related papers (2023-09-08T09:39:53Z) - Read before Generate! Faithful Long Form Question Answering with Machine Reading [77.17898499652306]
Long-form question answering (LFQA) aims to generate a paragraph-length answer for a given question.
We propose a new end-to-end framework that jointly models answer generation and machine reading.
arXiv Detail & Related papers (2022-03-01T10:41:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.