Related papers: Review-Feedback-Reason (ReFeR): A Novel Framework for NLG Evaluation and Reasoning

URL: http://arxiv.org/abs/2407.12877v1
Date: Tue, 16 Jul 2024 08:25:26 GMT
Title: Review-Feedback-Reason (ReFeR): A Novel Framework for NLG Evaluation and Reasoning
Authors: Yaswanth Narsupalli, Abhranil Chandra, Sreevatsa Muppirala, Manish Gupta, Pawan Goyal
Abstract summary: Review-Feedback-Reason (ReFeR) is a novel evaluation framework for NLG using LLM agents. We rigorously test ReFeR using two pre-existing benchmark datasets on diverse NLG tasks. We highlight the effectiveness of our methodology through its application on three reasoning benchmarks.
Score: 12.035509884945789
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Assessing the quality of Natural Language Generation (NLG) outputs, such as those produced by large language models (LLMs), poses significant challenges. Traditional approaches involve either resource-intensive human evaluations or automatic metrics, which often exhibit a low correlation with human judgment. In this study, we propose Review-Feedback-Reason (ReFeR), a novel evaluation framework for NLG using LLM agents. We rigorously test ReFeR using two pre-existing benchmark datasets on diverse NLG tasks. The proposed framework not only enhances the accuracy of NLG evaluation, surpassing previous benchmarks by ~20%, but also generates constructive feedback and significantly improves collective reasoning. This feedback is then leveraged for the creation of instruction-tuning datasets, which, when used to fine-tune smaller models like Mistral-7B, make them extremely good evaluators, yielding a better correlation with human evaluations and performance nearly on par with GPT-3.5. We highlight the effectiveness of our methodology through its application on three reasoning benchmarks, where it outperforms most of the state-of-the-art methods, and also outperforms the reasoning capabilities of models like GPT-3.5 Turbo by ~11.67% and GPT-4 by ~1% on average.

Related papers

When Retriever Meets Generator: A Joint Model for Code Comment Generation [3.6781644685120924]
RAGSum fuses retrieval and generation using a single CodeT5 backbone. A contrastive pre-training phase shapes code embeddings for nearest-neighbor search. A lightweight self-refinement loop is deployed to polish the final output.
arXiv Detail & Related papers (2025-07-16T18:12:27Z)
Reinforcing Video Reasoning with Focused Thinking [65.85683941058916]
We propose TW-GRPO, a novel framework that enhances visual reasoning with focused thinking and dense reward granularity. Specifically, we employ a token weighting mechanism that prioritizes tokens with high informational density. We also reformulate RL training by shifting from single-choice to multi-choice QA tasks.
arXiv Detail & Related papers (2025-05-30T15:42:19Z)
Writing-Zero: Bridge the Gap Between Non-verifiable Tasks and Verifiable Rewards [11.149294285483782]
We propose a unified RLVR-based training paradigm that bridges the gap between non-verifiable tasks and verifiable rewards. We introduce a writing-principle-based pairwise Generative Reward Model (GenRM) and a novel Bootstrapped Relative Policy Optimization (BRPO) algorithm. Our approach empowers LLMs to develop robust writing capabilities without supervised fine-tuning.
arXiv Detail & Related papers (2025-05-30T14:34:57Z)
From Rankings to Insights: Evaluation Should Shift Focus from Leaderboard to Feedback [36.68929551237421]
We introduce Feedbacker, an evaluation framework that provides comprehensive and fine-grained results. Our project homepage and dataset are available at https://liudan193.io/Feedbacker.
arXiv Detail & Related papers (2025-05-10T16:52:40Z)
REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models [8.587685197004097]
REINFORCE++ is a novel approach that removes the critic model while using the normalized reward of a batch as the baseline. It exhibits robust performance across various reward models without requiring prompt set truncation. It achieves superior generalization in both RLHF and long chain-of-thought settings compared to existing REINFORCE-based methods.
arXiv Detail & Related papers (2025-01-04T02:08:06Z)
SFR-RAG: Towards Contextually Faithful LLMs [57.666165819196486]
Retrieval Augmented Generation (RAG) is a paradigm that integrates external contextual information with large language models (LLMs) to enhance factual accuracy and relevance. We introduce SFR-RAG, a small LLM that is instruction-tuned with an emphasis on context-grounded generation and hallucination minimization. We also present ConBench, a new evaluation framework compiling multiple popular and diverse RAG benchmarks.
arXiv Detail & Related papers (2024-09-16T01:08:18Z)
MaFeRw: Query Rewriting with Multi-Aspect Feedbacks for Retrieval-Augmented Large Language Models [34.39053202801489]
In a real-world RAG system, the current query often involves spoken ellipses and ambiguous references from dialogue contexts. We propose a novel query rewriting method MaFeRw, which improves RAG performance by integrating multi-aspect feedback from both the retrieval process and generated results. Experimental results on two conversational RAG datasets demonstrate that MaFeRw achieves superior generation metrics and more stable training compared to baselines.
arXiv Detail & Related papers (2024-08-30T07:57:30Z)
RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [66.93260816493553]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios. With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
Evaluating Language Model Context Windows: A "Working Memory" Test and Inference-time Correction [10.428174043080622]
Large language models are prominently used in real-world applications, often tasked with reasoning over large volumes of documents. We propose SWiM, an evaluation framework that addresses the limitations of standard tests. We also propose medoid voting, a simple, but effective training-free approach that helps alleviate this effect.
arXiv Detail & Related papers (2024-07-04T05:46:20Z)
LLMs-as-Instructors: Learning from Errors Toward Automating Model Improvement [93.38736019287224]
The "LLMs-as-Instructors" framework autonomously enhances the training of smaller target models. Inspired by the theory of "Learning from Errors", this framework employs an instructor LLM to meticulously analyze the specific errors within a target model. Within this framework, we implement two strategies: "Learning from Error", which focuses solely on incorrect responses to tailor training data, and "Learning from Error by Contrast", which uses contrastive learning to analyze both correct and incorrect responses for a deeper understanding of errors.
arXiv Detail & Related papers (2024-06-29T17:16:04Z)
RaFe: Ranking Feedback Improves Query Rewriting for RAG [83.24385658573198]
We propose a framework for training query rewriting models free of annotations. By leveraging a publicly available reranker, RaFe provides feedback aligned well with the rewriting objectives.
arXiv Detail & Related papers (2024-05-23T11:00:19Z)
Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation [51.99752147380505]
This paper presents a benchmark self-evolving framework to dynamically evaluate Large Language Models (LLMs). We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence. Our framework widens performance discrepancies both between different models and within the same model across various tasks.
arXiv Detail & Related papers (2024-02-18T03:40:06Z)
Leveraging Reinforcement Learning and Large Language Models for Code Optimization [14.602997316032706]
This paper introduces a new framework to decrease the complexity of code optimization. The proposed framework builds on large language models (LLMs) and reinforcement learning (RL). We run several experiments on the PIE dataset using a CodeT5 language model and RRHF, a new reinforcement learning algorithm.
arXiv Detail & Related papers (2023-12-09T19:50:23Z)
Learning to Retrieve In-Context Examples for Large Language Models [69.9707552694766]
Large language models (LLMs) have demonstrated their ability to learn in-context. The effectiveness of in-context learning is heavily reliant on the quality of the selected examples. We propose a novel framework to iteratively train dense retrievers that can identify high-quality in-context examples.
arXiv Detail & Related papers (2023-07-14T05:23:08Z)
Building an Efficient and Effective Retrieval-based Dialogue System via Mutual Learning [27.04857039060308]
We propose to combine the best of both worlds to build a retrieval system. We employ a fast bi-encoder to replace the traditional feature-based pre-retrieval model. We train the pre-retrieval model and the re-ranking model at the same time via mutual learning.
arXiv Detail & Related papers (2021-10-01T01:32:33Z)
Learning from Context or Names? An Empirical Study on Neural Relation Extraction [112.06614505580501]
We study the effect of two main information sources in text: textual context and entity mentions (names). We propose an entity-masked contrastive pre-training framework for relation extraction (RE). Our framework can improve the effectiveness and robustness of neural models in different RE scenarios.
arXiv Detail & Related papers (2020-10-05T11:21:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.