ReFeR: Improving Evaluation and Reasoning through Hierarchy of Models
- URL: http://arxiv.org/abs/2407.12877v2
- Date: Wed, 9 Oct 2024 17:51:44 GMT
- Title: ReFeR: Improving Evaluation and Reasoning through Hierarchy of Models
- Authors: Yaswanth Narsupalli, Abhranil Chandra, Sreevatsa Muppirala, Manish Gupta, Pawan Goyal
- Abstract summary: We introduce a tuning-free framework called ReFeR, designed to evaluate generative outputs, including both text and images.
We rigorously evaluate our framework, ReFeR, across four diverse evaluation tasks.
Experiments on four reasoning tasks demonstrate superior collective reasoning abilities of the framework.
- Score: 12.035509884945789
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Assessing the quality of outputs generated by generative models, such as large language models and vision language models, presents notable challenges. Traditional methods for evaluation typically rely on either human assessments, which are resource-intensive, or automatic metrics that often show a low correlation with human judgment. Another common approach is to use deep learning systems, which not only consume a substantial amount of compute and time but also require extensive training data. In this study, we introduce a tuning-free framework called ReFeR, designed to evaluate generative outputs, including both text and images, by leveraging a 2-level hierarchy of LLMs and VLMs themselves. We rigorously evaluate our framework, ReFeR, across four diverse evaluation tasks. The framework not only improves the accuracy of these evaluations, surpassing previous benchmarks, but also generates constructive feedback. Interestingly, the framework is also applicable to reasoning tasks. Experiments on four reasoning tasks demonstrate superior collective reasoning abilities of the framework. We present two variants of the framework: ReFeR-Turbo, optimized for accelerated performance, and ReFeR-Lite, offering a more cost-effective solution. ReFeR-Lite is $\sim7.7\times$ more efficient while being comparably accurate to ReFeR-Turbo. We make code, data, and a PIP package publicly available. See this PIP URL https://pypi.org/project/refer-agents/ and this Git URL https://github.com/yaswanth-iitkgp/ReFeR_Code .
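The 2-level hierarchy mentioned in the abstract can be pictured with a minimal sketch. This is not the ReFeR implementation or the `refer-agents` API: the function names, the `(score, comment)` judgment shape, and the rule-based stand-ins for models are all assumptions made for illustration. The only idea taken from the abstract is that several first-level evaluators judge an output and a second-level evaluator combines their judgments into a final score with feedback.

```python
from statistics import mean
from typing import Callable, List, Tuple

# Hypothetical types: a judgment is a (score, comment) pair.
Judgment = Tuple[float, str]
FirstLevel = Callable[[str], Judgment]
SecondLevel = Callable[[str, List[Judgment]], Judgment]


def evaluate(output: str,
             first_level: List[FirstLevel],
             second_level: SecondLevel) -> Judgment:
    """Run every first-level evaluator, then let the second level decide."""
    judgments = [judge(output) for judge in first_level]
    return second_level(output, judgments)


# Rule-based stand-ins for model calls (assumptions, not real LLMs or VLMs).
def length_judge(output: str) -> Judgment:
    score = min(5.0, len(output.split()) / 2)
    return score, "length-based score"


def punctuation_judge(output: str) -> Judgment:
    score = 5.0 if output.strip().endswith(".") else 2.0
    return score, "punctuation-based score"


def averaging_reviewer(output: str, judgments: List[Judgment]) -> Judgment:
    # A real second level would be a stronger model reading the first-level
    # comments; averaging is only a placeholder for that step.
    score = mean([s for s, _ in judgments])
    feedback = "; ".join(comment for _, comment in judgments)
    return score, feedback


if __name__ == "__main__":
    print(evaluate("The cat sat on the mat.",
                   [length_judge, punctuation_judge],
                   averaging_reviewer))
```

In practice each stand-in would be replaced by a call to an actual model, and the second level would reason over the first-level comments rather than average the scores.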
Related papers
- SFR-RAG: Towards Contextually Faithful LLMs [57.666165819196486]
Retrieval Augmented Generation (RAG) is a paradigm that integrates external contextual information with large language models (LLMs) to enhance factual accuracy and relevance.
We introduce SFR-RAG, a small LLM that is instruction-tuned with an emphasis on context-grounded generation and hallucination minimization.
We also present ContextualBench, a new evaluation framework compiling multiple popular and diverse RAG benchmarks.
arXiv Detail & Related papers (2024-09-16T01:08:18Z)
- MaFeRw: Query Rewriting with Multi-Aspect Feedbacks for Retrieval-Augmented Large Language Models [34.39053202801489]
In a real-world RAG system, the current query often involves spoken ellipses and ambiguous references from dialogue contexts.
We propose a novel query rewriting method MaFeRw, which improves RAG performance by integrating multi-aspect feedback from both the retrieval process and generated results.
Experimental results on two conversational RAG datasets demonstrate that MaFeRw achieves superior generation metrics and more stable training compared to baselines.
arXiv Detail & Related papers (2024-08-30T07:57:30Z)
- Evaluating Language Model Context Windows: A "Working Memory" Test and Inference-time Correction [10.428174043080622]
Large language models are prominently used in real-world applications, often tasked with reasoning over large volumes of documents.
We propose SWiM, an evaluation framework that addresses the limitations of standard tests.
We also propose medoid voting, a simple, but effective training-free approach that helps alleviate this effect.
arXiv Detail & Related papers (2024-07-04T05:46:20Z)
- LLMs-as-Instructors: Learning from Errors Toward Automating Model Improvement [93.38736019287224]
The "LLMs-as-Instructors" framework autonomously enhances the training of smaller target models.
Inspired by the theory of "Learning from Errors", this framework employs an instructor LLM to meticulously analyze the specific errors within a target model.
Within this framework, we implement two strategies: "Learning from Error," which focuses solely on incorrect responses to tailor training data, and "Learning from Error by Contrast", which uses contrastive learning to analyze both correct and incorrect responses for a deeper understanding of errors.
arXiv Detail & Related papers (2024-06-29T17:16:04Z)
- RaFe: Ranking Feedback Improves Query Rewriting for RAG [83.24385658573198]
We propose a framework for training query rewriting models free of annotations.
By leveraging a publicly available reranker, RaFe provides feedback that aligns well with the rewriting objectives.
arXiv Detail & Related papers (2024-05-23T11:00:19Z)
- Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation [51.99752147380505]
This paper presents a benchmark self-evolving framework to dynamically evaluate Large Language Models (LLMs).
We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence.
Our framework widens performance discrepancies both between different models and within the same model across various tasks.
arXiv Detail & Related papers (2024-02-18T03:40:06Z)
- Leveraging Reinforcement Learning and Large Language Models for Code Optimization [14.602997316032706]
This paper introduces a new framework to decrease the complexity of code optimization.
The proposed framework builds on large language models (LLMs) and reinforcement learning (RL).
We run several experiments on the PIE dataset using a CodeT5 language model and RRHF, a new reinforcement learning algorithm.
arXiv Detail & Related papers (2023-12-09T19:50:23Z)
- Learning to Retrieve In-Context Examples for Large Language Models [69.9707552694766]
Large language models (LLMs) have demonstrated their ability to learn in-context.
The effectiveness of in-context learning is heavily reliant on the quality of the selected examples.
We propose a novel framework to iteratively train dense retrievers that can identify high-quality in-context examples.
arXiv Detail & Related papers (2023-07-14T05:23:08Z)
- Building an Efficient and Effective Retrieval-based Dialogue System via Mutual Learning [27.04857039060308]
We propose to combine the best of both worlds to build a retrieval system.
We employ a fast bi-encoder to replace the traditional feature-based pre-retrieval model.
We train the pre-retrieval model and the re-ranking model at the same time via mutual learning.
arXiv Detail & Related papers (2021-10-01T01:32:33Z)
- Learning from Context or Names? An Empirical Study on Neural Relation Extraction [112.06614505580501]
We study the effect of two main information sources in text: textual context and entity mentions (names).
We propose an entity-masked contrastive pre-training framework for relation extraction (RE).
Our framework can improve the effectiveness and robustness of neural models in different RE scenarios.
arXiv Detail & Related papers (2020-10-05T11:21:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.