Related papers: Rate, Explain and Cite (REC): Enhanced Explanation and Attribution in Automatic Evaluation by Large Language Models

Rate, Explain and Cite (REC): Enhanced Explanation and Attribution in Automatic Evaluation by Large Language Models

URL: http://arxiv.org/abs/2411.02448v2
Date: Wed, 19 Feb 2025 00:50:10 GMT
Title: Rate, Explain and Cite (REC): Enhanced Explanation and Attribution in Automatic Evaluation by Large Language Models
Authors: Aliyah R. Hsu, James Zhu, Zhichao Wang, Bin Bi, Shubham Mehrotra, Shiva K. Pentyala, Katherine Tan, Xiang-Bo Mao, Roshanak Omrani, Sougata Chaudhuri, Regunathan Radhakrishnan, Sitaram Asur, Claire Na Cheng, Bin Yu,
Abstract summary: This paper introduces three fine-tuned general-purpose LLM autoevaluators, REC-8B, REC-12B and REC-70B.<n>They are designed to evaluate generated text across several dimensions: faithfulness, instruction following, coherence, and completeness.<n>Our general-purpose LLM auto-evaluator, REC-70B, outperforms state-of-the-art LLMs, excelling in content evaluation by delivering better quality explanation and citation with minimal bias.
Score: 14.366425348130289
License: http://creativecommons.org/licenses/by/4.0/
Abstract: LLMs have demonstrated impressive proficiency in generating coherent and high-quality text, making them valuable across a range of text-generation tasks. However, rigorous evaluation of this generated content is crucial, as ensuring its quality remains a significant challenge due to persistent issues such as factual inaccuracies and hallucination. This paper introduces three fine-tuned general-purpose LLM autoevaluators, REC-8B, REC-12B and REC-70B, specifically designed to evaluate generated text across several dimensions: faithfulness, instruction following, coherence, and completeness. These models not only provide ratings for these metrics but also offer detailed explanation and verifiable citation, thereby enhancing trust in the content. Moreover, the models support various citation modes, accommodating different requirements for latency and granularity. Extensive evaluations on diverse benchmarks demonstrate that our general-purpose LLM auto-evaluator, REC-70B, outperforms state-of-the-art LLMs, excelling in content evaluation by delivering better quality explanation and citation with minimal bias. It achieves Rank #1 as of Feb 15th, 2025 as a generative model on the RewardBench leaderboard under the model name TextEval-Llama3.1-70B. Our REC dataset and models are available at https://github.com/adelaidehsu/REC.

Related papers

WritingBench: A Comprehensive Benchmark for Generative Writing [87.48445972563631]
We present WritingBench, a benchmark designed to evaluate large language models (LLMs) across 6 core writing domains and 100, encompassing creative, persuasive, informative, and technical writing. We propose a query-dependent evaluation framework that empowers LLMs to dynamically generate instance-specific assessment criteria. This framework is complemented by a fine-tuned critic model for criteria-aware scoring, enabling evaluations in style, format and length.
arXiv Detail & Related papers (2025-03-07T08:56:20Z)
Consistency Evaluation of News Article Summaries Generated by Large (and Small) Language Models [0.0]
Large Language Models (LLMs) have shown promise in generating fluent abstractive summaries but they can produce hallucinated details not grounded in the source text. This paper embarks on an exploration of text summarization with a diverse set of techniques, including TextRank, BART, Mistral-7B-Instruct, and OpenAI GPT-3.5-Turbo. We find that all summarization models produce consistent summaries when tested on the XL-Sum dataset.
arXiv Detail & Related papers (2025-02-28T01:58:17Z)
CoSER: Coordinating LLM-Based Persona Simulation of Established Roles [62.886267684392635]
CoSER dataset covers 17,966 characters from 771 renowned books. We develop CoSER 8B and CoSER 70B, i.e., advanced open role-playing LLMs built on LLaMA-3.1 models.
arXiv Detail & Related papers (2025-02-13T08:55:24Z)
Beyond Content Relevance: Evaluating Instruction Following in Retrieval Models [17.202017214385826]
This study evaluates the instruction-following capabilities of various retrieval models beyond content relevance. We develop a novel retrieval evaluation benchmark spanning six document-level attributes. Our findings reveal that while reranking models generally surpass retrieval models in instruction following, they still face challenges in handling certain attributes.
arXiv Detail & Related papers (2024-10-31T11:47:21Z)
On the Capacity of Citation Generation by Large Language Models [38.47160164251295]
Retrieval-augmented generation (RAG) appears as a promising method to alleviate the "hallucination" problem in large language models (LLMs)
arXiv Detail & Related papers (2024-10-15T03:04:26Z)
Peering into the Mind of Language Models: An Approach for Attribution in Contextual Question Answering [9.86691461253151]
We introduce a novel method for attribution in contextual question answering, leveraging the hidden state representations of large language models (LLMs) Our approach bypasses the need for extensive model retraining and retrieval model overhead, offering granular attributions and preserving the quality of generated answers. We present Verifiability-granular, an attribution dataset which has token level annotations for LLM generations in the contextual question answering setup.
arXiv Detail & Related papers (2024-05-28T09:12:44Z)
RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Models (LLMs) representations for evaluation. Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
We propose a new evaluation method, SQC-Score. Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score. Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as textscLlama-2 and textscMistral. This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting. CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores [23.568883428947494]
We investigate whether prominent LM-based evaluation metrics demonstrate a favorable bias toward their respective underlying LMs in the context of summarization tasks. Our findings unveil a latent bias, particularly pronounced when such evaluation metrics are used in a reference-free manner without leveraging gold summaries. These results underscore that assessments provided by generative evaluation models can be influenced by factors beyond the inherent text quality.
arXiv Detail & Related papers (2023-11-16T10:43:26Z)
Sequencing Matters: A Generate-Retrieve-Generate Model for Building Conversational Agents [9.191944519634111]
The Georgetown InfoSense group has done in regard to solving the challenges presented by TREC iKAT 2023. Our submitted runs outperform the median runs by a significant margin, exhibiting superior performance in nDCG across various cut numbers and in overall success rate. Our solution involves the use of Large Language Models (LLMs) for initial answers, answer grounding by BM25, passage quality filtering by logistic regression, and answer generation by LLMs again.
arXiv Detail & Related papers (2023-11-16T02:37:58Z)
BLESS: Benchmarking Large Language Models on Sentence Simplification [55.461555829492866]
We present BLESS, a performance benchmark of the most recent state-of-the-art large language models (LLMs) on the task of text simplification (TS) We assess a total of 44 models, differing in size, architecture, pre-training methods, and accessibility, on three test sets from different domains (Wikipedia, news, and medical) under a few-shot setting. Our evaluation indicates that the best LLMs, despite not being trained on TS, perform comparably with state-of-the-art TS baselines.
arXiv Detail & Related papers (2023-10-24T12:18:17Z)
Enabling Large Language Models to Generate Text with Citations [37.64884969997378]
Large language models (LLMs) have emerged as a widely-used tool for information seeking. Our aim is to allow LLMs to generate text with citations, improving their factual correctness and verifiability. We propose ALCE, the first benchmark for Automatic LLMs' Citation Evaluation.
arXiv Detail & Related papers (2023-05-24T01:53:49Z)
Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal. Most of the automatic evaluation methods like BLUE/ROUGE may be not able to adequately capture the above dimensions. We propose a new evaluation framework based on LLMs, which provides a comprehensive evaluation framework by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood. We find instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.