DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and
Improvement of Large Language Models
- URL: http://arxiv.org/abs/2401.02132v1
- Date: Thu, 4 Jan 2024 08:34:16 GMT
- Title: DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and
Improvement of Large Language Models
- Authors: Wendi Cui, Jiaxin Zhang, Zhuohang Li, Lopez Damien, Kamalika Das,
Bradley Malin, Sricharan Kumar
- Abstract summary: This work proposes DCR, an automated framework for evaluating and improving the consistency of texts generated by Large Language Models (LLMs).
We introduce an automatic metric converter (AMC) that translates the output from DCE into an interpretable numeric score.
Our approach also reduces output inconsistencies by nearly 90%, showing promise for effective hallucination mitigation.
- Score: 4.953092503184905
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating the quality and variability of text generated by Large Language
Models (LLMs) poses a significant, yet unresolved research challenge.
Traditional evaluation methods, such as ROUGE and BERTScore, which measure
token similarity, often fail to capture the holistic semantic equivalence. This
results in a low correlation with human judgments and intuition, which is
especially problematic in high-stakes applications like healthcare and finance
where reliability, safety, and robust decision-making are highly critical. This
work proposes DCR, an automated framework for evaluating and improving the
consistency of LLM-generated texts using a divide-conquer-reasoning approach.
Unlike existing LLM-based evaluators that operate at the paragraph level, our
method employs a divide-and-conquer evaluator (DCE) that breaks down the
paragraph-to-paragraph comparison between two generated responses into
individual sentence-to-paragraph comparisons, each evaluated based on
predefined criteria. To facilitate this approach, we introduce an automatic
metric converter (AMC) that translates the output from DCE into an
interpretable numeric score. Beyond the consistency evaluation, we further
present a reason-assisted improver (RAI) that leverages the analytical reasons
with explanations identified by DCE to generate new responses aimed at reducing
these inconsistencies. Through comprehensive and systematic empirical analysis,
we show that our approach outperforms state-of-the-art methods by a large
margin (e.g., +19.3% and +24.3% on the SummEval dataset) in evaluating the
consistency of LLM generation across multiple benchmarks in semantic, factual,
and summarization consistency tasks. Our approach also reduces output
inconsistencies by nearly 90%, showing promise for effective hallucination
mitigation.
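
The pipeline described in the abstract lends itself to a short sketch: a divide-and-conquer evaluator (DCE) that splits one response into sentences and asks an LLM to judge each sentence against the other response, an automatic metric converter (AMC) that turns the per-sentence verdicts into a numeric score, and a reason-assisted improver (RAI) that rewrites the response using the reasons attached to the failing sentences. The sketch below is a minimal illustration under stated assumptions: the prompts, the naive sentence splitter, the fraction-of-consistent-sentences scoring rule, and the call_llm helper are placeholders, not the paper's actual implementation.

```python
# Minimal sketch of a DCR-style pipeline (DCE -> AMC -> RAI).
# The prompts, the sentence splitter, the averaging rule, and the
# `call_llm` helper are illustrative assumptions, not the paper's code.
from dataclasses import dataclass


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call; replace with a real client."""
    raise NotImplementedError


@dataclass
class Verdict:
    sentence: str
    consistent: bool
    reason: str


def dce(candidate: str, reference: str) -> list[Verdict]:
    """Divide-and-conquer evaluator: judge each candidate sentence
    against the full reference paragraph."""
    sentences = [s.strip() for s in candidate.split(".") if s.strip()]
    verdicts = []
    for sent in sentences:
        answer = call_llm(
            "Is the following sentence consistent with the reference paragraph?\n"
            f"Sentence: {sent}\nReference: {reference}\n"
            "Answer 'yes' or 'no', then give a one-sentence reason."
        )
        verdicts.append(Verdict(sent, answer.lower().startswith("yes"), answer))
    return verdicts


def amc(verdicts: list[Verdict]) -> float:
    """Automatic metric converter: map per-sentence verdicts to a score in [0, 1]
    (here, simply the fraction of consistent sentences)."""
    if not verdicts:
        return 1.0
    return sum(v.consistent for v in verdicts) / len(verdicts)


def rai(candidate: str, reference: str, verdicts: list[Verdict]) -> str:
    """Reason-assisted improver: regenerate the response using the DCE reasons
    for the sentences flagged as inconsistent."""
    issues = "\n".join(f"- {v.sentence}: {v.reason}" for v in verdicts if not v.consistent)
    if not issues:
        return candidate
    return call_llm(
        "Rewrite the response so it is consistent with the reference.\n"
        f"Reference: {reference}\nResponse: {candidate}\nInconsistencies:\n{issues}"
    )
```

One plausible wiring is to loop dce -> amc -> rai until the score clears a threshold or a round limit is reached; the abstract only states that RAI generates new responses aimed at reducing inconsistencies, so the exact control flow here is an assumption.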
Related papers
- RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [69.4501863547618]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios.
With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance.
Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
- A Comparative Study of Quality Evaluation Methods for Text Summarization [0.5512295869673147]
This paper proposes a novel method based on large language models (LLMs) for evaluating text summarization.
Our results show that LLM-based evaluation aligns closely with human evaluation, while widely-used automatic metrics such as ROUGE-2, BERTScore, and SummaC do not, and also lack consistency.
arXiv Detail & Related papers (2024-06-30T16:12:37Z)
- Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- MATEval: A Multi-Agent Discussion Framework for Advancing Open-Ended Text Evaluation [22.19073789961769]
Recent advances in generative Large Language Models (LLMs) have been remarkable; however, the quality of the text these models generate often reveals persistent issues.
We propose MATEval, a Multi-Agent Text Evaluation framework.
Our framework incorporates self-reflection and Chain-of-Thought strategies, along with feedback mechanisms, to enhance the depth and breadth of the evaluation process.
arXiv Detail & Related papers (2024-03-28T10:41:47Z)
- CheckEval: Robust Evaluation Framework using Large Language Model via Checklist [6.713203569074019]
We introduce CheckEval, a novel evaluation framework using Large Language Models.
CheckEval addresses the challenges of ambiguity and inconsistency in current evaluation methods.
arXiv Detail & Related papers (2024-03-27T17:20:39Z)
- Sentiment Analysis through LLM Negotiations [58.67939611291001]
A standard paradigm for sentiment analysis is to rely on a single LLM and make the decision in a single round.
This paper introduces a multi-LLM negotiation framework for sentiment analysis.
arXiv Detail & Related papers (2023-11-03T12:35:29Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most of the automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture the above dimensions.
We propose a new evaluation framework based on LLMs, which provides a comprehensive evaluation by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- Evaluating and Improving Factuality in Multimodal Abstractive Summarization [91.46015013816083]
We propose CLIPBERTScore to leverage robustness and strong factuality detection performance on image-summary and document-summary pairs.
We show that this simple combination of two metrics in the zero-shot setting achieves higher correlations than existing factuality metrics for document summarization.
Our analysis demonstrates the robustness and high correlation of CLIPBERTScore and its components on four factuality metric-evaluation benchmarks.
arXiv Detail & Related papers (2022-11-04T16:50:40Z)
- TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327]
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks.
Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations.
Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
arXiv Detail & Related papers (2022-04-11T10:14:35Z)
- TISE: A Toolbox for Text-to-Image Synthesis Evaluation [9.092600296992925]
We conduct a study on state-of-the-art methods for single- and multi-object text-to-image synthesis.
We propose a common framework for evaluating these methods.
arXiv Detail & Related papers (2021-12-02T16:39:35Z)