SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text
- URL: http://arxiv.org/abs/2411.16077v1
- Date: Mon, 25 Nov 2024 04:07:16 GMT
- Title: SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text
- Authors: Reshmi Ghosh, Tianyi Yao, Lizzy Chen, Sadid Hasan, Tianwei Chen, Dario Bernal, Huitian Jiao, H M Sajjad Hossain,
- Abstract summary: This paper identifies the need to develop robust evaluation approaches for natural language generation in settings where references/ground-truth labels do not exist or are not amply available.
We show that the critiquing Agent is able to rectify scores from LLM evaluators, thereby reducing the need for labeled data even for complex NLG evaluation scenarios.
- Score: 0.848663031844483
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Model (LLM) integrations into applications like the Microsoft365 suite and Google Workspace for creating/processing documents, emails, presentations, etc. have led to considerable enhancements in productivity and time savings. But as these integrations become more complex, it is paramount to ensure that the output of LLM-integrated applications is relevant and appropriate for use. Identifying the need to develop robust evaluation approaches for natural language generation where references/ground-truth labels do not exist or are not amply available, this paper introduces a novel framework called "SAGEval", which utilizes a critiquing Agent to provide feedback on scores generated by LLM evaluators. We show that the critiquing Agent is able to rectify scores from LLM evaluators in the absence of references/ground-truth labels, thereby reducing the need for labeled data even for complex NLG evaluation scenarios, like the generation of JSON-structured forms/surveys with responses in different styles such as multiple choice, Likert ratings, and single choice questions.
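As a rough illustration of the setup the abstract describes, the following Python sketch shows an LLM evaluator scoring a JSON-structured survey and a critiquing Agent rectifying those scores. The survey schema, aspect list, and function names are illustrative assumptions, not SAGEval's actual prompts or interface.

```python
# Minimal sketch of an evaluator-plus-critiquing-agent loop in the spirit of
# SAGEval. All function names, prompts, and the survey schema below are
# illustrative assumptions, not the paper's actual implementation.
import json

# Example of the kind of reference-free artifact being scored: a JSON-structured
# survey mixing multiple-choice, Likert, and single-choice items.
survey = {
    "title": "Team onboarding feedback",
    "questions": [
        {"type": "multiple_choice", "text": "Which resources did you use?",
         "options": ["Wiki", "Mentor", "Recorded trainings"]},
        {"type": "likert", "text": "Onboarding documentation was clear.",
         "scale": ["Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree"]},
        {"type": "single_choice", "text": "Would you recommend the process?",
         "options": ["Yes", "No"]},
    ],
}

ASPECTS = ["relevance", "coherence", "formatting", "coverage"]

def llm_evaluate(artifact: dict, aspects: list[str]) -> dict:
    """Primary LLM evaluator: returns an aspect -> score mapping.
    Stubbed here; in practice this would prompt an LLM judge."""
    prompt = f"Score this survey on {aspects} (1-5):\n{json.dumps(artifact)}"
    return {a: 3 for a in aspects}  # placeholder scores

def critique_scores(artifact: dict, scores: dict) -> dict:
    """Critiquing agent: reviews the evaluator's scores, rectifies ones it
    judges unjustified, and returns adjusted scores plus a rationale."""
    # Stub: in practice a second LLM call that sees the artifact and the scores.
    return {"adjusted_scores": scores, "rationale": "Scores appear consistent."}

initial = llm_evaluate(survey, ASPECTS)
final = critique_scores(survey, initial)
print(json.dumps(final, indent=2))
```

In practice both `llm_evaluate` and `critique_scores` would be LLM calls; the point of the pattern is that the critic sees the artifact together with the first-pass scores and can adjust them without any reference output.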
Related papers
- Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol [83.90769864167301]
Literature review tables are essential for summarizing and comparing collections of scientific papers.
We explore the task of generating tables that best fulfill a user's informational needs given a collection of scientific papers.
Our contributions focus on three key challenges encountered in real-world use: (i) User prompts are often under-specified; (ii) Retrieved candidate papers frequently contain irrelevant content; and (iii) Task evaluation should move beyond shallow text similarity techniques.
arXiv Detail & Related papers (2025-04-14T14:52:28Z) - TALE: A Tool-Augmented Framework for Reference-Free Evaluation of Large Language Models [16.857263524133284]
Large Language Models (LLMs) are increasingly integrated into real-world, autonomous applications.
Relying on static, pre-annotated references for evaluation poses significant challenges in cost, scalability, and completeness.
We propose Tool-Augmented LLM Evaluation (TALE), a framework to assess LLM outputs without predetermined ground-truth answers.
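A minimal, hypothetical sketch of the tool-augmented pattern described here: retrieve evidence with a tool, then let a judge score the output against that evidence. The tool and judge stubs below are assumptions, not TALE's actual components.

```python
# Hypothetical sketch of tool-augmented, reference-free judging: gather
# external evidence with a tool, then score the model output against it.
def search_tool(query: str) -> list[str]:
    """Stand-in for a real retrieval/search tool call."""
    return ["Canberra is the capital city of Australia."]

def judge(question: str, answer: str, evidence: list[str]) -> dict:
    """Stand-in for an LLM judge that scores an answer against evidence."""
    supported = any(answer.lower() in e.lower() for e in evidence)
    return {"score": 1.0 if supported else 0.0,
            "justification": "Checked against retrieved evidence."}

question = "What is the capital of Australia?"
model_answer = "Canberra"
evidence = search_tool(question)
print(judge(question, model_answer, evidence))
```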
arXiv Detail & Related papers (2025-04-10T02:08:41Z) - Rubric Is All You Need: Enhancing LLM-based Code Evaluation With Question-Specific Rubrics [1.3707925738322797]
We focus on LLM-based code evaluation and attempt to fill in the existing gaps.
We propose novel multi-agent approaches using question-specific rubrics tailored to the problem statement.
Our comprehensive analysis demonstrates that question-specific rubrics significantly enhance logical assessment of code in educational settings.
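For intuition, here is a hedged sketch of rubric-driven code grading: each question-specific criterion is scored separately and the points are aggregated. The rubric format and per-criterion scorer are illustrative, not the paper's framework.

```python
# Illustrative sketch of rubric-driven code grading: a question-specific
# rubric is scored criterion by criterion (here by a stub instead of an LLM).
rubric = [
    {"criterion": "Handles the empty-list edge case", "points": 2},
    {"criterion": "Uses a single pass over the input", "points": 3},
    {"criterion": "Returns the correct maximum subarray sum", "points": 5},
]

def score_criterion(code: str, criterion: str) -> bool:
    """Stand-in for an LLM call that judges one rubric item against the code."""
    return "for" in code  # placeholder heuristic, not a real judgment

student_code = (
    "def max_subarray(xs):\n"
    "    best = cur = xs[0]\n"
    "    for x in xs[1:]:\n"
    "        cur = max(x, cur + x)\n"
    "        best = max(best, cur)\n"
    "    return best"
)
earned = sum(item["points"] for item in rubric if score_criterion(student_code, item["criterion"]))
print(f"{earned} / {sum(item['points'] for item in rubric)}")
```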
arXiv Detail & Related papers (2025-03-31T11:59:43Z) - WritingBench: A Comprehensive Benchmark for Generative Writing [87.48445972563631]
We present WritingBench, a benchmark designed to evaluate large language models (LLMs) across 6 core writing domains and 100 subdomains, encompassing creative, persuasive, informative, and technical writing.
We propose a query-dependent evaluation framework that empowers LLMs to dynamically generate instance-specific assessment criteria.
This framework is complemented by a fine-tuned critic model for criteria-aware scoring, enabling evaluations in style, format and length.
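A small sketch of the query-dependent pattern, under the assumption that evaluation proceeds in two stubbed steps: derive instance-specific criteria from the writing prompt, then have a critic score the response per criterion. The criteria and scores shown are placeholders.

```python
# Sketch of query-dependent evaluation: first derive instance-specific
# criteria from the writing prompt, then score the response per criterion.
def derive_criteria(writing_prompt: str) -> list[str]:
    """Stand-in for an LLM call that proposes criteria tailored to the prompt."""
    return ["Follows the requested format", "Matches the requested tone",
            "Stays within the requested length"]

def score_response(response: str, criterion: str) -> int:
    """Stand-in for a critic model returning a 1-5 score for one criterion."""
    return 4

prompt = "Write a 100-word formal apology email to a customer."
response = "Dear customer, ..."
criteria = derive_criteria(prompt)
scores = {c: score_response(response, c) for c in criteria}
print(scores, "overall:", sum(scores.values()) / len(scores))
```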
arXiv Detail & Related papers (2025-03-07T08:56:20Z) - LatteReview: A Multi-Agent Framework for Systematic Review Automation Using Large Language Models [0.0]
LatteReview is a Python-based framework that leverages large language models (LLMs) and multi-agent systems to automate key elements of the systematic review process.
The framework supports features such as Retrieval-Augmented Generation (RAG) for incorporating external context, multimodal reviews, Pydantic-based validation for structured inputs and outputs, and asynchronous programming for handling large-scale datasets.
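To make the Pydantic-plus-asyncio combination concrete, here is a hedged sketch of structured, concurrently executed review decisions; the schema and reviewer function are assumptions rather than LatteReview's real API.

```python
# Illustrative sketch of Pydantic-validated, asynchronous review agents, the
# general pattern the LatteReview description mentions.
import asyncio
from pydantic import BaseModel, Field

class ReviewDecision(BaseModel):
    paper_id: str
    include: bool
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str

async def review_paper(paper_id: str, abstract: str) -> ReviewDecision:
    # Stand-in for an async LLM call returning JSON; validated on parse.
    raw = ('{"paper_id": "%s", "include": true, '
           '"confidence": 0.8, "reasoning": "Matches inclusion criteria."}' % paper_id)
    return ReviewDecision.model_validate_json(raw)

async def main() -> None:
    abstracts = {"p1": "LLM-based evaluation ...", "p2": "Protein folding ..."}
    decisions = await asyncio.gather(
        *(review_paper(pid, text) for pid, text in abstracts.items())
    )
    for d in decisions:
        print(d.paper_id, d.include, d.confidence)

asyncio.run(main())
```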
arXiv Detail & Related papers (2025-01-05T17:53:00Z) - DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing [12.555427275787174]
We present DocETL, a system that optimizes complex document processing pipelines.
DocETL offers a declarative interface for users to define such pipelines and uses an agent-based framework to automatically optimize them.
We show that DocETL finds plans with outputs that are 1.34 to 4.6 times higher quality than well-engineered baselines.
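As a toy illustration of a declarative pipeline plus an agent-style rewrite, the sketch below splits a long-document map into chunked map-and-merge steps; the operation names and rewrite rule are assumptions, not DocETL's actual interface.

```python
# Toy sketch of a declarative document pipeline and a single agent-style
# rewrite, loosely in the spirit of the DocETL description.
pipeline = [
    {"op": "map", "prompt": "Extract all complaints from this document."},
    {"op": "reduce", "prompt": "Summarize the complaints by theme."},
]

def rewrite_pipeline(plan: list[dict]) -> list[dict]:
    """Hypothetical optimizer: split a broad map over long documents into a
    chunked map followed by a merge, which an agent would then validate."""
    optimized = []
    for step in plan:
        if step["op"] == "map":
            optimized.append({"op": "split", "chunk_tokens": 2000})
            optimized.append(step)
            optimized.append({"op": "merge", "prompt": "Combine partial extractions."})
        else:
            optimized.append(step)
    return optimized

for step in rewrite_pipeline(pipeline):
    print(step)
```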
arXiv Detail & Related papers (2024-10-16T03:22:35Z) - RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [69.4501863547618]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios.
With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance.
Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
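One plausible way to operationalize key-point-style Completeness, Hallucination, and Irrelevance scores is sketched below; this is an illustrative reading of the metric names, not necessarily RAGEval's exact definitions.

```python
# Illustrative keypoint-style metrics: each reference key point is judged as
# covered, contradicted, or missing (e.g., by an LLM judge), then aggregated.
from collections import Counter

def keypoint_metrics(labels: list[str]) -> dict:
    """`labels` holds one judgment per reference key point:
    'covered', 'contradicted', or 'missing'."""
    counts = Counter(labels)
    total = max(len(labels), 1)
    return {
        "completeness": counts["covered"] / total,
        "hallucination": counts["contradicted"] / total,
        "irrelevance": counts["missing"] / total,
    }

print(keypoint_metrics(["covered", "covered", "contradicted", "missing"]))
```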
arXiv Detail & Related papers (2024-08-02T13:35:11Z) - DCA-Bench: A Benchmark for Dataset Curation Agents [9.60250892491588]
We propose a dataset curation agent benchmark, DCA-Bench, to measure large language models' capability of detecting hidden dataset quality issues.
Specifically, we collect diverse real-world dataset quality issues from eight open dataset platforms as a testbed.
The proposed benchmark can also serve as a testbed for measuring the capability of LLMs in problem discovery rather than just problem-solving.
arXiv Detail & Related papers (2024-06-11T14:02:23Z) - RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Model (LLM) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
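A minimal sketch of the underlying idea, scoring text by projecting a (stubbed) LLM representation onto a learned quality direction; the embedding function and direction here are placeholders, not RepEval's trained components.

```python
# Sketch of scoring text by projecting a hidden representation onto a learned
# direction; embeddings are faked with seeded random vectors for illustration.
import numpy as np

rng = np.random.default_rng(0)
DIM = 16

def embed(text: str) -> np.ndarray:
    """Stand-in for an LLM's final-layer representation of `text`."""
    local = np.random.default_rng(abs(hash(text)) % (2**32))
    return local.normal(size=DIM)

# A direction separating "good" from "bad" text, e.g. fit on a few labeled pairs.
quality_direction = rng.normal(size=DIM)
quality_direction /= np.linalg.norm(quality_direction)

def quality_score(text: str) -> float:
    return float(embed(text) @ quality_direction)

print(quality_score("A fluent, on-topic response."))
print(quality_score("asdf qwer unrelated tokens"))
```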
arXiv Detail & Related papers (2024-04-30T13:50:55Z) - TnT-LLM: Text Mining at Scale with Large Language Models [24.731544646232962]
Large Language Models (LLMs) automate the process of end-to-end label generation and assignment with minimal human effort.
We show that TnT-LLM generates more accurate and relevant labels when compared against state-of-the-art baselines.
We also share our practical experiences and insights on the challenges and opportunities of using LLMs for large-scale text mining in real-world applications.
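The two-phase pattern suggested here, taxonomy induction followed by large-scale label assignment, can be sketched as follows; both LLM calls are stubbed and the taxonomy is invented.

```python
# Sketch of a two-phase labeling pipeline: (1) induce a label taxonomy from a
# sample of texts, (2) assign labels at scale with a cheaper model or LLM call.
def induce_taxonomy(sample_texts: list[str]) -> list[str]:
    """Stand-in for an LLM pass that proposes a label taxonomy from examples."""
    return ["billing question", "technical issue", "feature request"]

def assign_label(text: str, taxonomy: list[str]) -> str:
    """Stand-in for a classifier or LLM call applying the taxonomy."""
    return taxonomy[0] if "charge" in text.lower() else taxonomy[1]

corpus = ["I was charged twice this month.", "The app crashes on startup."]
taxonomy = induce_taxonomy(corpus)
labels = {text: assign_label(text, taxonomy) for text in corpus}
print(labels)
```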
arXiv Detail & Related papers (2024-03-18T18:45:28Z) - Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
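For reference, a common way to estimate distribution-level precision and recall in an embedding space uses k-nearest-neighbour support estimates, sketched below with random embeddings; whether this matches the paper's exact estimator is not claimed.

```python
# Sketch of distributional precision/recall: precision = share of generated
# samples landing inside the estimated support of the real data (k-NN balls),
# recall = the converse, measuring coverage/diversity.
import numpy as np

def knn_radius(points: np.ndarray, k: int = 3) -> np.ndarray:
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, k]  # distance to k-th neighbour (0-th is self)

def coverage(candidates: np.ndarray, reference: np.ndarray, k: int = 3) -> float:
    radii = knn_radius(reference, k)
    d = np.linalg.norm(candidates[:, None, :] - reference[None, :, :], axis=-1)
    return float(np.mean((d <= radii[None, :]).any(axis=1)))

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 8))             # embeddings of human text
generated = rng.normal(size=(200, 8)) * 0.8  # embeddings of model text
precision = coverage(generated, real)        # quality
recall = coverage(real, generated)           # diversity
print(f"precision={precision:.2f} recall={recall:.2f}")
```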
arXiv Detail & Related papers (2024-02-16T13:53:26Z) - ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z) - Open-source Large Language Models are Strong Zero-shot Query Likelihood Models for Document Ranking [36.90911173089409]
Large language models (LLMs) have emerged as effective Query Likelihood Models (QLMs).
This paper focuses on investigating the genuine zero-shot ranking effectiveness of recent LLMs.
We introduce a novel state-of-the-art ranking system that integrates LLM-based QLMs with a hybrid zero-shot retriever.
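A hedged sketch of zero-shot query-likelihood scoring with an open-source causal LM: each document is scored by the log-probability the model assigns to the query, conditioned on a prompt containing the document. The prompt template and the small placeholder model are assumptions.

```python
# Sketch of zero-shot query-likelihood ranking with a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder; the paper studies larger open-source LLMs
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def query_loglik(document: str, query: str) -> float:
    prefix = f"Passage: {document}\nPlease write a question based on this passage.\nQuestion:"
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    full_ids = tok(prefix + " " + query, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probs of the query tokens only, each predicted from the previous position.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    start = prefix_ids.shape[1] - 1
    return float(log_probs[start:, :].gather(1, targets[start:, None]).sum())

docs = ["The Eiffel Tower is in Paris.", "Photosynthesis converts light to energy."]
query = "Where is the Eiffel Tower located?"
ranked = sorted(docs, key=lambda d: query_loglik(d, query), reverse=True)
print(ranked[0])
```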
arXiv Detail & Related papers (2023-10-20T02:54:42Z) - Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References [123.39034752499076]
Div-Ref is a method to enhance evaluation benchmarks by enriching the number of references.
We conduct experiments to empirically demonstrate that diversifying the expression of reference can significantly enhance the correlation between automatic evaluation and human evaluation.
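The core mechanism can be sketched as scoring a candidate against every diversified reference and keeping the best score; token-level F1 stands in below for whatever base metric is used.

```python
# Sketch of multi-reference scoring: compare the candidate against each
# diversified (e.g., LLM-paraphrased) reference and take the best score.
def token_f1(candidate: str, reference: str) -> float:
    c, r = set(candidate.lower().split()), set(reference.lower().split())
    overlap = len(c & r)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(c), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

def multi_reference_score(candidate: str, references: list[str]) -> float:
    return max(token_f1(candidate, ref) for ref in references)

references = [
    "The meeting was moved to Friday.",
    "They rescheduled the meeting for Friday.",  # diversified paraphrase
]
print(multi_reference_score("The meeting got rescheduled to Friday.", references))
```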
arXiv Detail & Related papers (2023-05-24T11:53:29Z) - Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks.
This paper proposes an LLM-Augmenter system, which augments a black-box LLM with a set of plug-and-play modules.
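A toy sketch of the generate, check against external knowledge, feed back, and regenerate loop that this summary describes; the module names and stubbed calls are assumptions, not the paper's code.

```python
# Toy sketch of a feedback loop: generate -> check against external knowledge
# -> feed criticism back -> regenerate, bounded by a small iteration limit.
KNOWLEDGE = {"capital of australia": "Canberra"}

def generate(prompt: str, feedback: str = "") -> str:
    # Stand-in for a black-box LLM call; feedback is appended to the prompt.
    return "Sydney" if not feedback else "Canberra"

def check_facts(answer: str, topic: str) -> str:
    """Utility module: empty feedback if the answer matches retrieved
    knowledge, otherwise a natural-language correction."""
    expected = KNOWLEDGE.get(topic, "")
    if expected and expected.lower() not in answer.lower():
        return f"The retrieved evidence says the answer should mention {expected}."
    return ""

prompt, topic = "What is the capital of Australia?", "capital of australia"
answer = generate(prompt)
for _ in range(3):  # bounded revision loop
    feedback = check_facts(answer, topic)
    if not feedback:
        break
    answer = generate(prompt, feedback)
print(answer)
```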
arXiv Detail & Related papers (2023-02-24T18:48:43Z)