EditEval: An Instruction-Based Benchmark for Text Improvements
- URL: http://arxiv.org/abs/2209.13331v1
- Date: Tue, 27 Sep 2022 12:26:05 GMT
- Title: EditEval: An Instruction-Based Benchmark for Text Improvements
- Authors: Jane Dwivedi-Yu, Timo Schick, Zhengbao Jiang, Maria Lomeli, Patrick
Lewis, Gautier Izacard, Edouard Grave, Sebastian Riedel, Fabio Petroni
- Abstract summary: This work presents EditEval: an instruction-based benchmark and evaluation suite for the automatic evaluation of editing capabilities.
Evaluating several pre-trained models shows that InstructGPT and PEER perform best, but most baselines fall below the supervised SOTA.
Our analysis shows that commonly used metrics for editing tasks do not always correlate well, and that the prompts yielding the highest performance are not necessarily the most robust across models.
- Score: 73.5918084416016
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Evaluation of text generation to date has primarily focused on content
created sequentially, rather than improvements on a piece of text. Writing,
however, is naturally an iterative and incremental process that requires
expertise in different modular skills such as fixing outdated information or
making the style more consistent. Even so, comprehensive evaluation of a
model's capacity to perform these skills and the ability to edit remains
sparse. This work presents EditEval: an instruction-based benchmark and
evaluation suite that leverages high-quality existing and new datasets for
automatic evaluation of editing capabilities such as making text more cohesive
and paraphrasing. Evaluating several pre-trained models shows that InstructGPT
and PEER perform best, but that most baselines fall below the
supervised SOTA, particularly when neutralizing and updating information. Our
analysis also shows that commonly used metrics for editing tasks do not always
correlate well, and that the prompts yielding the highest performance are not
necessarily the most robust across different models.
Through the release of this benchmark and a publicly available leaderboard
challenge, we hope to unlock future research in developing models capable of
iterative and more controllable editing.
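As a rough, hypothetical illustration of the metric-correlation point raised in the abstract (not EditEval's actual code), the sketch below scores a few candidate edits with two toy metrics, exact match and token-overlap F1, which stand in for the benchmark's real metrics, and checks how well their rankings agree; all data and function names here are invented for the example.

```python
# A minimal sketch (not EditEval's code) of comparing how two editing metrics
# rank the same candidate edits. Low or unstable rank correlation would mirror
# the paper's observation that editing metrics can disagree.
from collections import Counter
from scipy.stats import spearmanr

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the edit matches the reference exactly, else 0.0."""
    return float(prediction.strip() == reference.strip())

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1, a crude proxy for n-gram metrics such as SARI."""
    pred, ref = prediction.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical model outputs for one editing instruction, plus a reference edit.
reference = "The report was published in 2021 and updated in 2023."
candidates = [
    "The report was published in 2021 and updated in 2023.",
    "The report, published in 2021, was updated in 2023.",
    "The report was published in 2021.",
]

em_scores = [exact_match(c, reference) for c in candidates]
f1_scores = [token_f1(c, reference) for c in candidates]

# Spearman rank correlation between the two metrics' scores across candidates.
rho, _ = spearmanr(em_scores, f1_scores)
print(f"exact match:  {em_scores}")
print(f"token F1:     {[round(s, 3) for s in f1_scores]}")
print(f"Spearman rho: {rho:.2f}")
```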
Related papers
- EditBoard: Towards A Comprehensive Evaluation Benchmark for Text-based Video Editing Models [16.045012576543474]
Text-based video editing has emerged as a promising field, enabling precise modifications to videos based on text prompts.
Existing evaluations are limited and inconsistent, typically summarizing overall performance with a single score.
We propose EditBoard, the first comprehensive evaluation benchmark for text-based video editing models.
arXiv Detail & Related papers (2024-09-15T08:43:18Z) - Improving embedding with contrastive fine-tuning on small datasets with expert-augmented scores [12.86467344792873]
The proposed method uses soft labels derived from expert-augmented scores to fine-tune embedding models.
The paper evaluates the method using a Q&A dataset from an online shopping website and eight expert models.
arXiv Detail & Related papers (2024-08-19T01:59:25Z) - Retrieval is Accurate Generation [99.24267226311157]
We introduce a novel method that selects context-aware phrases from a collection of supporting documents.
Our model achieves the best performance and the lowest latency among several retrieval-augmented baselines.
arXiv Detail & Related papers (2024-02-27T14:16:19Z) - Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z) - QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights improves the performance of the Llama 2 model by up to 15% relative to baselines.
arXiv Detail & Related papers (2023-11-06T00:21:44Z) - XATU: A Fine-grained Instruction-based Benchmark for Explainable Text Updates [7.660511135287692]
This paper introduces XATU, the first benchmark specifically designed for fine-grained instruction-based explainable text editing.
XATU considers finer-grained text editing tasks of varying difficulty, incorporating lexical, syntactic, semantic, and knowledge-intensive edit aspects.
We demonstrate the effectiveness of instruction tuning and the impact of underlying architecture across various editing tasks.
arXiv Detail & Related papers (2023-09-20T04:58:59Z) - Improving Iterative Text Revision by Learning Where to Edit from Other
Revision Tasks [11.495407637511878]
Iterative text revision improves text quality by fixing grammatical errors, rephrasing for better readability or contextual appropriateness, or reorganizing sentence structures throughout a document.
Most recent research has focused on understanding and classifying different types of edits in the iterative revision process from human-written text.
We aim to build an end-to-end text revision system that can iteratively generate helpful edits by explicitly detecting editable spans with their corresponding edit intents.
arXiv Detail & Related papers (2022-12-02T18:10:43Z) - Memory-Based Model Editing at Scale [102.28475739907498]
Existing model editors struggle to accurately model an edit's intended scope.
We propose Semi-Parametric Editing with a Retrieval-Augmented Counterfactual Model (SERAC).
SERAC stores edits in an explicit memory and learns to reason over them to modulate the base model's predictions as needed.
arXiv Detail & Related papers (2022-06-13T23:40:34Z) - Understanding Iterative Revision from Human-Written Text [10.714872525208385]
IteraTeR is the first large-scale, multi-domain, edit-intention annotated corpus of iteratively revised text.
Using IteraTeR, we better understand the text revision process and draw vital connections between edit intentions and writing quality.
arXiv Detail & Related papers (2022-03-08T01:47:42Z) - Text Editing by Command [82.50904226312451]
A prevailing paradigm in neural text generation is one-shot generation, where text is produced in a single step.
We address this limitation with an interactive text generation setting in which the user interacts with the system by issuing commands to edit existing text.
We show that our Interactive Editor, a transformer-based model trained on this dataset, outperforms baselines and obtains positive results in both automatic and human evaluations.
arXiv Detail & Related papers (2020-10-24T08:00:30Z)