Dancing Between Success and Failure: Edit-level Simplification Evaluation using SALSA
- URL: http://arxiv.org/abs/2305.14458v2
- Date: Sun, 22 Oct 2023 18:25:46 GMT
- Title: Dancing Between Success and Failure: Edit-level Simplification Evaluation using SALSA
- Authors: David Heineman, Yao Dou, Mounica Maddela, Wei Xu
- Abstract summary: We introduce SALSA, an edit-based human annotation framework.
We develop twenty-one linguistically grounded edit types, covering the full spectrum of success and failure.
We develop LENS-SALSA, a reference-free automatic simplification metric, trained to predict sentence- and word-level quality simultaneously.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models (e.g., GPT-4) are uniquely capable of producing highly
rated text simplification, yet current human evaluation methods fail to provide
a clear understanding of systems' specific strengths and weaknesses. To address
this limitation, we introduce SALSA, an edit-based human annotation framework
that enables holistic and fine-grained text simplification evaluation. We
develop twenty-one linguistically grounded edit types, covering the full
spectrum of success and failure across dimensions of conceptual, syntactic and
lexical simplicity. Using SALSA, we collect 19K edit annotations on 840
simplifications, revealing discrepancies in the distribution of simplification
strategies performed by fine-tuned models, prompted LLMs and humans, and find
GPT-3.5 performs more quality edits than humans, but still exhibits frequent
errors. Using our fine-grained annotations, we develop LENS-SALSA, a
reference-free automatic simplification metric, trained to predict sentence-
and word-level quality simultaneously. Additionally, we introduce word-level
quality estimation for simplification and report promising baseline results.
Our data, new metric, and annotation toolkit are available at
https://salsa-eval.com.
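As a rough illustration of the edit-level view described in the abstract, the sketch below shows one way such annotations could be represented and rolled up into a sentence-level judgment. All class, field, and label names here are hypothetical placeholders, not the released SALSA toolkit or the LENS-SALSA metric; see https://salsa-eval.com for the actual data, metric, and annotation tool.

```python
# Illustrative sketch only: the class, field, and label names below are
# hypothetical placeholders, not the released SALSA toolkit or LENS-SALSA API
# (see https://salsa-eval.com for the actual data, metric, and annotation tool).
from dataclasses import dataclass
from typing import List, Literal, Tuple

# SALSA covers conceptual, syntactic, and lexical dimensions of simplicity,
# each with successful (quality) and failed (error) edits.
Dimension = Literal["conceptual", "syntactic", "lexical"]
Outcome = Literal["quality", "error"]

@dataclass
class EditAnnotation:
    """One span-level edit identified between a complex sentence and its simplification."""
    dimension: Dimension          # which axis of simplicity the edit targets
    outcome: Outcome              # whether the edit helps (quality) or hurts (error)
    edit_type: str                # one of the linguistically grounded edit types
    source_span: Tuple[int, int]  # character offsets in the complex sentence
    target_span: Tuple[int, int]  # character offsets in the simplified sentence
    severity: int                 # annotator-rated impact, e.g. 1 (minor) to 3 (major)

def sentence_score(edits: List[EditAnnotation]) -> float:
    """Toy aggregate: reward quality edits, penalize errors by severity.

    A trained metric such as LENS-SALSA learns sentence- and word-level
    quality jointly from annotated data; this hand-written heuristic only
    illustrates how edit-level labels could roll up into a sentence score.
    """
    return float(sum(e.severity if e.outcome == "quality" else -e.severity
                     for e in edits))

if __name__ == "__main__":
    example = [
        EditAnnotation("lexical", "quality", "paraphrase", (10, 24), (10, 18), 2),
        EditAnnotation("conceptual", "error", "information_loss", (30, 55), (25, 25), 3),
    ]
    print(sentence_score(example))  # -1.0: the deletion error outweighs the paraphrase
```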
Related papers
- Localizing Factual Inconsistencies in Attributable Text Generation
We introduce QASemConsistency, a new formalism for localizing factual inconsistencies in attributable text generation.
We first demonstrate the effectiveness of the QASemConsistency methodology for human annotation.
We then implement several methods for automatically detecting localized factual inconsistencies.
arXiv Detail & Related papers (2024-10-09T22:53:48Z)
- Analysing Zero-Shot Readability-Controlled Sentence Simplification
We investigate how different types of contextual information affect a model's ability to generate sentences with the desired readability.
Results show that all tested models struggle to simplify sentences due to models' limitations and characteristics of the source sentences.
Our experiments also highlight the need for better automatic evaluation metrics tailored to RCTS.
arXiv Detail & Related papers (2024-09-30T12:36:25Z)
- Detecting Errors through Ensembling Prompts (DEEP): An End-to-End LLM Framework for Detecting Factual Errors
We propose an end-to-end framework for detecting factual errors in text summarization.
Our framework uses a diverse set of LLM prompts to identify factual inconsistencies.
We calibrate the ensembled models to produce empirically accurate probabilities that a text is factually consistent or free of hallucination.
arXiv Detail & Related papers (2024-06-18T18:59:37Z)
- An In-depth Evaluation of GPT-4 in Sentence Simplification with Error-based Human Assessment
We design an error-based human annotation framework to assess GPT-4's simplification capabilities.
Results show that GPT-4 generally generates fewer erroneous simplification outputs compared to the current state-of-the-art.
arXiv Detail & Related papers (2024-03-08T00:19:24Z)
- Retrieval is Accurate Generation
We introduce a novel method that selects context-aware phrases from a collection of supporting documents.
Our model achieves the best performance and the lowest latency among several retrieval-augmented baselines.
arXiv Detail & Related papers (2024-02-27T14:16:19Z)
- An LLM-Enhanced Adversarial Editing System for Lexical Simplification
Lexical Simplification aims to simplify text at the lexical level.
Existing methods rely heavily on annotated data.
We propose a novel LS method without parallel corpora.
arXiv Detail & Related papers (2024-02-22T17:04:30Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
LLMs with a reasonable prompt and their generative capability can even correct tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- Active Learning for Abstractive Text Summarization
We propose the first effective query strategy for Active Learning in abstractive text summarization.
We show that using our strategy in AL annotation helps to improve the model performance in terms of ROUGE and consistency scores.
arXiv Detail & Related papers (2023-01-09T10:33:14Z)
- Few-shot Instruction Prompts for Pretrained Language Models to Detect Social Biases
We propose a few-shot instruction-based method for prompting pre-trained language models (LMs).
We show that large LMs can detect different types of fine-grained biases with similar and sometimes superior accuracy to fine-tuned models.
arXiv Detail & Related papers (2021-12-15T04:19:52Z)