Scarecrow: A Framework for Scrutinizing Machine Text
- URL: http://arxiv.org/abs/2107.01294v1
- Date: Fri, 2 Jul 2021 22:37:03 GMT
- Title: Scarecrow: A Framework for Scrutinizing Machine Text
- Authors: Yao Dou, Maxwell Forbes, Rik Koncel-Kedziorski, Noah A. Smith, Yejin Choi
- Abstract summary: We introduce a new structured, crowdsourced error annotation schema called Scarecrow.
Scarecrow collects 13k annotations of 1.3k human- and machine-generated paragraphs of English-language news text.
These findings demonstrate the value of Scarecrow annotations in the assessment of current and future text generation systems.
- Score: 69.26985439191151
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern neural text generation systems can produce remarkably fluent and
grammatical texts. While earlier language models suffered from repetition and
syntactic errors, the errors made by contemporary models are often semantic,
narrative, or discourse failures.
To facilitate research of these complex error types, we introduce a new
structured, crowdsourced error annotation schema called Scarecrow. The error
categories used in Scarecrow -- such as redundancy, commonsense errors, and
incoherence -- were identified by combining expert analysis with several pilot
rounds of ontology-free crowd annotation to arrive at a schema that covers the
error phenomena found in real machine-generated text.
We use Scarecrow to collect 13k annotations of 1.3k human- and machine-generated
paragraphs of English-language news text, amounting to over 41k spans
each labeled with its error category, severity, a natural language explanation,
and antecedent span (where relevant). We collect annotations for text generated
by state-of-the-art systems with varying known performance levels, from GPT-2
Small through the largest GPT-3. We isolate several factors for detailed
analysis, including parameter count, training data, and decoding technique. Our
results show both expected and surprising differences across these settings.
These findings demonstrate the value of Scarecrow annotations in the assessment
of current and future text generation systems. We release our complete
annotation toolkit and dataset at https://yao-dou.github.io/scarecrow/.
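The abstract lists what each annotated span carries: an error category, a severity, a natural language explanation, and an antecedent span where relevant. A minimal sketch of how such a record might be represented is shown below. The class name, field names, character-offset convention, and integer severity scale are all assumptions made for illustration; they are not taken from the released Scarecrow toolkit or dataset format.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SpanAnnotation:
    """One annotated span. All field names are hypothetical, not the dataset's."""
    start: int          # character offset where the span begins (inclusive)
    end: int            # character offset where the span ends (exclusive)
    category: str       # e.g. "redundancy", "commonsense", "incoherence"
    severity: int       # assumed ordinal scale; the real scale may differ
    explanation: str    # annotator's natural language explanation
    antecedent: Optional[Tuple[int, int]] = None  # earlier span, where relevant

    def text(self, paragraph: str) -> str:
        """Return the substring of the paragraph covered by this span."""
        return paragraph[self.start:self.end]


def category_counts(spans):
    """Count how many spans fall into each error category."""
    return Counter(span.category for span in spans)


# Toy paragraph (invented for this sketch): the second sentence repeats the first.
sentence = "The mayor opened the bridge on Monday."
paragraph = sentence + " " + sentence
second_start = len(sentence) + 1

spans = [
    SpanAnnotation(
        start=second_start,
        end=second_start + len(sentence),
        category="redundancy",
        severity=2,
        explanation="Repeats the first sentence.",
        antecedent=(0, len(sentence)),
    ),
]

print(spans[0].text(paragraph))
print(category_counts(spans))
```

Keeping the antecedent optional mirrors the "where relevant" wording in the abstract: a category such as redundancy points back at earlier text, while other categories may not need it.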
Related papers
- Harlequin: Color-driven Generation of Synthetic Data for Referring Expression Comprehension [4.164728134421114]
Referring Expression Comprehension (REC) aims to identify a particular object in a scene by a natural language expression, and is an important topic in visual language understanding.
State-of-the-art methods for this task are based on deep learning, which generally requires expensive and manually labeled annotations.
We propose a novel framework that generates artificial data for the REC task, taking into account both textual and visual modalities.
arXiv Detail & Related papers (2024-11-22T09:08:36Z)
- MISMATCH: Fine-grained Evaluation of Machine-generated Text with Mismatch Error Types [68.76742370525234]
We propose a new evaluation scheme to model human judgments in 7 NLP tasks, based on the fine-grained mismatches between a pair of texts.
Inspired by the recent efforts in several NLP tasks for fine-grained evaluation, we introduce a set of 13 mismatch error types.
We show that the mismatch errors between the sentence pairs on the held-out datasets from 7 NLP tasks align well with the human evaluation.
arXiv Detail & Related papers (2023-06-18T01:38:53Z)
- Towards Fine-Grained Information: Identifying the Type and Location of Translation Errors [80.22825549235556]
Existing approaches cannot consider error position and type simultaneously.
We build an FG-TED model to predict addition and omission errors.
Experiments show that our model can identify both error type and position concurrently, and gives state-of-the-art results.
arXiv Detail & Related papers (2023-02-17T16:20:33Z)
- On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores.
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
arXiv Detail & Related papers (2022-12-20T06:24:25Z)
- Annotation Error Detection: Analyzing the Past and Present for a More Coherent Future [63.99570204416711]
We reimplement 18 methods for detecting potential annotation errors and evaluate them on 9 English datasets.
We define a uniform evaluation setup including a new formalization of the annotation error detection task.
We release our datasets and implementations in an easy-to-use and open source software package.
arXiv Detail & Related papers (2022-06-05T22:31:45Z)
- Neural Text Generation with Artificial Negative Examples [7.187858820534111]
We propose to suppress an arbitrary type of error by training the text generation model in a reinforcement learning framework.
We use a trainable reward function that is capable of discriminating between references and sentences containing the targeted type of errors.
The experimental results show that our method can suppress the generation errors and achieve significant improvements on two machine translation and two image captioning tasks.
arXiv Detail & Related papers (2020-12-28T07:25:10Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.