CoheSentia: A Novel Benchmark of Incremental versus Holistic Assessment
of Coherence in Generated Texts
- URL: http://arxiv.org/abs/2310.16329v1
- Date: Wed, 25 Oct 2023 03:21:20 GMT
- Title: CoheSentia: A Novel Benchmark of Incremental versus Holistic Assessment
of Coherence in Generated Texts
- Authors: Aviya Maimon and Reut Tsarfaty
- Abstract summary: We introduce CoheSentia, a novel benchmark of human-perceived coherence of automatically generated texts.
Our benchmark contains 500 automatically-generated and human-annotated paragraphs, each annotated with both methods.
Our analysis shows that the inter-annotator agreement in the incremental mode is higher than in the holistic alternative.
- Score: 15.866519123942457
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Coherence is a linguistic term that refers to the relations between small
textual units (sentences, propositions), which make the text logically
consistent and meaningful to the reader. With the advances of generative
foundational models in NLP, there is a pressing need to automatically assess
the human-perceived coherence of automatically generated texts. Up until now,
little work has been done on explicitly assessing the coherence of generated
texts and analyzing the factors contributing to (in)coherence. Previous work on
the topic used other tasks, e.g., sentence reordering, as proxies of coherence,
rather than approaching coherence detection head-on. In this paper, we
introduce CoheSentia, a novel benchmark of human-perceived coherence of
automatically generated texts. Our annotation protocol reflects two
perspectives: one is global, assigning a single coherence score, and the other
is incremental, scoring sentence by sentence. The incremental method produces
an (in)coherence score for each text fragment and also pinpoints reasons for
incoherence at that point. Our benchmark contains 500 automatically-generated
and human-annotated paragraphs, each annotated with both methods by multiple
raters. Our analysis shows that the inter-annotator agreement in the
incremental mode is higher than in the holistic alternative, and our
experiments show that standard LMs fine-tuned for coherence detection show
varied performance on the different factors contributing to (in)coherence. All
in all, these models yield unsatisfactory performance, emphasizing the need for
developing more reliable methods for coherence assessment.
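To make the two annotation perspectives concrete, below is a minimal Python sketch of one way such a record could be represented, pairing a holistic paragraph score with incremental, sentence-by-sentence judgments and incoherence reasons. The field names, score range, and reason labels are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class IncrementalStep:
    sentence_index: int            # prefix grows one sentence at a time
    coherent_so_far: bool          # rater's judgment for the current prefix
    reason: Optional[str] = None   # e.g. "contradiction", "topic shift" (hypothetical labels)

@dataclass
class ParagraphAnnotation:
    paragraph_id: str
    holistic_score: int                                  # single global coherence score
    incremental: List[IncrementalStep] = field(default_factory=list)

    def first_incoherent_sentence(self) -> Optional[int]:
        """Index of the first sentence flagged as breaking coherence, if any."""
        for step in self.incremental:
            if not step.coherent_so_far:
                return step.sentence_index
        return None

# Example usage with made-up values.
ann = ParagraphAnnotation(
    paragraph_id="p001",
    holistic_score=3,
    incremental=[
        IncrementalStep(0, True),
        IncrementalStep(1, True),
        IncrementalStep(2, False, reason="contradiction with sentence 1"),
    ],
)
print(ann.first_incoherent_sentence())  # -> 2
```

The incremental list also localizes where coherence breaks, which the holistic score alone cannot do.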
Related papers
- Localizing Factual Inconsistencies in Attributable Text Generation
We introduce QASemConsistency, a new formalism for localizing factual inconsistencies in attributable text generation.
We first demonstrate the effectiveness of the QASemConsistency methodology for human annotation.
We then implement several methods for automatically detecting localized factual inconsistencies.
arXiv Detail & Related papers (2024-10-09T22:53:48Z)
- BBScore: A Brownian Bridge Based Metric for Assessing Text Coherence
Coherent texts inherently manifest a sequential and cohesive interplay among sentences.
BBScore is a reference-free metric grounded in Brownian bridge theory for assessing text coherence.
arXiv Detail & Related papers (2023-12-28T08:34:17Z)
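As a rough, simplified illustration of the Brownian-bridge idea (not BBScore's actual estimator), the sketch below scores how well intermediate sentence embeddings fit a Gaussian bridge pinned at the first and last sentences; the choice of encoder, the variance scale, and the averaging are all assumptions.

```python
import numpy as np

def brownian_bridge_score(latents: np.ndarray, sigma: float = 1.0) -> float:
    """Average log-density of intermediate sentence latents under a Brownian
    bridge pinned at the first and last latent vectors.

    `latents` has shape (T, d): one vector per sentence from any sentence
    encoder. Higher values mean the trajectory looks more bridge-like.
    """
    T, d = latents.shape
    if T < 3:
        return 0.0
    z0, zT = latents[0], latents[-1]
    logps = []
    for t in range(1, T - 1):
        alpha = t / (T - 1)
        mean = (1 - alpha) * z0 + alpha * zT          # bridge mean at step t
        var = sigma ** 2 * alpha * (1 - alpha) + 1e-8 # bridge variance at step t
        diff = latents[t] - mean
        logp = -0.5 * (np.sum(diff ** 2) / var + d * np.log(2 * np.pi * var))
        logps.append(logp)
    return float(np.mean(logps))

# Toy usage with random "sentence embeddings".
rng = np.random.default_rng(0)
print(brownian_bridge_score(rng.normal(size=(6, 8))))
```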
- Language Model Decoding as Direct Metrics Optimization
Current decoding methods struggle to generate texts that align with human texts across different aspects.
In this work, we frame decoding from a language model as an optimization problem with the goal of strictly matching the expected performance with human texts.
We prove that this induced distribution is guaranteed to improve the perplexity on human texts, which suggests a better approximation to the underlying distribution of human texts.
arXiv Detail & Related papers (2023-10-02T09:35:27Z)
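The core idea of matching expected performance with human texts can be illustrated with a toy, single-feature sketch: reweight candidate texts by an exponential tilt of one automatic score and solve for the tilt weight so the expected score equals the value measured on human text. This is only a simplified illustration under those assumptions, not the paper's formulation.

```python
import numpy as np

def fit_lambda(base_logprobs, feature, human_value, lo=-50.0, hi=50.0, iters=80):
    """Find a scalar `lam` such that q(x) ∝ p(x) * exp(lam * f(x))
    satisfies E_q[f] == human_value (toy one-dimensional version)."""
    def expected_f(lam):
        logq = base_logprobs + lam * feature
        logq -= logq.max()          # numerical stability
        q = np.exp(logq)
        q /= q.sum()
        return float(q @ feature)

    # E_q[f] increases monotonically in lam, so bisection converges.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expected_f(mid) < human_value:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Toy candidate pool: base LM log-probs and one automatic score per candidate.
base_logprobs = np.log(np.array([0.5, 0.3, 0.2]))
feature = np.array([0.2, 0.6, 0.9])
print(round(fit_lambda(base_logprobs, feature, human_value=0.6), 2))
```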
- A Novel Computational and Modeling Foundation for Automatic Coherence Assessment
Coherence is an essential property of well-written texts that refers to the way textual units relate to one another.
In this work we employ the formal linguistic definition of Reinhart (1980) of what makes a discourse coherent, consisting of three conditions -- cohesion, consistency, and relevance -- and formalize these conditions as respective computational tasks.
On two benchmarks for coherence scoring rated by humans, one containing 500 automatically-generated short stories and another containing 4k real-world texts, our experiments confirm that jointly training on the proposed tasks leads to better performance.
arXiv Detail & Related papers (2023-10-01T07:06:17Z)
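A minimal sketch of how the three conditions could be posed as jointly trained classification tasks over a shared text representation. The PyTorch toy encoder, binary labels, and summed loss are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class MultiTaskCoherenceModel(nn.Module):
    """Shared encoder with one classification head per condition
    (cohesion, consistency, relevance)."""

    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # toy bag-of-words encoder
        self.heads = nn.ModuleDict({
            name: nn.Linear(dim, 2)                    # binary: condition holds or not
            for name in ("cohesion", "consistency", "relevance")
        })

    def forward(self, token_ids):
        shared = self.embed(token_ids)
        return {name: head(shared) for name, head in self.heads.items()}

model = MultiTaskCoherenceModel()
loss_fn = nn.CrossEntropyLoss()

# Toy batch: token ids and one binary label per condition.
tokens = torch.randint(0, 1000, (4, 12))
labels = {name: torch.randint(0, 2, (4,)) for name in ("cohesion", "consistency", "relevance")}

logits = model(tokens)
# Joint training: sum the per-task losses and backpropagate once.
loss = sum(loss_fn(logits[name], labels[name]) for name in logits)
loss.backward()
print(float(loss))
```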
- Large Language Models are Diverse Role-Players for Summarization Evaluation
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods like BLEU/ROUGE may not be able to adequately capture the above dimensions.
We propose a new LLM-based framework that provides a comprehensive evaluation by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- SNaC: Coherence Error Detection for Narrative Summarization
We introduce SNaC, a narrative coherence evaluation framework rooted in fine-grained annotations for long summaries.
We develop a taxonomy of coherence errors in generated narrative summaries and collect span-level annotations for 6.6k sentences across 150 book and movie screenplay summaries.
Our work provides the first characterization of coherence errors generated by state-of-the-art summarization models and a protocol for eliciting coherence judgments from crowd annotators.
arXiv Detail & Related papers (2022-05-19T16:01:47Z)
- Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation
Natural language generation (NLG) spans a broad range of tasks, each of which serves for specific objectives.
We propose a unifying perspective based on the nature of information change in NLG tasks.
We develop a family of interpretable metrics that are suitable for evaluating key aspects of different NLG tasks.
arXiv Detail & Related papers (2021-09-14T01:00:42Z)
- Lexically-constrained Text Generation through Commonsense Knowledge Extraction and Injection
We focus on the Commongen benchmark, wherein the aim is to generate a plausible sentence for a given set of input concepts.
We propose strategies for enhancing the semantic correctness of the generated text.
arXiv Detail & Related papers (2020-12-19T23:23:40Z)
- Hierarchical Text Interaction for Rating Prediction
We propose a novel Hierarchical Text Interaction model for rating prediction.
We exploit semantic correlations between each user-item pair at different hierarchies.
Experiments on five real-world datasets demonstrate that HTI outperforms state-of-the-art models by a large margin.
arXiv Detail & Related papers (2020-10-15T09:52:40Z)