CoUDA: Coherence Evaluation via Unified Data Augmentation
- URL: http://arxiv.org/abs/2404.00681v1
- Date: Sun, 31 Mar 2024 13:19:36 GMT
- Title: CoUDA: Coherence Evaluation via Unified Data Augmentation
- Authors: Dawei Zhu, Wenhao Wu, Yifan Song, Fangwei Zhu, Ziqiang Cao, Sujian Li
- Abstract summary: Coherence evaluation aims to assess the organization and structure of a discourse.
We take inspiration from the linguistic theory of discourse structure, and propose a data augmentation framework named CoUDA.
With only 233M parameters, CoUDA achieves state-of-the-art performance in both pointwise scoring and pairwise ranking tasks.
- Score: 49.37157483044349
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Coherence evaluation aims to assess the organization and structure of a discourse, which remains challenging even in the era of large language models. Due to the scarcity of annotated data, data augmentation is commonly used to train coherence evaluation models. However, previous augmentations for this task rely primarily on heuristic rules and lack design criteria as guidance. In this paper, we take inspiration from the linguistic theory of discourse structure and propose a data augmentation framework named CoUDA. CoUDA breaks discourse coherence down into global and local aspects and designs augmentation strategies for each. For local coherence in particular, we propose a novel generative strategy for constructing augmentation samples, which involves post-pretraining a generative model and applying two controlling mechanisms to control the difficulty of the generated samples. During inference, CoUDA jointly evaluates the global and local aspects to comprehensively assess the overall coherence of a discourse. Extensive experiments in coherence evaluation show that, with only 233M parameters, CoUDA achieves state-of-the-art performance on both pointwise scoring and pairwise ranking tasks, even surpassing recent GPT-3.5 and GPT-4 based metrics.
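As a rough illustration of the joint inference step described above, the following sketch interpolates a global, whole-discourse score with averaged local scores over adjacent sentence pairs. The class name, scoring interfaces, and interpolation weight are all assumptions for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of CoUDA-style joint inference: interpolate a
# whole-discourse (global) score with averaged adjacent-pair (local) scores.
# All names and the weighting scheme are assumptions for illustration.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class JointCoherenceScorer:
    global_model: Callable[[str], float]      # scores the whole discourse in [0, 1]
    local_model: Callable[[str, str], float]  # scores an adjacent sentence pair in [0, 1]
    alpha: float = 0.5                        # assumed interpolation weight

    def score(self, sentences: List[str]) -> float:
        g = self.global_model(" ".join(sentences))
        pairs = list(zip(sentences, sentences[1:]))
        l = sum(self.local_model(a, b) for a, b in pairs) / max(len(pairs), 1)
        return self.alpha * g + (1 - self.alpha) * l
```

Under this reading, pairwise ranking reduces to comparing score() outputs for two candidate discourses, while pointwise scoring uses the combined value directly.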
Related papers
- Automated Speaking Assessment of Conversation Tests with Novel Graph-based Modeling on Spoken Response Coherence [11.217656140423207]
ASAC aims to evaluate the overall speaking proficiency of an L2 speaker in a setting where an interlocutor interacts with one or more candidates.
We propose a hierarchical graph model that aptly incorporates both broad inter-response interactions and nuanced semantic information.
Extensive experimental results on the NICT-JLE benchmark dataset suggest that our proposed modeling approach can yield considerable improvements in prediction accuracy.
arXiv Detail & Related papers (2024-09-11T07:24:07Z)
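As a loose illustration of message passing over a graph of spoken responses, the sketch below performs one-hop mean aggregation and pools the result for scoring; it is a generic baseline under assumed inputs, not the paper's hierarchical graph model.

```python
# Generic mean-aggregation message passing over a response graph.
# Illustrative only; the paper's hierarchical model is more elaborate.
import numpy as np

def propagate(node_feats: np.ndarray, adj: np.ndarray, rounds: int = 2) -> np.ndarray:
    """Mix each response embedding with the mean of its neighbors, then pool."""
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    h = node_feats
    for _ in range(rounds):
        h = 0.5 * h + 0.5 * (adj @ h) / deg
    return h.mean(axis=0)  # pooled graph representation for proficiency scoring
```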
- MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation [60.65820977963331]
We introduce a novel evaluation paradigm for Large Language Models (LLMs).
This paradigm shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation.
By applying this paradigm in the GSM8K dataset, we have developed the MR-GSM8K benchmark.
arXiv Detail & Related papers (2023-12-28T15:49:43Z)
- Coherent Entity Disambiguation via Modeling Topic and Categorical Dependency [87.16283281290053]
Previous entity disambiguation (ED) methods adopt a discriminative paradigm, where prediction is made based on matching scores between mention context and candidate entities.
We propose CoherentED, an ED system equipped with novel designs aimed at enhancing the coherence of entity predictions.
We achieve new state-of-the-art results on popular ED benchmarks, with an average improvement of 1.3 F1 points.
arXiv Detail & Related papers (2023-11-06T16:40:13Z)
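The discriminative paradigm described above reduces to scoring candidate entities against the mention context; a minimal cosine-similarity variant (purely illustrative, not CoherentED's scoring function) might look like:

```python
# Illustrative discriminative entity-disambiguation scoring: rank candidate
# entity embeddings by cosine similarity to the mention-context embedding.
import numpy as np

def rank_candidates(mention_vec: np.ndarray, candidate_vecs: np.ndarray) -> np.ndarray:
    """Return candidate indices sorted by cosine similarity, best match first."""
    m = mention_vec / np.linalg.norm(mention_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    return np.argsort(-(c @ m))
```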
- A Novel Computational and Modeling Foundation for Automatic Coherence Assessment [13.430637580980164]
Coherence is an essential property of well-written texts that refers to the way textual units relate to one another.
In this work we employ the formal linguistic definition of Reinhart (1980) of what makes a discourse coherent, consisting of three conditions -- cohesion, consistency and relevance -- and formalize these conditions as respective computational tasks.
On two benchmarks for coherence scoring rated by humans, one containing 500 automatically-generated short stories and another containing 4k real-world texts, our experiments confirm that jointly training on the proposed tasks leads to better performance on coherence scoring.
arXiv Detail & Related papers (2023-10-01T07:06:17Z)
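Formalizing cohesion, consistency, and relevance as separate computational tasks suggests a shared encoder trained with a weighted sum of per-task losses; the sketch below assumes classification heads and uniform task weights, neither of which is specified in the summary above.

```python
# Hypothetical multi-task objective over a shared encoder's task heads.
import torch
import torch.nn.functional as F

def joint_coherence_loss(logits: dict, labels: dict, weights=(1.0, 1.0, 1.0)) -> torch.Tensor:
    """Weighted sum of per-task cross-entropy losses (assumed classification heads)."""
    tasks = ("cohesion", "consistency", "relevance")
    return sum(w * F.cross_entropy(logits[t], labels[t]) for w, t in zip(weights, tasks))
```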
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
- NICO++: Towards Better Benchmarking for Domain Generalization [44.11418240848957]
We propose NICO++, a large-scale benchmark with extensive labeled domains.
We show that NICO++ offers superior evaluation capability compared with current DG datasets.
arXiv Detail & Related papers (2022-04-17T15:57:12Z)
- Towards Quantifiable Dialogue Coherence Evaluation [126.55560816209756]
Quantifiable Dialogue Coherence Evaluation (QuantiDCE) is a novel framework aiming to train a quantifiable dialogue coherence metric.
QuantiDCE includes two training stages, Multi-Level Ranking (MLR) pre-training and Knowledge Distillation (KD) fine-tuning.
Experimental results show that the model trained by QuantiDCE presents stronger correlations with human judgements than the other state-of-the-art metrics.
arXiv Detail & Related papers (2021-06-01T14:11:17Z)
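A margin-based reading of multi-level ranking, which pushes higher coherence levels to receive higher metric scores, can be sketched as follows; the loss form and margin value are assumptions, and QuantiDCE's actual MLR objective and KD stage differ in detail.

```python
# Assumed margin-based multi-level ranking loss over coherence levels.
import torch

def multi_level_ranking_loss(scores_by_level, margin: float = 0.1) -> torch.Tensor:
    """scores_by_level: tensors of metric scores ordered from least to most coherent."""
    loss = torch.zeros(())
    for lo, hi in zip(scores_by_level, scores_by_level[1:]):
        # penalize adjacent levels whose mean scores are not separated by the margin
        loss = loss + torch.relu(margin - (hi.mean() - lo.mean()))
    return loss
```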
- Knowledge-based Review Generation by Coherence Enhanced Text Planning [45.473253542837995]
We propose a novel Coherence Enhanced Text Planning model (CETP) based on knowledge graphs (KGs) to improve both global and local coherence for review generation.
For global coherence, we design a hierarchical self-attentive architecture with both subgraph- and node-level attention to enhance the correlations between subgraphs.
Experiments on three datasets confirm the effectiveness of our model in improving the content coherence of generated texts.
arXiv Detail & Related papers (2021-05-09T02:12:05Z)
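Subgraph- and node-level attention can be pictured as two stacked attention-pooling steps; the module below is a generic sketch under that reading, not the CETP implementation.

```python
# Generic two-level attention pooling: nodes -> subgraph vectors -> graph vector.
import torch
import torch.nn as nn

class TwoLevelAttentionPool(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.node_attn = nn.Linear(dim, 1)
        self.subgraph_attn = nn.Linear(dim, 1)

    def _pool(self, x: torch.Tensor, attn: nn.Linear) -> torch.Tensor:
        w = torch.softmax(attn(x), dim=0)  # (n, 1) attention weights over rows
        return (w * x).sum(dim=0)          # weighted sum -> (dim,)

    def forward(self, subgraphs):          # list of (n_i, dim) node-embedding tensors
        sub_vecs = torch.stack([self._pool(s, self.node_attn) for s in subgraphs])
        return self._pool(sub_vecs, self.subgraph_attn)
```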
- Novel Human-Object Interaction Detection via Adversarial Domain Generalization [103.55143362926388]
We study the problem of novel human-object interaction (HOI) detection, aiming at improving the generalization ability of the model to unseen scenarios.
The challenge mainly stems from the large compositional space of objects and predicates, which leads to the lack of sufficient training data for all the object-predicate combinations.
We propose a unified framework of adversarial domain generalization to learn object-invariant features for predicate prediction.
arXiv Detail & Related papers (2020-05-22T22:02:56Z)
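Adversarial domain generalization of this kind is commonly implemented with a gradient reversal layer, so that a domain classifier's gradient drives the feature extractor toward domain-invariant (here, object-invariant) features; the snippet shows that standard building block, which may differ from the paper's exact formulation.

```python
# Standard gradient reversal layer: identity forward, negated gradient backward.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam: float = 1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None
```

In use, features pass through GradReverse.apply(features, lam) before the adversarial classifier head, while the main predicate-prediction path is unaffected.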
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.