Rethinking Coherence Modeling: Synthetic vs. Downstream Tasks
- URL: http://arxiv.org/abs/2004.14626v2
- Date: Sun, 14 Feb 2021 04:47:45 GMT
- Title: Rethinking Coherence Modeling: Synthetic vs. Downstream Tasks
- Authors: Tasnim Mohiuddin, Prathyusha Jwalapuram, Xiang Lin, and Shafiq Joty
- Abstract summary: Coherence models are typically evaluated only on synthetic tasks, which may not be representative of their performance in downstream applications.
We conduct experiments on benchmarking well-known traditional and neural coherence models on synthetic sentence ordering tasks.
Our results demonstrate a weak correlation between the model performances in the synthetic tasks and the downstream applications.
- Score: 15.044192886215887
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although coherence modeling has come a long way in developing novel models,
their evaluation on downstream applications for which they are purportedly
developed has largely been neglected. With the advancements made by neural
approaches in applications such as machine translation (MT), summarization and
dialog systems, the need for coherence evaluation of these tasks is now more
crucial than ever. However, coherence models are typically evaluated only on
synthetic tasks, which may not be representative of their performance in
downstream applications. To investigate how representative the synthetic tasks
are of downstream use cases, we conduct experiments on benchmarking well-known
traditional and neural coherence models on synthetic sentence ordering tasks,
and contrast this with their performance on three downstream applications:
coherence evaluation for MT and summarization, and next utterance prediction in
retrieval-based dialog. Our results demonstrate a weak correlation between the
model performances in the synthetic tasks and the downstream applications,
motivating alternate training and evaluation methods for coherence models.
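To make the abstract's comparison concrete, here is a minimal sketch of how such a correlation analysis could be set up, assuming per-model accuracies on a synthetic sentence-ordering task and per-model scores on each downstream application. All model names and numbers are hypothetical placeholders (not results from the paper), and the rank correlation is computed with scipy's spearmanr.

```python
# Minimal sketch: rank-correlate synthetic-task performance with downstream
# performance across coherence models. Model names and scores below are
# hypothetical placeholders, not results reported in the paper.
from scipy.stats import spearmanr

# Accuracy on a synthetic sentence-ordering (discrimination) task, per model.
synthetic = {
    "entity_grid": 0.84,
    "lexical_graph": 0.81,
    "neural_entity_grid": 0.88,
    "unified_coherence": 0.93,
}

# Scores for the same models on three downstream applications.
downstream = {
    "mt_coherence": {"entity_grid": 0.52, "lexical_graph": 0.55,
                     "neural_entity_grid": 0.50, "unified_coherence": 0.54},
    "summarization_coherence": {"entity_grid": 0.48, "lexical_graph": 0.47,
                                "neural_entity_grid": 0.51,
                                "unified_coherence": 0.49},
    "dialog_next_utterance": {"entity_grid": 0.60, "lexical_graph": 0.58,
                              "neural_entity_grid": 0.63,
                              "unified_coherence": 0.61},
}

models = sorted(synthetic)
synthetic_scores = [synthetic[m] for m in models]

for task, scores in downstream.items():
    rho, p_value = spearmanr(synthetic_scores, [scores[m] for m in models])
    print(f"{task}: Spearman rho = {rho:.2f} (p = {p_value:.2f})")
```

A consistently low or unstable rho across the three downstream tasks is the kind of weak correlation the abstract reports, which is why the authors argue for alternate training and evaluation methods.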
Related papers
- When is an Embedding Model More Promising than Another? [33.540506562970776]
Embedders play a central role in machine learning, projecting any object into numerical representations that can be leveraged to perform various downstream tasks.
The evaluation of embedding models typically depends on domain-specific empirical approaches.
We present a unified approach to evaluate embedders, drawing upon the concepts of sufficiency and informativeness.
arXiv Detail & Related papers (2024-06-11T18:13:46Z)
- Improving the TENOR of Labeling: Re-evaluating Topic Models for Content Analysis [5.757610495733924]
We conduct the first evaluation of neural, supervised, and classical topic models in an interactive task-based setting.
We show that current automated metrics do not provide a complete picture of topic modeling capabilities.
arXiv Detail & Related papers (2024-01-29T17:54:04Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Tapping the Potential of Coherence and Syntactic Features in Neural Models for Automatic Essay Scoring [16.24421485426685]
We propose a novel approach to extract and represent essay coherence features with prompt-learning NSP.
We apply syntactic feature dense embeddings to augment the BERT-based model and achieve the best performance among hybrid methods for AES.
arXiv Detail & Related papers (2022-11-24T02:00:03Z)
- Evaluation of Categorical Generative Models -- Bridging the Gap Between Real and Synthetic Data [18.142397311464343]
We introduce an appropriately scalable evaluation method for generative models.
We consider increasingly large probability spaces, which correspond to increasingly difficult modeling tasks.
We validate our evaluation procedure with synthetic experiments on both synthetic generative models and current state-of-the-art categorical generative models.
arXiv Detail & Related papers (2022-10-28T21:05:25Z)
- Are Neural Topic Models Broken? [81.15470302729638]
We study the relationship between automated and human evaluation of topic models.
We find that neural topic models fare worse in both respects compared to an established classical method.
arXiv Detail & Related papers (2022-10-28T14:38:50Z)
- SynBench: Task-Agnostic Benchmarking of Pretrained Representations using Synthetic Data [78.21197488065177]
Recent success in fine-tuning large models, which are pretrained on broad data at scale, on downstream tasks has led to a significant paradigm shift in deep learning.
This paper proposes a new task-agnostic framework, SynBench, to measure the quality of pretrained representations using synthetic data.
arXiv Detail & Related papers (2022-10-06T15:25:00Z)
- Rethinking Self-Supervision Objectives for Generalizable Coherence Modeling [8.329870357145927]
Coherence evaluation of machine-generated text is one of the principal applications of coherence models that needs to be investigated.
We explore training data and self-supervision objectives that result in a model that generalizes well across tasks.
We show empirically that increasing the density of negative samples improves the basic model, and using a global negative queue further improves and stabilizes the model while training with hard negative samples.
arXiv Detail & Related papers (2021-10-14T07:44:14Z)
- Is Automated Topic Model Evaluation Broken?: The Incoherence of Coherence [62.826466543958624]
We look at the standardization gap and the validation gap in topic model evaluation.
Recent models relying on neural components surpass classical topic models according to these metrics.
We use automatic coherence along with the two most widely accepted human judgment tasks, namely, topic rating and word intrusion.
arXiv Detail & Related papers (2021-07-05T17:58:52Z)
- On the model-based stochastic value gradient for continuous reinforcement learning [50.085645237597056]
We show that simple model-based agents can outperform state-of-the-art model-free agents in terms of both sample-efficiency and final reward.
Our findings suggest that model-based policy evaluation deserves closer attention.
arXiv Detail & Related papers (2020-08-28T17:58:29Z)
- Estimating the Effects of Continuous-valued Interventions using Generative Adversarial Networks [103.14809802212535]
We build on the generative adversarial networks (GANs) framework to address the problem of estimating the effect of continuous-valued interventions.
Our model, SCIGAN, is flexible and capable of simultaneously estimating counterfactual outcomes for several different continuous interventions.
To address the challenges presented by shifting to continuous interventions, we propose a novel architecture for our discriminator.
arXiv Detail & Related papers (2020-02-27T18:46:21Z)