Is Automated Topic Model Evaluation Broken?: The Incoherence of
Coherence
- URL: http://arxiv.org/abs/2107.02173v1
- Date: Mon, 5 Jul 2021 17:58:52 GMT
- Title: Is Automated Topic Model Evaluation Broken?: The Incoherence of
Coherence
- Authors: Alexander Hoyle, Pranav Goel, Denis Peskov, Andrew Hian-Cheong, Jordan
Boyd-Graber, Philip Resnik
- Abstract summary: We look at the standardization gap and the validation gap in topic model evaluation.
Recent models relying on neural components surpass classical topic models according to these metrics.
We use automatic coherence along with the two most widely accepted human judgment tasks, namely, topic rating and word intrusion.
- Score: 62.826466543958624
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Topic model evaluation, like evaluation of other unsupervised methods, can be
contentious. However, the field has coalesced around automated estimates of
topic coherence, which rely on the frequency of word co-occurrences in a
reference corpus. Recent models relying on neural components surpass classical
topic models according to these metrics. At the same time, unlike classical
models, the practice of neural topic model evaluation suffers from a validation
gap: automatic coherence for neural models has not been validated using human
experimentation. In addition, as we show via a meta-analysis of topic modeling
literature, there is a substantial standardization gap in the use of automated
topic modeling benchmarks. We address both the standardization gap and the
validation gap. Using two of the most widely used topic model evaluation
datasets, we assess a dominant classical model and two state-of-the-art neural
models in a systematic, clearly documented, reproducible way. We use automatic
coherence along with the two most widely accepted human judgment tasks, namely,
topic rating and word intrusion. Automated evaluation will declare one model
significantly different from another when corresponding human evaluations do
not, calling into question the validity of fully automatic evaluations
independent of human judgments.
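For orientation, the "automated estimates of topic coherence" referred to above are typically averages of normalized pointwise mutual information (NPMI) over a topic's top words, computed from co-occurrence counts in a reference corpus. Below is a minimal Python sketch of that idea under simplifying assumptions: the document-level co-occurrence counting, the toy reference corpus, and the function name npmi_coherence are illustrative, not the exact metrics or pipeline used in the paper (which rely on sliding-window counts over large reference corpora such as Wikipedia).

```python
import math
from itertools import combinations

def npmi_coherence(topic_words, documents):
    """Average NPMI over all pairs of a topic's top words, using
    document-level co-occurrence counts from a reference corpus.
    A simplified sketch: the metrics evaluated in the paper (e.g. C_v,
    C_NPMI) use sliding windows over large reference corpora instead."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of reference documents containing all the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w_i, w_j in combinations(topic_words, 2):
        p_i, p_j, p_ij = prob(w_i), prob(w_j), prob(w_i, w_j)
        if p_ij == 0.0:
            scores.append(-1.0)   # convention: pair never co-occurs
        elif p_ij == 1.0:
            scores.append(0.0)    # degenerate: both words in every document
        else:
            pmi = math.log(p_ij / (p_i * p_j))
            scores.append(pmi / -math.log(p_ij))
    return sum(scores) / len(scores)

# Illustrative usage with a toy tokenized "reference corpus".
reference = [
    ["market", "stock", "price", "trade"],
    ["game", "team", "score", "season"],
    ["stock", "price", "investor", "market"],
]
print(npmi_coherence(["market", "stock", "price"], reference))  # close to 1.0
```

The human-judgment tasks named in the abstract work differently: topic rating asks annotators to score a topic's top words directly, while word intrusion asks them to spot a randomly injected "intruder" word, with the detection rate serving as the quality signal.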
Related papers
- Improving the TENOR of Labeling: Re-evaluating Topic Models for Content Analysis [5.757610495733924]
We conduct the first evaluation of neural, supervised, and classical topic models in an interactive, task-based setting.
We show that current automated metrics do not provide a complete picture of topic modeling capabilities.
arXiv Detail & Related papers (2024-01-29T17:54:04Z)
- Are Neural Topic Models Broken? [81.15470302729638]
We study the relationship between automated and human evaluation of topic models.
We find that neural topic models fare worse in both respects compared to an established classical method.
arXiv Detail & Related papers (2022-10-28T14:38:50Z)
- Incorporating Causal Analysis into Diversified and Logical Response Generation [14.4586344491264]
The Conditional Variational AutoEncoder (CVAE) model can generate more diverse responses than the traditional Seq2Seq model.
We propose to predict mediators that preserve relevant information and to incorporate the mediators auto-regressively into the generation process.
arXiv Detail & Related papers (2022-09-20T05:51:11Z)
- Have you tried Neural Topic Models? Comparative Analysis of Neural and Non-Neural Topic Models with Application to COVID-19 Twitter Data [11.199249808462458]
We conduct a comparative study examining state-of-the-art neural versus non-neural topic models.
We show that neural topic models outperform their classical counterparts on standard evaluation metrics.
We also propose a novel regularization term for neural topic models, which is designed to address the well-documented problem of mode collapse.
arXiv Detail & Related papers (2021-05-21T07:24:09Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
- Firearm Detection via Convolutional Neural Networks: Comparing a Semantic Segmentation Model Against End-to-End Solutions [68.8204255655161]
Threat detection of weapons and aggressive behavior from live video can be used for rapid detection and prevention of potentially deadly incidents.
One way for achieving this is through the use of artificial intelligence and, in particular, machine learning for image analysis.
We compare a traditional monolithic end-to-end deep learning model with a previously proposed model based on an ensemble of simpler neural networks that detects firearms via semantic segmentation.
arXiv Detail & Related papers (2020-12-17T15:19:29Z)
- On the Transferability of Adversarial Attacks against Neural Text Classifier [121.6758865857686]
We investigate the transferability of adversarial examples for text classification models.
We propose a genetic algorithm to find an ensemble of models that can induce adversarial examples to fool almost all existing models.
We derive word replacement rules that can be used for model diagnostics from these adversarial examples.
arXiv Detail & Related papers (2020-11-17T10:45:05Z)
- Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate current state-of-the-art AES models using a model-adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable: even heavy modifications (as much as 25%) with content unrelated to the topic of the question do not decrease the score produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z)
- Speaker Sensitive Response Evaluation Model [17.381658875470638]
We propose an automatic evaluation model based on the similarity of the generated response with the conversational context.
We learn the model parameters from an unlabeled conversation corpus.
We show that our model can be applied to movie dialogues without any additional training.
arXiv Detail & Related papers (2020-06-12T08:59:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information (including all content) and is not responsible for any consequences of its use.