Topics in the Haystack: Extracting and Evaluating Topics beyond
Coherence
- URL: http://arxiv.org/abs/2303.17324v1
- Date: Thu, 30 Mar 2023 12:24:25 GMT
- Title: Topics in the Haystack: Extracting and Evaluating Topics beyond
Coherence
- Authors: Anton Thielmann, Quentin Seifert, Arik Reuter, Elisabeth Bergherr,
Benjamin Säfken
- Abstract summary: We propose a method that incorporates a deeper understanding of both sentence and document themes.
This allows our model to detect latent topics that may include uncommon words or neologisms.
We present correlation coefficients with human identification of intruder words and achieve near-human level results at the word-intrusion task.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Extracting and identifying latent topics in large text corpora has gained
increasing importance in Natural Language Processing (NLP). Most models,
whether probabilistic models similar to Latent Dirichlet Allocation (LDA) or
neural topic models, follow the same underlying approach of topic
interpretability and topic extraction. We propose a method that incorporates a
deeper understanding of both sentence and document themes, and goes beyond
simply analyzing word frequencies in the data. This allows our model to detect
latent topics that may include uncommon words or neologisms, as well as words
not present in the documents themselves. Additionally, we propose several new
evaluation metrics based on intruder words and similarity measures in the
semantic space. We present correlation coefficients with human identification
of intruder words and achieve near-human level results at the word-intrusion
task. We demonstrate the competitive performance of our method with a large
benchmark study, and achieve superior results compared to state-of-the-art
topic modeling and document clustering models.
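The abstract's word-intrusion evaluation can be sketched as follows: given a topic's top words plus one intruder, an automatic judge picks the word least similar, on average, to the rest, and its choices can then be correlated with human picks. This is a minimal sketch of that idea, not the paper's implementation; the toy three-dimensional vectors stand in for real word embeddings.

```python
# Sketch of automated word-intrusion detection via embedding similarity.
# Toy vectors stand in for real embeddings; names and values are illustrative.
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_intruder(words, emb):
    """Return the word least similar, on average, to the other words."""
    scores = {}
    for w in words:
        others = [emb[o] for o in words if o != w]
        scores[w] = np.mean([cosine(emb[w], v) for v in others])
    return min(scores, key=scores.get)

# Toy embeddings: four sports terms and one cooking intruder.
emb = {
    "goal":   np.array([0.90, 0.10, 0.00]),
    "match":  np.array([0.80, 0.20, 0.10]),
    "league": np.array([0.85, 0.15, 0.05]),
    "coach":  np.array([0.70, 0.30, 0.10]),
    "recipe": np.array([0.10, 0.10, 0.90]),
}

print(find_intruder(list(emb), emb))  # prints: recipe
```

Agreement between this automatic choice and human annotators, measured as a correlation coefficient, is what the paper uses to validate its intruder-based metrics.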
Related papers
- CAST: Corpus-Aware Self-similarity Enhanced Topic modelling [16.562349140796115]
We introduce CAST: Corpus-Aware Self-similarity Enhanced Topic modelling, a novel topic modelling method.
We find self-similarity to be an effective metric to prevent functional words from acting as candidate topic words.
Our approach significantly enhances the coherence and diversity of generated topics, as well as the topic model's ability to handle noisy data.
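The self-similarity idea in CAST can be illustrated with a small sketch, assuming access to contextual embeddings of the same word in different contexts: a content word tends to keep a stable embedding direction across contexts, while a functional word drifts with its surroundings, so a low mean pairwise similarity flags it as a poor topic-word candidate. The vectors below are toy stand-ins, not the paper's data.

```python
# Sketch of a self-similarity score over one word's contextual embeddings.
import numpy as np

def self_similarity(contextual_vecs):
    """Mean pairwise cosine similarity of one word's contextual embeddings."""
    vecs = [v / np.linalg.norm(v) for v in contextual_vecs]
    sims = [float(np.dot(vecs[i], vecs[j]))
            for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    return float(np.mean(sims))

# Toy contextual embeddings: the content word keeps a stable direction,
# the functional word's embedding shifts with its context.
content_word = [np.array([0.90, 0.10]), np.array([0.85, 0.20]), np.array([0.95, 0.05])]
function_word = [np.array([0.90, 0.10]), np.array([0.10, 0.90]), np.array([0.50, 0.50])]

print(self_similarity(content_word) > self_similarity(function_word))  # prints: True
```

Filtering candidate topic words by a threshold on this score is one plausible way to keep functional words out of topic descriptions.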
arXiv Detail & Related papers (2024-10-19T15:27:11Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- A Unified Understanding of Deep NLP Models for Text Classification [88.35418976241057]
We have developed a visual analysis tool, DeepNLPVis, to enable a unified understanding of NLP models for text classification.
The key idea is a mutual information-based measure, which provides quantitative explanations on how each layer of a model maintains the information of input words in a sample.
A multi-level visualization, which consists of a corpus-level, a sample-level, and a word-level visualization, supports the analysis from the overall training set to individual samples.
arXiv Detail & Related papers (2022-06-19T08:55:07Z)
- Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on the Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as, or better than, traditional approaches to problems arising in short text.
arXiv Detail & Related papers (2021-06-15T20:55:55Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as their informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study presents an assessment of existing language models for distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- A Neural Generative Model for Joint Learning Topics and Topic-Specific Word Embeddings [42.87769996249732]
We propose a novel generative model to explore both local and global context for joint learning topics and topic-specific word embeddings.
The trained model maps words to topic-dependent embeddings, which naturally addresses the issue of word polysemy.
arXiv Detail & Related papers (2020-08-11T13:54:11Z)
- A Survey on Text Classification: From Shallow to Deep Learning [83.47804123133719]
The last decade has seen a surge of research in this area due to the unprecedented success of deep learning.
This paper fills the gap by reviewing the state-of-the-art approaches from 1961 to 2021.
We create a taxonomy for text classification according to the text involved and the models used for feature extraction and classification.
arXiv Detail & Related papers (2020-08-02T00:09:03Z)
- Explainable and Discourse Topic-aware Neural Language Understanding [22.443597046878086]
Marrying topic models and language models exposes language understanding to a broader source of document-level context beyond sentences.
Existing approaches incorporate latent document topic proportions and ignore topical discourse in sentences of the document.
We present a novel neural composite language model that exploits both the latent and explainable topics along with topical discourse at sentence-level.
arXiv Detail & Related papers (2020-06-18T15:53:58Z)
- Keyword Assisted Topic Models [0.0]
We show that providing a small number of keywords can substantially enhance the measurement performance of topic models.
KeyATM provides more interpretable results, has better document classification performance, and is less sensitive to the number of topics than the standard topic models.
arXiv Detail & Related papers (2020-04-13T14:35:28Z)
- How Far are We from Effective Context Modeling? An Exploratory Study on Semantic Parsing in Context [59.13515950353125]
We present a grammar-based decoding semantic parsing and adapt typical context modeling methods on top of it.
We evaluate 13 context modeling methods on two large cross-domain datasets, and our best model achieves state-of-the-art performances.
arXiv Detail & Related papers (2020-02-03T11:28:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.