Topic Modelling: Going Beyond Token Outputs
- URL: http://arxiv.org/abs/2401.12990v1
- Date: Tue, 16 Jan 2024 16:05:54 GMT
- Authors: Lowri Williams, Eirini Anthi, Laura Arman, Pete Burnap
- Abstract summary: This paper presents a novel approach towards extending the output of traditional topic modelling methods beyond a list of isolated tokens.
To measure the interpretability of the proposed outputs against those of the traditional topic modelling approach, independent annotators manually scored each output.
- Score: 3.072340427031969
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Topic modelling is a text mining technique for identifying salient themes
from a number of documents. The output is commonly a set of topics consisting
of isolated tokens that often co-occur in such documents. Manual effort is
often associated with interpreting a topic's description from such tokens.
However, from a human's perspective, such outputs may not provide
enough information to infer the meaning of the topics; thus, the topics
are often interpreted inaccurately. Although several studies
have attempted to automatically extend topic descriptions as a means of
enhancing the interpretation of topic models, they rely on external language
sources that may become unavailable, must be kept up-to-date to generate
relevant results, and present privacy issues when training on or processing
data. This paper presents a novel approach towards extending the output of
traditional topic modelling methods beyond a list of isolated tokens. This
approach removes the dependence on external sources by extracting
high-scoring keywords from the textual data itself and mapping them to the topic
model's token outputs. To measure the interpretability of the proposed outputs
against those of the traditional topic modelling approach, independent
annotators manually scored each output based on their quality and usefulness,
as well as the efficiency of the annotation task. The proposed approach
demonstrated higher quality and usefulness, as well as higher efficiency in the
annotation task, in comparison to the outputs of a traditional topic modelling
method, demonstrating an increase in their interpretability.
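The mapping step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the topic model or the keyword extractor, so everything here is an assumption: the topic tokens are hard-coded as a hypothetical topic model output, plain n-gram frequency stands in for the keyword scoring, and the names STOPWORDS, stopword_free_runs, score_phrases and extend_topic are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption, not from the paper).
STOPWORDS = {"a", "an", "and", "in", "of", "the", "them", "to"}


def stopword_free_runs(text):
    """Split text into runs of consecutive non-stopword tokens."""
    runs, run = [], []
    for fragment in re.split(r"[.,;:!?()\n]", text.lower()):
        for token in re.findall(r"[a-z]+", fragment):
            if token in STOPWORDS:
                if run:
                    runs.append(run)
                run = []
            else:
                run.append(token)
        if run:
            runs.append(run)
        run = []
    return runs


def score_phrases(documents, max_len=3):
    """Score each 2..max_len word phrase by how often it occurs in the corpus.

    Raw frequency is a stand-in for whatever keyword score is actually used.
    """
    counts = Counter()
    for doc in documents:
        for run in stopword_free_runs(doc):
            for n in range(2, max_len + 1):
                for i in range(len(run) - n + 1):
                    counts[tuple(run[i:i + n])] += 1
    return counts


def extend_topic(topic_tokens, phrase_scores, per_token=1):
    """Map each topic token to the highest-scoring phrases that contain it."""
    extended = {}
    for token in topic_tokens:
        matches = [(score, " ".join(p)) for p, score in phrase_scores.items() if token in p]
        # Highest score first; ties are broken alphabetically for determinism.
        matches.sort(key=lambda m: (-m[0], m[1]))
        extended[token] = [phrase for _, phrase in matches[:per_token]]
    return extended


documents = [
    "Topic modelling identifies salient themes in documents.",
    "Keyword extraction finds salient themes and maps them to topic modelling outputs.",
]
topic = ["topic", "themes", "keyword"]  # hypothetical tokens from a topic model

scores = score_phrases(documents)
extended = extend_topic(topic, scores)
print(extended)
# {'topic': ['topic modelling'], 'themes': ['salient themes'], 'keyword': ['keyword extraction']}
```

Because the phrases come only from the input documents, the sketch needs no external language source, which is the property the abstract emphasises.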
Related papers
- CAST: Corpus-Aware Self-similarity Enhanced Topic modelling [16.562349140796115]
We introduce CAST: Corpus-Aware Self-similarity Enhanced Topic modelling, a novel topic modelling method.
We find self-similarity to be an effective metric to prevent functional words from acting as candidate topic words.
Our approach significantly enhances the coherence and diversity of generated topics, as well as the topic model's ability to handle noisy data.
(2024-10-19T15:27:11Z)
- Prompting Large Language Models for Topic Modeling [10.31712610860913]
We propose PromptTopic, a novel topic modeling approach that harnesses the advanced language understanding of large language models (LLMs).
It involves extracting topics at the sentence level from individual documents, then aggregating and condensing these topics into a predefined quantity, ultimately providing coherent topics for texts of varying lengths.
We benchmark PromptTopic against the state-of-the-art baselines on three vastly diverse datasets, establishing its proficiency in discovering meaningful topics.
(2023-12-15T11:15:05Z)
- Topics in the Haystack: Extracting and Evaluating Topics beyond Coherence [0.0]
We propose a method that incorporates a deeper understanding of both sentence and document themes.
This allows our model to detect latent topics that may include uncommon words or neologisms.
We present correlation coefficients with human identification of intruder words and achieve near-human level results at the word-intrusion task.
(2023-03-30T12:24:25Z)
- Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images.
We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities.
The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
(2023-03-23T08:21:16Z)
- Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as, or better than, traditional approaches to problems arising in short text.
(2021-06-15T20:55:55Z)
- Unsupervised Graph-based Topic Modeling from Video Transcriptions [5.210353244951637]
We develop a topic extractor on video transcriptions using neural word embeddings and a graph-based clustering method.
Experimental results on the real-life multimodal data set MuSe-CaR demonstrate that our approach extracts coherent and meaningful topics.
(2021-05-04T12:48:17Z)
- Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
(2021-04-17T21:34:10Z)
- Generative Counterfactuals for Neural Networks via Attribute-Informed Perturbation [51.29486247405601]
We design a framework to generate counterfactuals for raw data instances with the proposed Attribute-Informed Perturbation (AIP).
By utilizing generative models conditioned with different attributes, counterfactuals with desired labels can be obtained effectively and efficiently.
Experimental results on real-world texts and images demonstrate the effectiveness, sample quality as well as efficiency of our designed framework.
(2021-01-18T08:37:13Z)
- Lexically-constrained Text Generation through Commonsense Knowledge Extraction and Injection [62.071938098215085]
We focus on the CommonGen benchmark, wherein the aim is to generate a plausible sentence for a given set of input concepts.
We propose strategies for enhancing the semantic correctness of the generated text.
(2020-12-19T23:23:40Z)
- Few-Shot Learning for Opinion Summarization [117.70510762845338]
Opinion summarization is the automatic creation of text reflecting subjective information expressed in multiple documents.
In this work, we show that even a handful of summaries is sufficient to bootstrap generation of the summary text.
Our approach substantially outperforms previous extractive and abstractive methods in automatic and human evaluation.
(2020-04-30T15:37:38Z)