TopicTag: Automatic Annotation of NMF Topic Models Using Chain of Thought and Prompt Tuning with LLMs
- URL: http://arxiv.org/abs/2407.19616v1
- Date: Mon, 29 Jul 2024 00:18:17 GMT
- Title: TopicTag: Automatic Annotation of NMF Topic Models Using Chain of Thought and Prompt Tuning with LLMs
- Authors: Selma Wanna, Ryan Barron, Nick Solovyev, Maksim E. Eren, Manish Bhattarai, Kim Rasmussen, Boian S. Alexandrov
- Abstract summary: Non-negative matrix factorization (NMF) is a common unsupervised approach that decomposes a term frequency-inverse document frequency (TF-IDF) matrix to uncover latent topics.
We present a methodology for automating topic labeling in documents clustered via NMF with automatic model determination (NMFk).
By leveraging the output of NMFk and employing prompt engineering, we utilize large language models (LLMs) to generate accurate topic labels.
- Score: 1.1826529992155377
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Topic modeling is a technique for organizing and extracting themes from large collections of unstructured text. Non-negative matrix factorization (NMF) is a common unsupervised approach that decomposes a term frequency-inverse document frequency (TF-IDF) matrix to uncover latent topics and segment the dataset accordingly. While useful for highlighting patterns and clustering documents, NMF does not provide explicit topic labels, necessitating subject matter experts (SMEs) to assign labels manually. We present a methodology for automating topic labeling in documents clustered via NMF with automatic model determination (NMFk). By leveraging the output of NMFk and employing prompt engineering, we utilize large language models (LLMs) to generate accurate topic labels. Our case study on over 34,000 scientific abstracts on Knowledge Graphs demonstrates the effectiveness of our method in enhancing knowledge management and document organization.
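The pipeline described in the abstract (TF-IDF matrix, NMF factorization, then an LLM prompt built from each topic's top terms) can be sketched roughly as follows. This is a minimal illustration in pure Python, not the authors' implementation: the number of topics `k` is fixed by hand instead of being selected by NMFk, the factorization uses standard multiplicative updates, and the toy documents, the helper names (`tfidf`, `nmf`, `top_terms`, `build_prompt`), and the prompt wording are all assumptions made for this example. No LLM is called; the sketch stops at producing the prompt text.

```python
import math
import random


def tfidf(docs):
    """Return (vocab, rows); rows[i][j] is the TF-IDF weight of term j in doc i."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    index = {t: j for j, t in enumerate(vocab)}
    n = len(docs)
    df = [0] * len(vocab)
    for doc in tokenized:
        for t in set(doc):
            df[index[t]] += 1
    rows = []
    for doc in tokenized:
        tf = [0.0] * len(vocab)
        for t in doc:
            tf[index[t]] += 1.0
        # Smoothed IDF: log((1 + n) / (1 + df)) + 1 is never negative.
        rows.append([c * (math.log((1 + n) / (1 + df[j])) + 1.0)
                     for j, c in enumerate(tf)])
    return vocab, rows


def transpose(a):
    return [list(r) for r in zip(*a)]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def nmf(x, k, iters=200, seed=0, eps=1e-9):
    """Factor x (docs x terms) into non-negative w (docs x k) and h (k x terms)."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # Multiplicative update for h: h *= (w^T x) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # Multiplicative update for w: w *= (x h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h


def top_terms(h, vocab, n_top=5):
    """For each topic row of h, return the n_top highest-weighted terms."""
    return [[vocab[j] for j in sorted(range(len(vocab)), key=lambda j: -row[j])[:n_top]]
            for row in h]


def build_prompt(terms):
    """Hypothetical chain-of-thought style prompt; the wording is an assumption."""
    return ("You are labeling topics produced by a topic model.\n"
            f"Top terms: {', '.join(terms)}\n"
            "Think step by step about what these terms have in common, "
            "then answer with a short topic label.")


# Toy corpus (made up for illustration).
docs = [
    "knowledge graph embedding link prediction",
    "graph embedding entity relation",
    "topic model matrix factorization text",
    "matrix factorization topic text corpus",
]
vocab, x = tfidf(docs)
w, h = nmf(x, k=2)
topics = top_terms(h, vocab, n_top=3)
for terms in topics:
    print(build_prompt(terms))
    print()
```

Each row of `h` weights the vocabulary for one topic, and each row of `w` gives a document's loading on the topics, so a document can be assigned to the topic with its largest `w` entry. In practice a library implementation of TF-IDF and NMF would replace these hand-written helpers.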
Related papers
- M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization? [49.53982792497275]
We investigate whether Large Vision-Language Models (LVLMs) genuinely comprehend interleaved image-text in documents.
Existing document understanding benchmarks often assess LVLMs using question-answer formats.
We introduce a novel and challenging Multimodal Document Summarization Benchmark (M-DocSum-Bench).
M-DocSum-Bench comprises 500 high-quality arXiv papers, along with interleaved multimodal summaries aligned with human preferences.
arXiv Detail & Related papers (2025-03-27T07:28:32Z)
- Exploring Topic Trends in COVID-19 Research Literature using Non-Negative Matrix Factorization [2.8777530051393314]
We apply topic modeling using Non-Negative Matrix Factorization (NMF) on the COVID-19 Open Research dataset.
NMF factorizes the document-term matrix into two non-negative matrices, effectively representing the topics and their distribution across the documents.
Our findings contribute to the understanding of the knowledge structure of the COVID-19 research landscape.
arXiv Detail & Related papers (2025-03-23T19:37:52Z)
- Order-agnostic Identifier for Large Language Model-based Generative Recommendation [94.37662915542603]
Items are assigned identifiers for Large Language Models (LLMs) to encode user history and generate the next item.
Existing approaches leverage either token-sequence identifiers, representing items as discrete token sequences, or single-token identifiers, using ID or semantic embeddings.
We propose SETRec, which leverages semantic tokenizers to obtain order-agnostic multi-dimensional tokens.
arXiv Detail & Related papers (2025-02-15T15:25:38Z)
- Using LLM-Based Approaches to Enhance and Automate Topic Labeling [13.581341206178525]
This study explores the use of Large Language Models (LLMs) to automate and enhance topic labeling.
After applying BERTopic for topic modeling, we explore different approaches to select keywords and document summaries within each topic.
Each approach prioritizes different aspects, such as dominant themes or diversity, to assess their impact on label quality.
arXiv Detail & Related papers (2025-02-03T08:07:05Z)
- STORE: Streamlining Semantic Tokenization and Generative Recommendation with A Single LLM [59.08493154172207]
We propose a unified framework to streamline the semantic tokenization and generative recommendation process.
We formulate semantic tokenization as a text-to-token task and generative recommendation as a token-to-token task, supplemented by a token-to-text reconstruction task and a text-to-token auxiliary task.
All these tasks are framed in a generative manner and trained using a single large language model (LLM) backbone.
arXiv Detail & Related papers (2024-09-11T13:49:48Z)
- Beyond Mask: Rethinking Guidance Types in Few-shot Segmentation [67.35274834837064]
We develop a universal vision-language framework (UniFSS) to integrate prompts from text, mask, box, and image.
UniFSS significantly outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2024-07-16T08:41:01Z)
- A Process for Topic Modelling Via Word Embeddings [0.0]
This work combines algorithms based on word embeddings, dimensionality reduction, and clustering.
The objective is to obtain topics from a set of unclassified texts.
arXiv Detail & Related papers (2023-10-06T15:10:35Z)
- KMF: Knowledge-Aware Multi-Faceted Representation Learning for Zero-Shot Node Classification [75.95647590619929]
Zero-Shot Node Classification (ZNC) has been an emerging and crucial task in graph data analysis.
We propose a Knowledge-Aware Multi-Faceted framework (KMF) that enhances the richness of label semantics.
A novel geometric constraint is developed to alleviate the problem of prototype drift caused by node information aggregation.
arXiv Detail & Related papers (2023-08-15T02:38:08Z)
- FETA: Towards Specializing Foundation Models for Expert Task Applications [49.57393504125937]
Foundation Models (FMs) have demonstrated unprecedented capabilities, including zero-shot learning, high-fidelity data synthesis, and out-of-domain generalization.
We show in this paper that FMs still have poor out-of-the-box performance on expert tasks.
We propose a first-of-its-kind FETA benchmark built around the task of teaching FMs to understand technical documentation.
arXiv Detail & Related papers (2022-09-08T08:47:57Z)
- Federated Non-negative Matrix Factorization for Short Texts Topic Modeling with Mutual Information [43.012719398648144]
This paper proposes a Federated NMF (FedNMF) framework, which allows multiple clients to collaboratively train a high-quality NMF-based topic model with locally stored data.
Experimental results show that our FedNMF+MI methods outperform Federated Latent Dirichlet Allocation (FedLDA) and the FedNMF without MI methods for short texts.
arXiv Detail & Related papers (2022-05-26T12:22:34Z)
- Topic Discovery via Latent Space Clustering of Pretrained Language Model Representations [35.74225306947918]
We propose a joint latent space learning and clustering framework built upon PLM embeddings.
Our model effectively leverages the strong representation power and superb linguistic features brought by PLMs for topic discovery.
arXiv Detail & Related papers (2022-02-09T17:26:08Z)
- Novel Class Discovery in Semantic Segmentation [104.30729847367104]
We introduce a new setting of Novel Class Discovery in Semantic Segmentation (NCDSS).
It aims at segmenting unlabeled images containing new classes given prior knowledge from a labeled set of disjoint classes.
In NCDSS, we need to distinguish the objects and background, and to handle the existence of multiple classes within an image.
We propose the Entropy-based Uncertainty Modeling and Self-training (EUMS) framework to overcome noisy pseudo-labels.
arXiv Detail & Related papers (2021-12-03T13:31:59Z)
- CoPHE: A Count-Preserving Hierarchical Evaluation Metric in Large-Scale Multi-Label Text Classification [70.554573538777]
We argue for hierarchical evaluation of the predictions of neural LMTC models.
We describe a structural issue in the representation of the structured label space in prior art.
We propose a set of metrics for hierarchical evaluation using the depth-based representation.
arXiv Detail & Related papers (2021-09-10T13:09:12Z)
- TAN-NTM: Topic Attention Networks for Neural Topic Modeling [8.631228373008478]
We propose a novel framework, TAN-NTM, which models a document as a sequence of tokens instead of a bag of words (BoW) at the input layer.
We apply attention to the LSTM outputs so that the model can attend to relevant words that convey topic-related cues.
TAN-NTM achieves state-of-the-art results, with a 9-15 percent improvement over the scores of existing SOTA topic models on the NPMI coherence metric.
arXiv Detail & Related papers (2020-12-02T20:58:04Z)
- Robust Document Representations using Latent Topics and Metadata [17.306088038339336]
We propose a novel approach to fine-tuning a pre-trained neural language model for document classification problems.
We generate document representations that capture both text and metadata artifacts in a task-agnostic manner.
Our solution also incorporates metadata explicitly rather than just augmenting them with text.
arXiv Detail & Related papers (2020-10-23T21:52:38Z)