A modified model for topic detection from a corpus and a new metric evaluating the understandability of topics
- URL: http://arxiv.org/abs/2306.04941v1
- Date: Thu, 8 Jun 2023 05:17:03 GMT
- Title: A modified model for topic detection from a corpus and a new metric evaluating the understandability of topics
- Authors: Tomoya Kitano, Yuto Miyatake, Daisuke Furihata
- Abstract summary: The new model builds upon the embedded topic model, incorporating modifications such as document clustering.
Numerical experiments suggest that the new model performs favourably regardless of document length.
The new metric, which can be computed more efficiently than widely used metrics such as topic coherence, provides valuable information regarding the understandability of the detected topics.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a modified neural model for topic detection from a corpus
and proposes a new metric to evaluate the detected topics. The new model builds
upon the embedded topic model, incorporating modifications such as document
clustering. Numerical experiments suggest that the new model performs
favourably regardless of document length. The new metric, which can be
computed more efficiently than widely used metrics such as topic coherence,
provides valuable information regarding the understandability of the detected
topics.
Related papers
- Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z) - TopicAdapt - An Inter-Corpora Topics Adaptation Approach [27.450275637652418]
This paper proposes a neural topic model, TopicAdapt, that can adapt relevant topics from a related source corpus and also discover new topics in a target corpus that are absent in the source corpus.
Experiments over multiple datasets from diverse domains show the superiority of the proposed model against the state-of-the-art topic models.
arXiv Detail & Related papers (2023-10-08T02:56:44Z) - How Does Generative Retrieval Scale to Millions of Passages? [68.98628807288972]
We conduct the first empirical study of generative retrieval techniques across various corpus scales.
We scale generative retrieval to millions of passages, using a corpus of 8.8M passages and evaluating model sizes up to 11B parameters.
While generative retrieval is competitive with state-of-the-art dual encoders on small corpora, scaling to millions of passages remains an important and unsolved challenge.
arXiv Detail & Related papers (2023-05-19T17:33:38Z) - Improving Contextualized Topic Models with Negative Sampling [3.708656266586146]
We propose a negative sampling mechanism for a contextualized topic model to improve the quality of the generated topics.
In particular, during model training, we perturb the generated document-topic vector and use a triplet loss to encourage the document reconstructed from the correct document-topic vector to be similar to the input document.
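The triplet-loss idea above can be illustrated with a minimal sketch. The vectors below are made up for the example (the paper works with reconstructions from document-topic vectors inside a trained model), and the loss is the standard margin-based triplet formulation:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: push the anchor closer to the positive
    than to the negative, by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

# Toy setup: an input document representation (anchor), a reconstruction
# from the correct document-topic vector (positive), and one from a
# perturbed topic vector (negative). All values are hypothetical.
doc             = np.array([0.9, 0.1, 0.0])
recon_correct   = np.array([0.8, 0.2, 0.0])  # from the true topic vector
recon_perturbed = np.array([0.1, 0.2, 0.7])  # from the perturbed vector

loss = triplet_loss(doc, recon_correct, recon_perturbed)
print(loss)
```

Minimizing this loss rewards reconstructions from the correct document-topic vector over those from perturbed ones, which is the mechanism the abstract describes.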
arXiv Detail & Related papers (2023-03-27T07:28:46Z) - HanoiT: Enhancing Context-aware Translation via Selective Context [95.93730812799798]
Context-aware neural machine translation aims to use the document-level context to improve translation quality.
The irrelevant or trivial words may bring some noise and distract the model from learning the relationship between the current sentence and the auxiliary context.
We propose a novel end-to-end encoder-decoder model with a layer-wise selection mechanism to sift and refine the long document context.
arXiv Detail & Related papers (2023-01-17T12:07:13Z) - Coordinated Topic Modeling [10.710176350043998]
We propose a new problem called coordinated topic modeling that imitates human behavior while describing a text corpus.
We design ECTM, an embedding-based coordinated topic model that effectively uses the reference representation to capture the target corpus-specific aspects.
In ECTM, we introduce topic- and document-level supervision with a self-training mechanism to solve the problem.
arXiv Detail & Related papers (2022-10-16T15:10:54Z) - Knowledge-Aware Bayesian Deep Topic Model [50.58975785318575]
We propose a Bayesian generative model for incorporating prior domain knowledge into hierarchical topic modeling.
Our proposed model efficiently integrates the prior knowledge and improves both hierarchical topic discovery and document representation.
arXiv Detail & Related papers (2022-09-20T09:16:05Z) - Representing Mixtures of Word Embeddings with Mixtures of Topic Embeddings [46.324584649014284]
A topic model is often formulated as a generative model that explains how each word of a document is generated given a set of topics and document-specific topic proportions.
This paper introduces a new topic-modeling framework where each document is viewed as a set of word embedding vectors and each topic is modeled as an embedding vector in the same embedding space.
Embedding the words and topics in the same vector space, we define a method to measure the semantic difference between the embedding vectors of a document's words and those of the topics, and optimize the topic embeddings to minimize the expected difference over all documents.
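With words and topics in one embedding space, a simple stand-in for the semantic difference described above is each word's distance to its nearest topic embedding, averaged over the document. This is a deliberate simplification with made-up 2-D vectors, not the paper's actual difference measure:

```python
import numpy as np

# Hypothetical 2-D embeddings for illustration only.
word_vecs  = np.array([[0.0, 1.0], [0.2, 0.9], [1.0, 0.0]])  # words of one document
topic_vecs = np.array([[0.1, 1.0], [1.0, 0.1]])              # two topic embeddings

# Euclidean distance from every word to every topic: shape (n_words, n_topics).
dists = np.linalg.norm(word_vecs[:, None, :] - topic_vecs[None, :, :], axis=-1)

# Simplified semantic difference: match each word to its nearest topic,
# then average over the document's words.
doc_cost = dists.min(axis=1).mean()
print(doc_cost)
```

Optimizing the topic embeddings to shrink this cost, averaged over all documents, pulls each topic toward the word clusters it should represent.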
arXiv Detail & Related papers (2022-03-03T08:46:23Z) - Semiparametric Latent Topic Modeling on Consumer-Generated Corpora [0.0]
This paper proposes a semiparametric topic model, a two-step approach utilizing nonnegative matrix factorization and semiparametric regression in topic modeling.
The model enables the reconstruction of sparse topic structures in the corpus and provides a generative model for predicting topics in new documents entering the corpus.
In an actual consumer feedback corpus, the model also demonstrably provides interpretable and useful topic definitions comparable with those produced by other methods.
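The first step of the two-step approach above, nonnegative matrix factorization of the document-term matrix, can be sketched with classic multiplicative updates. The toy matrix and the plain-NumPy NMF below are illustrative assumptions; the paper's second step (semiparametric regression on the factors) is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy document-term matrix with roughly two latent topics by construction.
V = np.array([
    [3, 2, 0, 0],
    [4, 3, 0, 1],
    [0, 1, 5, 4],
    [0, 0, 4, 3],
], dtype=float)

def nmf(V, k, iters=200, eps=1e-9):
    """Step 1: multiplicative-update NMF, V ≈ W @ H, with
    W as document-topic weights and H as topic-word loadings."""
    n, m = V.shape
    W = rng.random((n, k)) + 0.1
    H = rng.random((k, m)) + 0.1
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # update topic-word loadings
        W *= (V @ H.T) / (W @ H @ H.T + eps)  # update document-topic weights
    return W, H

W, H = nmf(V, k=2)
# Check the quality of the rank-2 reconstruction from step 1.
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(err)
```

The multiplicative updates keep both factors nonnegative, which is what makes the rows of H readable as sparse topic-word loadings.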
arXiv Detail & Related papers (2021-07-13T00:22:02Z) - Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on the Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as, or better than, traditional approaches to problems arising in short text.
arXiv Detail & Related papers (2021-06-15T20:55:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.