Graph2topic: an opensource topic modeling framework based on sentence
embedding and community detection
- URL: http://arxiv.org/abs/2304.06653v3
- Date: Tue, 6 Jun 2023 11:56:27 GMT
- Authors: Leihang Zhang, Jiapeng Liu, Qiang Yan
- Abstract summary: Clustering-based topic models can generate better topics than generative probabilistic topic models.
We propose graph to topic (G2T), a simple but effective framework for topic modelling.
G2T achieved state-of-the-art performance on both English and Chinese documents with different lengths.
- Score: 1.6242924916178283
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It has been reported that clustering-based topic models, which cluster
high-quality sentence embeddings with an appropriate word selection method, can
generate better topics than generative probabilistic topic models. However,
these approaches suffer from the inability to select appropriate parameters and
incomplete models that overlook the quantitative relation between words with
topics and topics with text. To solve these issues, we propose graph to topic
(G2T), a simple but effective framework for topic modelling. The framework is
composed of four modules. First, document representation is acquired using
pretrained language models. Second, a semantic graph is constructed according
to the similarity between document representations. Third, communities in
document semantic graphs are identified, and the relationship between topics
and documents is quantified accordingly. Fourth, the word-topic distribution
is computed based on a variant of TFIDF. Automatic evaluation suggests that G2T
achieved state-of-the-art performance on both English and Chinese documents
with different lengths.
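
The four modules described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation, and every function name here is hypothetical. Three stand-ins are assumed: bag-of-words count vectors replace pretrained language model embeddings, connected components of a thresholded cosine-similarity graph replace a real community detection algorithm, and a simple "term frequency within topic times log inverse topic frequency" score replaces the paper's TFIDF variant, whose exact form the abstract does not give. Document-topic relations are reduced to hard community membership.

```python
import math
from collections import Counter


def embed(docs):
    """Module 1 stand-in: bag-of-words counts instead of a pretrained model."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    return [[d.lower().split().count(w) for w in vocab] for d in docs]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def build_graph(vectors, threshold):
    """Module 2: link documents whose cosine similarity reaches the threshold."""
    adj = {i: set() for i in range(len(vectors))}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def communities(adj):
    """Module 3 stand-in: connected components found by depth-first search."""
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            node = stack.pop()
            comp.append(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        result.append(sorted(comp))
    return result


def topic_words(docs, comms, top_k=3):
    """Module 4 stand-in: rank words per topic by a TF-IDF-like score."""
    counts = [Counter(w for d in comm for w in docs[d].lower().split())
              for comm in comms]
    # Number of topics in which each word occurs at least once.
    topic_freq = Counter(w for c in counts for w in c)
    ranked = []
    for c in counts:
        total = sum(c.values())
        scores = {w: (f / total) * math.log(1 + len(counts) / topic_freq[w])
                  for w, f in c.items()}
        # Highest score first; ties are broken alphabetically.
        ranked.append(sorted(scores, key=lambda w: (-scores[w], w))[:top_k])
    return ranked


docs = [
    "stock market prices fall",
    "market investors sell stock",
    "team wins football match",
    "football team loses match",
]
comms = communities(build_graph(embed(docs), threshold=0.4))
print(comms)
print(topic_words(docs, comms, top_k=2))
```

In this toy run the similarity threshold (0.4) is an arbitrary choice. A real pipeline would swap in sentence embeddings and a graph community detection method, which is where the parameter-selection issue raised in the abstract would apply.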
Related papers
- Explaining Datasets in Words: Statistical Models with Natural Language Parameters [66.69456696878842]
We introduce a family of statistical models -- including clustering, time series, and classification models -- parameterized by natural language predicates.
We apply our framework to a wide range of problems: taxonomizing user chat dialogues, characterizing how they evolve across time, finding categories where one language model is better than the other.
arXiv Detail & Related papers (2024-09-13T01:40:20Z)
- Conversational Semantic Parsing using Dynamic Context Graphs [68.72121830563906]
We consider the task of conversational semantic parsing over general purpose knowledge graphs (KGs) with millions of entities, and thousands of relation-types.
We focus on models which are capable of interactively mapping user utterances into executable logical forms.
arXiv Detail & Related papers (2023-05-04T16:04:41Z)
- Topics in the Haystack: Extracting and Evaluating Topics beyond Coherence [0.0]
We propose a method that incorporates a deeper understanding of both sentence and document themes.
This allows our model to detect latent topics that may include uncommon words or neologisms.
We present correlation coefficients with human identification of intruder words and achieve near-human level results at the word-intrusion task.
arXiv Detail & Related papers (2023-03-30T12:24:25Z)
- Document-Level Relation Extraction with Sentences Importance Estimation and Focusing [52.069206266557266]
Document-level relation extraction (DocRE) aims to determine the relation between two entities from a document of multiple sentences.
We propose a Sentence Importance Estimation and Focusing (SIEF) framework for DocRE, where we design a sentence importance score and a sentence focusing loss.
Experimental results on two domains show that our SIEF not only improves overall performance, but also makes DocRE models more robust.
arXiv Detail & Related papers (2022-04-27T03:20:07Z)
- Bilingual Topic Models for Comparable Corpora [9.509416095106491]
We propose a binding mechanism between the distributions of the paired documents.
To estimate the similarity of documents that are written in different languages we use cross-lingual word embeddings that are learned with shallow neural networks.
We evaluate the proposed binding mechanism by extending two topic models: a bilingual adaptation of LDA that assumes bag-of-words inputs and a model that incorporates part of the text structure in the form of boundaries of semantically coherent segments.
arXiv Detail & Related papers (2021-11-30T10:53:41Z)
- Multiplex Graph Neural Network for Extractive Text Summarization [34.185093491514394]
Extractive text summarization aims at extracting the most representative sentences from a given document as its summary.
We propose a novel Multiplex Graph Convolutional Network (Multi-GCN) to jointly model different types of relationships among sentences and words.
Based on Multi-GCN, we propose a Multiplex Graph Summarization (Multi-GraS) model for extractive text summarization.
arXiv Detail & Related papers (2021-08-29T16:11:01Z)
- Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on the Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as, or better than, traditional approaches to problems arising in short text.
arXiv Detail & Related papers (2021-06-15T20:55:55Z)
- BASS: Boosting Abstractive Summarization with Unified Semantic Graph [49.48925904426591]
BASS is a framework for Boosting Abstractive Summarization based on a unified Semantic graph.
A graph-based encoder-decoder model is proposed to improve both the document representation and summary generation process.
Empirical results show that the proposed architecture brings substantial improvements for both long-document and multi-document summarization tasks.
arXiv Detail & Related papers (2021-05-25T16:20:48Z)
- Topic Modeling with Contextualized Word Representation Clusters [8.49454123392354]
Clustering token-level contextualized word representations produces output that shares many similarities with topic models for English text collections.
We evaluate token clusterings trained from several different output layers of popular contextualized language models.
arXiv Detail & Related papers (2020-10-23T19:16:59Z)
- Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too! [5.819224524813161]
We propose an alternative way to obtain topics: clustering pre-trained word embeddings while incorporating document information for weighted clustering and reranking top words.
The best performing combination for our approach performs as well as classical topic models, but with lower runtime and computational complexity.
arXiv Detail & Related papers (2020-04-30T16:18:18Z)
- Extractive Summarization as Text Matching [123.09816729675838]
This paper creates a paradigm shift with regard to the way we build neural extractive summarization systems.
We formulate the extractive summarization task as a semantic text matching problem.
We have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1).
arXiv Detail & Related papers (2020-04-19T08:27:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.