A novel sentence embedding based topic detection method for micro-blog
- URL: http://arxiv.org/abs/2006.09977v1
- Date: Wed, 10 Jun 2020 09:58:57 GMT
- Title: A novel sentence embedding based topic detection method for micro-blog
- Authors: Cong Wan, Shan Jiang, Cuirong Wang, Cong Wang, Changming Xu, Xianxia
Chen, Ying Yuan
- Abstract summary: We present a novel neural-network-based approach to detect topics in a micro-blogging dataset.
We use an unsupervised neural sentence embedding model to map the blogs to an embedding space.
In addition, we propose an improved clustering algorithm referred to as relationship-aware DBSCAN (RADBSCAN).
- Score: 5.821169298644354
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Topic detection is a challenging task, especially when the exact
number of topics is unknown. In this paper, we present a novel
neural-network-based approach to detecting topics in a micro-blogging
dataset. We use an unsupervised neural sentence embedding model to map the
blogs to an embedding space. Our model is a weighted power mean word
embedding model, in which the weights are calculated by an attention
mechanism. Experimental results show that our embedding method outperforms
the baselines on sentence clustering. In addition, we propose an improved
clustering algorithm referred to as relationship-aware DBSCAN (RADBSCAN).
It can discover topics from a micro-blogging dataset, and the number of
topics depends on the characteristics of the dataset itself. Moreover, to
address the problem of parameter sensitivity, we take the blog forwarding
relationship as a bridge between two independent clusters. Finally, we
validate our approach on a dataset from Sina micro-blog. The results show
that we can successfully detect all the topics and extract keywords for
each topic.
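
As a rough illustration of the embedding step described above, the sketch below pools word vectors with a weighted power mean, where the weights come from a simple dot-product attention scorer. The scorer, the choice of powers, and all names here are illustrative assumptions; the abstract does not specify the exact attention mechanism or the set of power values.

```python
# Minimal sketch (not the authors' code): attention-weighted power-mean
# pooling of word vectors into a sentence embedding.
import numpy as np

def attention_weights(word_vectors, query):
    # Dot-product attention against a (hypothetical) learned query vector,
    # normalized with softmax so the weights sum to 1.
    scores = word_vectors @ query
    scores = scores - scores.max()          # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

def weighted_power_mean(word_vectors, weights, p):
    # Weighted power mean of order p, applied dimension-wise.
    # p = 1 is the weighted average; large |p| approaches max/min pooling.
    return np.sum(weights[:, None] * np.power(word_vectors, p), axis=0) ** (1.0 / p)

def sentence_embedding(word_vectors, query, powers=(1, 2, 3)):
    # Concatenate several weighted power means; the set of powers is an
    # assumption for illustration, not taken from the paper.
    w = attention_weights(word_vectors, query)
    return np.concatenate([weighted_power_mean(word_vectors, w, p) for p in powers])

# Toy usage: 5 words, 50-dimensional vectors (kept positive so fractional
# powers stay real).
rng = np.random.default_rng(0)
vecs = np.abs(rng.normal(size=(5, 50)))
emb = sentence_embedding(vecs, query=rng.normal(size=50))
print(emb.shape)                            # (150,) = 50 dims x 3 powers
```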
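The abstract describes RADBSCAN only at a high level: independent density-based clusters are bridged when their blogs are connected by a forwarding relationship. A minimal sketch of that idea, assuming a plain DBSCAN pass followed by a union-find merge over forwarding edges (the exact merge rule is not given in the abstract), could look like this:

```python
# Minimal sketch in the spirit of RADBSCAN, not the authors' algorithm:
# cluster blog embeddings with DBSCAN, then merge any two clusters that are
# linked by at least one forwarding edge (an illustrative merge rule).
import numpy as np
from sklearn.cluster import DBSCAN

def radbscan_like(embeddings, forwarding_edges, eps=0.5, min_samples=5):
    """embeddings: (n, d) array of blog embeddings.
    forwarding_edges: iterable of (i, j) index pairs, blog i forwards blog j."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(embeddings)

    # Union-find over cluster ids so bridged clusters collapse into one topic.
    parent = {c: c for c in set(labels) if c != -1}
    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for i, j in forwarding_edges:
        ci, cj = labels[i], labels[j]
        if ci == -1 or cj == -1:
            continue                 # noise points do not bridge clusters
        ri, rj = find(ci), find(cj)
        if ri != rj:
            parent[ri] = rj          # forwarding relation bridges the clusters

    return np.array([find(c) if c != -1 else -1 for c in labels])

# Toy usage: two well-separated blobs that one forwarding edge pulls into a
# single topic label.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.05, (20, 8)), rng.normal(3, 0.05, (20, 8))])
print(set(radbscan_like(X, forwarding_edges=[(0, 25)])))
```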
Related papers
- CAST: Corpus-Aware Self-similarity Enhanced Topic modelling [16.562349140796115]
We introduce CAST: Corpus-Aware Self-similarity Enhanced Topic modelling, a novel topic modelling method.
We find self-similarity to be an effective metric to prevent functional words from acting as candidate topic words.
Our approach significantly enhances the coherence and diversity of generated topics, as well as the topic model's ability to handle noisy data.
arXiv Detail & Related papers (2024-10-19T15:27:11Z)
- Explaining Datasets in Words: Statistical Models with Natural Language Parameters [66.69456696878842]
We introduce a family of statistical models -- including clustering, time series, and classification models -- parameterized by natural language predicates.
We apply our framework to a wide range of problems: taxonomizing user chat dialogues, characterizing how they evolve across time, finding categories where one language model is better than the other.
arXiv Detail & Related papers (2024-09-13T01:40:20Z)
- Integrating Large Language Models with Graph-based Reasoning for Conversational Question Answering [58.17090503446995]
We focus on a conversational question answering task which combines the challenges of understanding questions in context and reasoning over evidence gathered from heterogeneous sources like text, knowledge graphs, tables, and infoboxes.
Our method utilizes a graph structured representation to aggregate information about a question and its context.
arXiv Detail & Related papers (2024-06-14T13:28:03Z)
- A Process for Topic Modelling Via Word Embeddings [0.0]
This work combines algorithms based on word embeddings, dimensionality reduction, and clustering.
The objective is to obtain topics from a set of unclassified texts.
arXiv Detail & Related papers (2023-10-06T15:10:35Z)
- Topics in the Haystack: Extracting and Evaluating Topics beyond Coherence [0.0]
We propose a method that incorporates a deeper understanding of both sentence and document themes.
This allows our model to detect latent topics that may include uncommon words or neologisms.
We present correlation coefficients with human identification of intruder words and achieve near-human level results at the word-intrusion task.
arXiv Detail & Related papers (2023-03-30T12:24:25Z)
- Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on the Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as, or better than, traditional approaches to problems arising in short text.
arXiv Detail & Related papers (2021-06-15T20:55:55Z)
- ConvoSumm: Conversation Summarization Benchmark and Improved Abstractive Summarization with Argument Mining [61.82562838486632]
We crowdsource four new datasets on diverse online conversation forms of news comments, discussion forums, community question answering forums, and email threads.
We benchmark state-of-the-art models on our datasets and analyze characteristics associated with the data.
arXiv Detail & Related papers (2021-06-01T22:17:13Z)
- Unsupervised Summarization for Chat Logs with Topic-Oriented Ranking and Context-Aware Auto-Encoders [59.038157066874255]
We propose a novel framework called RankAE to perform chat summarization without employing manually labeled data.
RankAE consists of a topic-oriented ranking strategy that selects topic utterances according to centrality and diversity simultaneously.
A denoising auto-encoder is designed to generate succinct but context-informative summaries based on the selected utterances.
arXiv Detail & Related papers (2020-12-14T07:31:17Z)
- BATS: A Spectral Biclustering Approach to Single Document Topic Modeling and Segmentation [17.003488045214972]
Existing topic modeling and text segmentation methodologies generally require large datasets for training, limiting their capabilities when only small collections of text are available.
In developing a methodology to handle single documents, we face two major challenges.
First is sparse information: with access to only one document, we cannot train traditional topic models or deep learning algorithms.
Second is significant noise: a considerable portion of words in any single document will produce only noise and not help discern topics or segments.
arXiv Detail & Related papers (2020-08-05T16:34:33Z)
- Improving unsupervised neural aspect extraction for online discussions using out-of-domain classification [11.746330029375745]
We introduce a simple approach based on sentence filtering to improve topical aspects learned from newsgroups-based content.
The positive effect of sentence filtering on topic coherence is demonstrated in comparison to aspect extraction models trained on unfiltered texts.
arXiv Detail & Related papers (2020-06-17T10:34:16Z)
- The Paradigm Discovery Problem [121.79963594279893]
We formalize the paradigm discovery problem and develop metrics for judging systems.
We report empirical results on five diverse languages.
Our code and data are available for public use.
arXiv Detail & Related papers (2020-05-04T16:38:54Z)