Experiments on Generalizability of BERTopic on Multi-Domain Short Text
- URL: http://arxiv.org/abs/2212.08459v1
- Date: Fri, 16 Dec 2022 13:07:39 GMT
- Title: Experiments on Generalizability of BERTopic on Multi-Domain Short Text
- Authors: Muriël de Groot, Mohammad Aliannejadi, Marcel R. Haas
- Abstract summary: We explore how the state-of-the-art BERTopic algorithm performs on short multi-domain text.
We analyze the performance of the HDBSCAN clustering algorithm utilized by BERTopic.
When we replace HDBSCAN with k-Means, we achieve similar performance, but without outliers.
- Score: 2.352645870795664
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Topic modeling is widely used for analytically evaluating large collections
of textual data. One of the most popular topic modeling techniques is Latent
Dirichlet Allocation (LDA), which is flexible and adaptive, but not optimal
for, e.g.,
short texts from various domains. We explore how the state-of-the-art BERTopic
algorithm performs on short multi-domain text and find that it generalizes
better than LDA in terms of topic coherence and diversity. We further analyze
the performance of the HDBSCAN clustering algorithm utilized by BERTopic and
find that it classifies a majority of the documents as outliers. This crucial
yet overlooked problem excludes too many documents from further analysis. When we
replace HDBSCAN with k-Means, we achieve similar performance, but without
outliers.
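To make the clustering swap concrete, here is a minimal sketch using BERTopic's pluggable clustering interface; the 20 Newsgroups sample, the default embedding settings, and the choice of 50 clusters are illustrative assumptions, not the authors' experimental setup.

```python
# Minimal sketch: replacing BERTopic's default HDBSCAN clustering with k-Means.
# Assumes the bertopic and scikit-learn packages; the dataset and k=50 are
# arbitrary choices for illustration, not the configuration used in the paper.
from bertopic import BERTopic
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data[:2000]

# Default BERTopic: HDBSCAN may assign many documents to the outlier topic (-1).
hdbscan_topics, _ = BERTopic().fit_transform(docs)
print("HDBSCAN outliers:", hdbscan_topics.count(-1))

# Swapping in k-Means forces every document into a topic, so no outliers remain.
kmeans_topic_model = BERTopic(hdbscan_model=KMeans(n_clusters=50, random_state=42))
kmeans_topics, _ = kmeans_topic_model.fit_transform(docs)
print("k-Means outliers:", kmeans_topics.count(-1))  # expected to be 0
```

Coherence and diversity can then be compared between the two variants with whatever evaluation metrics one prefers; the point of the sketch is only that the clustering component is interchangeable.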
Related papers
- Mitigating Boundary Ambiguity and Inherent Bias for Text Classification in the Era of Large Language Models [24.085614720512744]
This study shows that large language models (LLMs) are vulnerable to changes in the number and arrangement of options in text classification.
The key bottleneck arises from ambiguous decision boundaries and inherent biases towards specific tokens and positions.
Our approach is grounded in the empirical observation that pairwise comparisons can effectively alleviate boundary ambiguity and inherent bias.
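As a loose, non-authoritative illustration of that observation, the sketch below classifies a text by running a round-robin of pairwise label comparisons instead of presenting all options at once; llm_prefers is a hypothetical stub standing in for an LLM call and does not reproduce the paper's prompting setup.

```python
# Hedged sketch: label selection via pairwise comparisons rather than one multi-option prompt.
from itertools import combinations

def llm_prefers(text: str, label_a: str, label_b: str) -> str:
    """Hypothetical stub: return whichever of the two labels an LLM would judge more fitting."""
    # A real implementation would prompt an LLM with only these two options;
    # the placeholder heuristic below just keeps the sketch runnable.
    return label_a if len(label_a) <= len(label_b) else label_b

def classify_pairwise(text: str, labels: list[str]) -> str:
    # Round-robin tournament: each label earns a point per pairwise win,
    # sidestepping sensitivity to the number and ordering of presented options.
    wins = {label: 0 for label in labels}
    for a, b in combinations(labels, 2):
        wins[llm_prefers(text, a, b)] += 1
    return max(wins, key=wins.get)

print(classify_pairwise("The plot twists kept me hooked.", ["sports", "movie review", "finance"]))
```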
arXiv Detail & Related papers (2024-06-11T06:53:19Z) - FaiMA: Feature-aware In-context Learning for Multi-domain Aspect-based Sentiment Analysis [1.606149016749251]
Multi-domain aspect-based sentiment analysis (ABSA) seeks to capture fine-grained sentiment across diverse domains.
We propose a novel framework, Feature-aware In-context Learning for Multi-domain ABSA (FaiMA).
FaiMA is a feature-aware mechanism that facilitates adaptive learning in multi-domain ABSA tasks.
arXiv Detail & Related papers (2024-03-02T02:00:51Z) - Domain-Expanded ASTE: Rethinking Generalization in Aspect Sentiment Triplet Extraction [67.54420015049732]
Aspect Sentiment Triplet Extraction (ASTE) is a challenging task in sentiment analysis, aiming to provide fine-grained insights into human sentiments.
Existing benchmarks are limited to two domains and do not evaluate model performance on unseen domains.
We introduce a domain-expanded benchmark by annotating samples from diverse domains, enabling evaluation of models in both in-domain and out-of-domain settings.
arXiv Detail & Related papers (2023-05-23T18:01:49Z) - Using Set Covering to Generate Databases for Holistic Steganalysis [2.089615335919449]
We explore a grid of processing pipelines to study the origins of Cover Source Mismatch (CSM)
A set-covering greedy algorithm is used to select representative pipelines minimizing the maximum regret between the representative and the pipelines within the set.
Our analysis also shows that parameters such as denoising, sharpening, and downsampling are very important to foster diversity.
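To illustrate the selection step, here is a hedged sketch of a greedy, farthest-first style routine over a pipeline-to-pipeline regret matrix; the random matrix and the stopping threshold are toy assumptions, and this is not necessarily the authors' exact set-covering algorithm.

```python
# Hedged sketch: greedily pick representative pipelines until every pipeline is
# within a regret budget of some representative. Toy regret matrix, not CSM data.
import numpy as np

rng = np.random.default_rng(0)
n_pipelines = 12
regret = rng.random((n_pipelines, n_pipelines))  # regret[i, j]: cost of covering pipeline j with representative i
np.fill_diagonal(regret, 0.0)                    # a pipeline covers itself perfectly

def greedy_representatives(regret: np.ndarray, max_regret: float) -> list[int]:
    reps: list[int] = []
    # Regret of each pipeline w.r.t. its closest chosen representative (inf while none chosen).
    best = np.full(regret.shape[1], np.inf)
    while best.max() > max_regret:
        worst = int(np.argmax(best))   # the pipeline currently covered worst...
        reps.append(worst)             # ...is promoted to a representative
        best = np.minimum(best, regret[worst])
    return reps

print(greedy_representatives(regret, max_regret=0.3))
```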
arXiv Detail & Related papers (2022-11-07T10:53:02Z) - Entity Disambiguation with Entity Definitions [50.01142092276296]
Local models have recently attained astounding performance in Entity Disambiguation (ED).
Previous works limited their studies to using only the Wikipedia title of each candidate as its textual representation.
In this paper, we address this limitation and investigate to what extent more expressive textual representations can mitigate it.
We report a new state of the art on 2 out of 6 benchmarks we consider and strongly improve the generalization capability over unseen patterns.
arXiv Detail & Related papers (2022-10-11T17:46:28Z) - A Simple Information-Based Approach to Unsupervised Domain-Adaptive Aspect-Based Sentiment Analysis [58.124424775536326]
We propose a simple but effective technique based on mutual information to extract aspect terms.
Experiment results show that our proposed method outperforms the state-of-the-art methods for cross-domain ABSA by 4.32% Micro-F1.
arXiv Detail & Related papers (2022-01-29T10:18:07Z) - Community-Detection via Hashtag-Graphs for Semi-Supervised NMF Topic Models [0.0]
This paper outlines a novel approach on how to integrate topic structures of hashtag graphs into the estimation of topic models.
Applying this approach to recently streamed Twitter data shows that the procedure leads to more intuitive and humanly interpretable topics.
arXiv Detail & Related papers (2021-11-17T12:52:16Z) - Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as, or better than, traditional approaches to problems arising in short text.
arXiv Detail & Related papers (2021-06-15T20:55:55Z) - Text Summarization with Latent Queries [60.468323530248945]
We introduce LaQSum, the first unified text summarization system that learns Latent Queries from documents for abstractive summarization with any existing query forms.
Under a deep generative framework, our system jointly optimizes a latent query model and a conditional language model, allowing users to plug-and-play queries of any type at test time.
Our system robustly outperforms strong comparison systems across summarization benchmarks with different query types, document settings, and target domains.
arXiv Detail & Related papers (2021-05-31T21:14:58Z) - WikiAsp: A Dataset for Multi-domain Aspect-based Summarization [69.13865812754058]
We propose WikiAsp, a large-scale dataset for multi-domain aspect-based summarization.
Specifically, we build the dataset using Wikipedia articles from 20 different domains, using the section titles and boundaries of each article as a proxy for aspect annotation.
Results highlight key challenges that existing summarization models face in this setting, such as proper pronoun handling of quoted sources and consistent explanation of time-sensitive events.
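A minimal sketch of that section-titles-as-aspects construction, assuming a toy article structure rather than the actual WikiAsp data format, might look as follows:

```python
# Hedged sketch: treating Wikipedia section titles as proxy aspect labels.
# The article content and field names below are illustrative assumptions only.
article_sections = [
    ("History", "The company was founded in 1998 and expanded rapidly overseas."),
    ("Products", "Its main products include routers and network switches."),
    ("Reception", "Critics praised the reliability of the hardware."),
]

# Each (section title, section body) pair becomes an aspect-labeled summarization target.
aspect_examples = [{"aspect": title.lower(), "target": body} for title, body in article_sections]
for example in aspect_examples:
    print(example["aspect"], "->", example["target"])
```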
arXiv Detail & Related papers (2020-11-16T10:02:52Z) - MultiGBS: A multi-layer graph approach to biomedical summarization [6.11737116137921]
We propose a domain-specific method that models a document as a multi-layer graph to enable multiple features of the text to be processed at the same time.
The unsupervised method selects sentences from the multi-layer graph based on the MultiRank algorithm and the number of concepts.
The proposed MultiGBS algorithm employs UMLS and extracts the concepts and relationships using different tools such as SemRep, MetaMap, and OGER.
arXiv Detail & Related papers (2020-08-27T04:22:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.