Multilingual transformer and BERTopic for short text topic modeling: The case of Serbian
- URL: http://arxiv.org/abs/2402.03067v1
- Date: Mon, 5 Feb 2024 14:59:29 GMT
- Title: Multilingual transformer and BERTopic for short text topic modeling: The case of Serbian
- Authors: Darija Medvecki, Bojana Bašaragin, Adela Ljajić, Nikola Milošević
- Abstract summary: This paper presents the results of the first application of BERTopic, a state-of-the-art topic modeling technique, to short text written in a morphologically rich language.
We applied BERTopic with three multilingual embedding models on two levels of text preprocessing (partial and full) to evaluate its performance on partially preprocessed short text in Serbian.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents the results of the first application of BERTopic, a state-of-the-art topic modeling technique, to short text written in a morphologically rich language. We applied BERTopic with three multilingual embedding models on two levels of text preprocessing (partial and full) to evaluate its performance on partially preprocessed short text in Serbian. We also compared it to LDA and NMF on fully preprocessed text. The experiments were conducted on a dataset of tweets expressing hesitancy toward COVID-19 vaccination. Our results show that with adequate parameter setting, BERTopic can yield informative topics even when applied to partially preprocessed short text. When the same parameters are applied in both preprocessing scenarios, the performance drop on partially preprocessed text is minimal. Compared to LDA and NMF, judging by the keywords, BERTopic offers more informative topics and gives novel insights when the number of topics is not limited. The findings of this paper can be significant for researchers working with other morphologically rich low-resource languages and short text.
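As a rough illustration of the setup described in the abstract, here is a minimal sketch of fitting BERTopic with one multilingual sentence-embedding model. The model named below is an illustrative choice (the abstract does not name the paper's three models or its tuned parameters), and a public English corpus stands in for the Serbian tweet dataset, which is not distributed here.

```python
# Minimal sketch, not the paper's exact configuration: BERTopic with a
# multilingual sentence-embedding model. A public English corpus stands in
# for the Serbian tweet dataset.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes")).data[:2000]

embedding_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
topic_model = BERTopic(embedding_model=embedding_model)
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info())  # one row per discovered topic
print(topic_model.get_topic(0))      # top keywords of topic 0
```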
Related papers
- BERTopic for Topic Modeling of Hindi Short Texts: A Comparative Study [1.1650821883155187]
This study investigates the performance of BERTopic in modeling Hindi short texts.
Using contextual embeddings, BERTopic can capture semantic relationships in data, making it potentially more effective than traditional models.
arXiv Detail & Related papers (2025-01-07T14:53:35Z)
- Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation.
We introduce novel methodologies and datasets to overcome these challenges.
We propose MhBART, an encoder-decoder model designed to emulate human writing style.
We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z) - Unveiling the Potential of BERTopic for Multilingual Fake News Analysis -- Use Case: Covid-19 [0.562479170374811]
BERTopic consists of four stages: sentence embedding, dimensionality reduction, clustering, and topic extraction (a minimal sketch of this pipeline follows this entry).
This paper aims to analyse the technical application of BERTopic in practice.
It also aims to analyse the results of topic modeling on real-world data as a use case.
arXiv Detail & Related papers (2024-07-11T11:47:43Z)
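Those four stages correspond to explicit, swappable components in the BERTopic library. The sketch below makes them visible; all model choices and parameter values are illustrative assumptions, not settings reported by either paper.

```python
# Illustrative sketch of BERTopic's four-stage pipeline with each component
# made explicit. All parameter values are placeholders, not tuned settings.
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

embedding_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # 1. sentence embedding
umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine")              # 2. dimensionality reduction
hdbscan_model = HDBSCAN(min_cluster_size=15, metric="euclidean")                # 3. clustering
vectorizer_model = CountVectorizer(ngram_range=(1, 2))                          # 4a. bag-of-words
ctfidf_model = ClassTfidfTransformer()                                          # 4b. class-based TF-IDF topic extraction

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
)
```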
- Let the Pretrained Language Models "Imagine" for Short Texts Topic Modeling [29.87929724277381]
In short texts, co-occurrence information is minimal, which results in feature sparsity in document representation.
Existing topic models (probabilistic or neural) mostly fail to mine patterns from them to generate coherent topics.
We extend short texts into longer sequences using existing pre-trained language models (PLMs); a minimal sketch of this idea follows this entry.
arXiv Detail & Related papers (2023-10-24T00:23:30Z)
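As a rough illustration of that idea (not the paper's actual procedure), one can append PLM-generated text to each sparse document before topic modeling; the GPT-2 model and bare continuation prompt below are assumptions.

```python
# Illustrative only: enrich short texts with a PLM-generated continuation
# before topic modeling. Model choice and prompting are assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def extend_short_text(text: str, max_new_tokens: int = 40) -> str:
    """Return the short text plus a generated continuation."""
    out = generator(text, max_new_tokens=max_new_tokens, do_sample=True)
    return out[0]["generated_text"]  # includes the original text as prefix

docs = ["vaccine side effects worry me"]              # placeholder short documents
extended_docs = [extend_short_text(d) for d in docs]  # feed these to a topic model
```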
- MetricPrompt: Prompting Model as a Relevance Metric for Few-shot Text Classification [65.51149771074944]
MetricPrompt eases verbalizer design by reformulating the few-shot text classification task as a text-pair relevance estimation task (a toy illustration of this reformulation follows this entry).
We conduct experiments on three widely used text classification datasets across four few-shot settings.
Results show that MetricPrompt outperforms manual verbalizer and other automatic verbalizer design methods across all few-shot settings.
arXiv Detail & Related papers (2023-06-15T06:51:35Z)
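A toy illustration of the reformulation only: classify a query by its estimated relevance to labeled support examples. A generic sentence-embedding similarity stands in here for the prompting-model relevance metric that MetricPrompt actually uses.

```python
# Toy stand-in for the reformulation: score (query, support) text pairs for
# relevance and predict the label of the most relevant support example.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
support = [("The match ended 2-0.", "sports"), ("Stocks fell sharply.", "finance")]
query = "The striker scored twice in the final."

q_emb = model.encode(query, convert_to_tensor=True)
scores: dict[str, float] = {}
for text, label in support:
    rel = util.cos_sim(q_emb, model.encode(text, convert_to_tensor=True)).item()
    scores[label] = max(rel, scores.get(label, float("-inf")))

print(max(scores, key=scores.get))  # -> "sports"
```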
- Prompting Large Language Model for Machine Translation: A Case Study [87.88120385000666]
We offer a systematic study on prompting strategies for machine translation.
We examine factors for prompt template and demonstration example selection.
We explore the use of monolingual data and the feasibility of cross-lingual, cross-domain, and sentence-to-document transfer learning (an illustrative few-shot prompt follows this entry).
arXiv Detail & Related papers (2023-01-17T18:32:06Z)
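In that spirit, a hypothetical few-shot translation prompt is sketched below; the template wording and demonstration pairs are invented for illustration and are not the configurations the paper studies.

```python
# Hypothetical few-shot MT prompt; demonstrations and wording are invented.
demos = [("Dobar dan.", "Good day."), ("Hvala lepo.", "Thank you very much.")]
source = "Vakcina je bezbedna."

prompt = "Translate Serbian to English.\n"
for src, tgt in demos:                      # demonstration example selection
    prompt += f"Serbian: {src}\nEnglish: {tgt}\n"
prompt += f"Serbian: {source}\nEnglish:"    # the sentence to translate
print(prompt)
```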
- ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech [96.0009517132463]
We introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses prosody attributes into a latent prosody vector (LPV).
We then introduce an LPV predictor, which predicts the LPV given the word sequence, and fine-tune it on a high-quality TTS dataset.
Experimental results show that ProsoSpeech can generate speech with richer prosody compared with baseline methods.
arXiv Detail & Related papers (2022-02-16T01:42:32Z)
- Shaking Syntactic Trees on the Sesame Street: Multilingual Probing with Controllable Perturbations [2.041108289731398]
Recent research has adopted a new experimental field centered around the concept of text perturbations.
It has revealed that shuffled word order has little to no impact on the downstream performance of Transformer-based language models (a toy perturbation function follows this entry).
arXiv Detail & Related papers (2021-09-28T20:15:29Z)
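A toy version of the word-order perturbation used in such probing studies; this single function is an illustrative assumption, not the paper's full perturbation suite.

```python
# Toy word-order perturbation; real probing suites control the perturbation
# granularity (n-grams, within-clause, etc.) far more carefully.
import random

def shuffle_words(sentence: str, seed: int = 0) -> str:
    """Return the sentence with its word order randomly shuffled."""
    words = sentence.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

print(shuffle_words("the cat sat on the mat"))
```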
- Fine-tuning GPT-3 for Russian Text Summarization [77.34726150561087]
This paper showcases ruGPT3's ability to summarize texts, fine-tuning it on corpora of Russian news with their corresponding human-generated summaries.
We evaluate the resulting texts with a set of metrics, showing that our solution can surpass the state-of-the-art model's performance without additional changes in architecture or loss function.
arXiv Detail & Related papers (2021-08-07T19:01:40Z)
- Context Reinforced Neural Topic Modeling over Short Texts [15.487822291146689]
We propose a Context Reinforced Neural Topic Model (CRNTM).
CRNTM infers the topic for each word in a narrow range by assuming that each short text covers only a few salient topics.
Experiments on two benchmark datasets validate the effectiveness of the proposed model on both topic discovery and text classification.
arXiv Detail & Related papers (2020-08-11T06:41:53Z)
- Enabling Language Models to Fill in the Blanks [81.59381915581892]
We present a simple approach for text infilling, the task of predicting missing spans of text at any position in a document.
We train (or fine-tune) off-the-shelf language models on sequences containing the concatenation of artificially masked text and the text that was masked.
We show that this approach, which we call infilling by language modeling, can enable LMs to infill entire sentences effectively in three different domains: short stories, scientific abstracts, and lyrics (a toy example of constructing such training sequences follows this entry).
arXiv Detail & Related papers (2020-05-11T18:00:03Z)
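The toy function below sketches how such a training sequence can be assembled; the [blank]/[sep]/[answer] token names follow the paper's general scheme, but the exact formatting is an assumption.

```python
# Toy construction of an infilling training sequence: masked text, then the
# masked-out answer. Token names are illustrative, not the exact vocabulary.
def make_infilling_example(words: list[str], start: int, end: int) -> str:
    """Mask words[start:end] and append the hidden span as the target."""
    masked = words[:start] + ["[blank]"] + words[end:]
    answer = " ".join(words[start:end])
    return " ".join(masked) + " [sep] " + answer + " [answer]"

sentence = "She ate leftover pasta for breakfast".split()
print(make_infilling_example(sentence, 2, 4))
# -> She ate [blank] for breakfast [sep] leftover pasta [answer]
```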
- Towards Making the Most of Context in Neural Machine Translation [112.9845226123306]
We argue that previous research did not make clear use of the global context.
We propose a new document-level NMT framework that deliberately models the local context of each sentence.
arXiv Detail & Related papers (2020-02-19T03:30:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.