Comparative Approaches to Sentiment Analysis Using Datasets in Major European and Arabic Languages
- URL: http://arxiv.org/abs/2501.12540v1
- Date: Tue, 21 Jan 2025 23:11:16 GMT
- Title: Comparative Approaches to Sentiment Analysis Using Datasets in Major European and Arabic Languages
- Authors: Mikhail Krasitskii, Olga Kolesnikova, Liliana Chanona Hernandez, Grigori Sidorov, Alexander Gelbukh
- Abstract summary: This study explores transformer-based models such as BERT, mBERT, and XLM-R for multilingual sentiment analysis. Key contributions include the identification of XLM-R's superior adaptability to morphologically complex languages, achieving accuracy levels above 88%.
- Score: 42.90274643419224
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This study explores transformer-based models such as BERT, mBERT, and XLM-R for multilingual sentiment analysis across diverse linguistic structures. Key contributions include the identification of XLM-R's superior adaptability to morphologically complex languages, achieving accuracy levels above 88%. The work highlights fine-tuning strategies and emphasizes their significance for improving sentiment classification in underrepresented languages.
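To make the described setup concrete, here is a minimal sketch of fine-tuning XLM-R for multilingual sentiment classification with the Hugging Face Transformers API; the CSV paths, three-class label scheme, and hyperparameters are illustrative assumptions rather than the authors' exact configuration.
```python
# Minimal fine-tuning sketch (illustrative assumptions, not the paper's exact setup).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "xlm-roberta-base"  # XLM-R, reported as the most adaptable model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=3)

# Placeholder CSV files with "text" and "label" columns (0=neg, 1=neu, 2=pos).
data = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

encoded = data.map(tokenize, batched=True)

args = TrainingArguments(output_dir="xlmr-sentiment",
                         per_device_train_batch_size=16,
                         num_train_epochs=3,
                         learning_rate=2e-5)   # a typical fine-tuning range

trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["test"])
trainer.train()
print(trainer.evaluate())
```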
Related papers
- Multilingual Sentiment Analysis of Summarized Texts: A Cross-Language Study of Text Shortening Effects [42.90274643419224]
Summarization significantly impacts sentiment analysis across languages with diverse morphologies.
This study examines extractive and abstractive summarization effects on sentiment classification in English, German, French, Spanish, Italian, Finnish, Hungarian, and Arabic.
arXiv Detail & Related papers (2025-03-31T22:16:04Z) - Balanced Multi-Factor In-Context Learning for Multilingual Large Language Models [53.38288894305388]
Multilingual large language models (MLLMs) can use in-context learning (ICL) to achieve high performance by leveraging cross-lingual knowledge transfer without parameter updates.
Three key factors influence multilingual ICL: (1) semantic similarity, (2) linguistic alignment, and (3) language-specific performance.
We propose balanced multi-factor ICL (BMF-ICL), a method that quantifies and optimally balances these factors for improved example selection.
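For intuition, the sketch below scores candidate in-context examples by a weighted combination of the three factors; the weights, alignment table, and per-language performance values are hypothetical placeholders, not the BMF-ICL method itself.
```python
# Hypothetical multi-factor example selection sketch (not the authors' implementation).
import numpy as np

def select_examples(query_emb, cand_embs, cand_langs, query_lang,
                    lang_align, lang_perf, weights=(0.5, 0.3, 0.2), k=4):
    """Return indices of the k highest-scoring in-context examples."""
    # (1) semantic similarity: cosine between query and candidate embeddings
    sem = cand_embs @ query_emb / (
        np.linalg.norm(cand_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)
    # (2) linguistic alignment between each candidate's language and the query's
    align = np.array([lang_align[(lang, query_lang)] for lang in cand_langs])
    # (3) language-specific performance of the model on each candidate's language
    perf = np.array([lang_perf[lang] for lang in cand_langs])
    score = weights[0] * sem + weights[1] * align + weights[2] * perf
    return np.argsort(-score)[:k]
```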
arXiv Detail & Related papers (2025-02-17T06:56:33Z) - Beyond Data Quantity: Key Factors Driving Performance in Multilingual Language Models [1.5703073293718952]
Token similarity and country similarity emerge as pivotal factors, alongside pre-training data and model size, in enhancing model performance.
These insights offer valuable guidance for developing more equitable and effective multilingual language models.
arXiv Detail & Related papers (2024-12-17T03:05:26Z) - Qtok: A Comprehensive Framework for Evaluating Multilingual Tokenizer Quality in Large Language Models [0.0]
The quality of tokenization can significantly impact a model's ability to handle diverse languages effectively.
We introduce Qtok, a tool designed to assess tokenizer quality with a specific emphasis on their performance in multilingual contexts.
Qtok applies these metrics to evaluate 13 distinct tokenizers from 58 publicly available models, analyzing their output across different linguistic contexts.
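As a rough illustration of such metrics (not the Qtok tool itself), the sketch below measures tokenizer fertility, the number of subword tokens per whitespace-separated word, for two multilingual tokenizers across a few languages; the model names and sample sentences are assumptions.
```python
# Fertility comparison sketch: fewer tokens per word generally indicates
# better vocabulary coverage for that language.
from transformers import AutoTokenizer

samples = {
    "en": "The weather is pleasant today.",
    "de": "Das Wetter ist heute angenehm.",
    "fi": "Sää on tänään miellyttävä.",
    "ar": "الطقس لطيف اليوم.",
}

for name in ["bert-base-multilingual-cased", "xlm-roberta-base"]:
    tok = AutoTokenizer.from_pretrained(name)
    for lang, text in samples.items():
        fertility = len(tok.tokenize(text)) / len(text.split())
        print(f"{name:32s} {lang}: {fertility:.2f} tokens/word")
```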
arXiv Detail & Related papers (2024-10-16T19:34:34Z) - LLM-based Translation Inference with Iterative Bilingual Understanding [52.46978502902928]
We propose a novel Iterative Bilingual Understanding Translation (IBUT) method based on the cross-lingual capabilities of large language models (LLMs).
The cross-lingual capability of LLMs enables the generation of contextual understanding for both the source and target languages separately.
The proposed IBUT outperforms several strong comparison methods.
arXiv Detail & Related papers (2024-10-16T13:21:46Z) - Exploring Multilingual Probing in Large Language Models: A Cross-Language Analysis [19.37853222555255]
Probing techniques for large language models (LLMs) have primarily focused on English, overlooking the vast majority of the world's languages.
We conduct experiments on several open-source LLM models, analyzing probing accuracy, trends across layers, and similarities between probing vectors for multiple languages.
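A minimal sketch of layer-wise probing in this spirit follows; the model name, mean-pooling choice, and use of training accuracy as the probe signal are simplifying assumptions rather than the paper's protocol.
```python
# Layer-wise linear probing sketch (illustrative, not the paper's exact setup).
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

def layer_probe_accuracies(texts, labels, model_name="xlm-roberta-base"):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True).eval()
    with torch.no_grad():
        enc = tok(texts, padding=True, truncation=True, return_tensors="pt")
        hidden = model(**enc).hidden_states          # (n_layers + 1) x [batch, seq, dim]
    accuracies = []
    for layer_states in hidden:
        feats = layer_states.mean(dim=1).numpy()     # mean-pool over tokens
        probe = LogisticRegression(max_iter=1000).fit(feats, labels)
        accuracies.append(probe.score(feats, labels))
    return accuracies                                # one probe accuracy per layer
```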
arXiv Detail & Related papers (2024-09-22T14:14:05Z) - Comparative Analysis of Multilingual Text Classification & Identification through Deep Learning and Embedding Visualization [0.0]
The study employs LangDetect, LangId, FastText, and Sentence Transformer on a dataset encompassing 17 languages.
The FastText multi-layer perceptron model achieved remarkable accuracy, precision, recall, and F1 score, outperforming the Sentence Transformer model.
arXiv Detail & Related papers (2023-12-06T12:03:27Z) - Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations [59.056367787688146]
This paper pioneers the exploration and training of powerful Multilingual Math Reasoning (xMR) LLMs.
By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages.
arXiv Detail & Related papers (2023-10-31T08:09:20Z) - GradSim: Gradient-Based Language Grouping for Effective Multilingual Training [13.730907708289331]
We propose GradSim, a language grouping method based on gradient similarity.
Our experiments on three diverse multilingual benchmark datasets show that GradSim leads to the largest performance gains.
Besides linguistic features, the topics of the datasets play an important role in language grouping.
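The sketch below illustrates the underlying idea, clustering languages by the cosine similarity of per-language gradient vectors; the clustering method and number of groups are assumptions, and this is not the authors' implementation.
```python
# Gradient-similarity language grouping sketch (illustrative assumptions).
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def group_languages(lang_grads: dict, n_groups: int = 3):
    """lang_grads maps a language code to a flattened gradient vector."""
    langs = list(lang_grads)
    G = np.stack([lang_grads[l] / np.linalg.norm(lang_grads[l]) for l in langs])
    cos_sim = G @ G.T                      # pairwise cosine similarity of gradients
    dist = 1.0 - cos_sim                   # convert similarity to a distance
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(Z, t=n_groups, criterion="maxclust")
    return dict(zip(langs, labels))        # language -> group id
```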
arXiv Detail & Related papers (2023-10-23T18:13:37Z) - Extrapolating Large Language Models to Non-English by Aligning Languages [109.09051737966178]
Existing large language models show disparate capability across different languages.
In this paper, we empower pre-trained LLMs on non-English languages by building semantic alignment across languages.
arXiv Detail & Related papers (2023-08-09T13:32:06Z) - A Massively Multilingual Analysis of Cross-linguality in Shared
Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
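As a concrete reference point, the sketch below computes a simple bitext-retrieval accuracy from paired sentence embeddings, a generic version of this kind of alignment measure rather than the paper's exact protocol.
```python
# Bitext retrieval accuracy sketch: how often the nearest target embedding
# is the correct translation of the source sentence.
import numpy as np

def retrieval_accuracy(src_embs: np.ndarray, tgt_embs: np.ndarray) -> float:
    """src_embs[i] and tgt_embs[i] embed the same sentence in two languages."""
    src = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    tgt = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    sim = src @ tgt.T                      # cosine similarity matrix
    nearest = sim.argmax(axis=1)           # nearest target for each source
    return float((nearest == np.arange(len(src))).mean())
```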
arXiv Detail & Related papers (2021-09-13T21:05:37Z) - Evaluating Transformer-Based Multilingual Text Classification [55.53547556060537]
We argue that NLP tools perform unequally across languages with different syntactic and morphological structures.
We calculate word order and morphological similarity indices to aid our empirical study.
arXiv Detail & Related papers (2020-04-29T03:34:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.