BanglaNLP at BLP-2023 Task 2: Benchmarking different Transformer Models
for Sentiment Analysis of Bangla Social Media Posts
- URL: http://arxiv.org/abs/2310.09238v2
- Date: Wed, 18 Oct 2023 03:51:38 GMT
- Authors: Saumajit Saha and Albert Nanda
- Abstract summary: This paper presents our submission to Task 2 (Sentiment Analysis of Bangla Social Media Posts) of the BLP Workshop.
Our quantitative results show that transfer learning substantially improves model performance in this low-resource language setting.
We obtain a micro-F1 of 67.02% on the test set, and our submission is ranked 21st on the shared-task leaderboard.
- Score: 0.46040036610482665
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Bangla is the 7th most widely spoken language globally, with a staggering 234
million native speakers, primarily from India and Bangladesh. This
morphologically rich language boasts a long literary tradition encompassing
diverse dialects and language-specific challenges. Despite its linguistic
richness and history, Bangla remains categorized as a low-resource language
within the natural language processing (NLP) and speech community. This paper
presents our submission to Task 2 (Sentiment Analysis of Bangla Social Media
Posts) of the BLP Workshop. We experiment with various Transformer-based
architectures to solve this task. Our quantitative results show that transfer
learning substantially improves learning in this low-resource setting: the
best-performing model in our experiments is one that had already been
fine-tuned for sentiment analysis on Twitter data before being further
fine-tuned on the task data. We also perform a detailed error analysis and
identify instances where the ground-truth labels warrant re-examination. We
obtain a micro-F1 of 67.02% on the test set, and our submission is ranked
21st on the shared-task leaderboard.
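For readers who want a concrete picture of the transfer-learning recipe described above, here is a minimal sketch using the Hugging Face transformers library. It assumes a checkpoint already fine-tuned on Twitter sentiment data (cardiffnlp/twitter-xlm-roberta-base-sentiment is used as a plausible stand-in; the abstract does not name the paper's exact model) and hypothetical train.csv/test.csv files with "text" and "label" columns; micro-F1, the shared task's metric, is computed with scikit-learn.

```python
# Sketch of the transfer-learning setup: further fine-tuning a model that
# was already fine-tuned on Twitter sentiment data. Checkpoint name and
# CSV schema are illustrative assumptions, not details from the paper.
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Assumed starting point: an XLM-R model fine-tuned for Twitter sentiment
# (negative/neutral/positive), matching the task's three classes.
CHECKPOINT = "cardiffnlp/twitter-xlm-roberta-base-sentiment"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=3)

# Hypothetical BLP-2023 Task 2 files with "text" and "label" columns.
data = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

data = data.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    # Micro-F1 is the shared task's official metric.
    return {"micro_f1": f1_score(labels, preds, average="micro")}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data["train"],
    eval_dataset=data["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())  # reports micro-F1 on the test split
```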
Related papers
- BongLLaMA: LLaMA for Bangla Language [0.0]
BongLLaMA is an open-source large language model fine-tuned exclusively on large Bangla corpora and instruction-tuning datasets.
We present our methodology, data augmentation techniques, fine-tuning details, and comprehensive benchmarking results showcasing the utility of BongLLaMA on BLP tasks.
arXiv Detail & Related papers (2024-10-28T16:44:02Z)
- The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance.
Experiment results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes.
We analyze representation space, generated response and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z)
- RSM-NLP at BLP-2023 Task 2: Bangla Sentiment Analysis using Weighted and Majority Voted Fine-Tuned Transformers [2.048226951354646]
This paper describes our approach to submissions made at BLP Workshop - Sentiment Analysis of Bangla Social Media Posts.
Our system scored 0.711 on the multiclass classification task, placing 10th among the participants on the shared-task leaderboard.
arXiv Detail & Related papers (2023-10-22T10:55:56Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Baichuan 2: Open Large-scale Language Models [51.56361715162972]
We present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch on 2.6 trillion tokens.
Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval.
arXiv Detail & Related papers (2023-09-19T04:13:22Z)
- On Evaluation of Bangla Word Analogies [0.8658596218544772]
This paper presents a high-quality dataset for evaluating the quality of Bangla word embeddings.
Despite being the 7th most-spoken language in the world, Bangla is a low-resource language, and popular NLP models perform poorly on it.
arXiv Detail & Related papers (2023-04-10T14:27:35Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- A Review of Bangla Natural Language Processing Tasks and the Utility of Transformer Models [2.5768647103950357]
We provide a review of Bangla NLP tasks, resources, and tools available to the research community.
We benchmark datasets collected from various platforms for nine NLP tasks using current state-of-the-art algorithms.
We report our results using both individual and consolidated datasets and provide data for future research.
arXiv Detail & Related papers (2021-07-08T13:49:46Z)
- BanglaBERT: Combating Embedding Barrier for Low-Resource Language Understanding [1.7000879291900044]
We build a Bangla natural language understanding model pre-trained on 18.6 GB of data crawled from top Bangla sites on the internet.
Our model outperforms multilingual baselines and previous state-of-the-art results by 1-6%.
We identify a major shortcoming of multilingual models that hurts performance for low-resource languages whose writing scripts are not shared with any high-resource language.
arXiv Detail & Related papers (2021-01-01T09:28:45Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
- Kungfupanda at SemEval-2020 Task 12: BERT-Based Multi-Task Learning for Offensive Language Detection [55.445023584632175]
We build an offensive language detection system, which combines multi-task learning with BERT-based models.
Our model achieves a 91.51% F1 score on English Sub-task A, comparable to the first-place system.
arXiv Detail & Related papers (2020-04-28T11:27:24Z)