Related papers: Bambara Language Dataset for Sentiment Analysis

Bambara Language Dataset for Sentiment Analysis

URL: http://arxiv.org/abs/2108.02524v1
Date: Thu, 5 Aug 2021 11:07:18 GMT
Title: Bambara Language Dataset for Sentiment Analysis
Authors: Mountaga Diallo and Chayma Fourati and Hatem Haddad
Abstract summary: In Africa, various languages and dialects exist. However, they are still underrepresented and not fully exploited for analytical studies and research purposes. In this paper, we present the first common-crawl-based Bambara dialectal dataset dedicated for Sentiment Analysis.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: For easier communication, posting, or commenting on each others posts, people use their dialects. In Africa, various languages and dialects exist. However, they are still underrepresented and not fully exploited for analytical studies and research purposes. In order to perform approaches like Machine Learning and Deep Learning, datasets are required. One of the African languages is Bambara, used by citizens in different countries. However, no previous work on datasets for this language was performed for Sentiment Analysis. In this paper, we present the first common-crawl-based Bambara dialectal dataset dedicated for Sentiment Analysis, available freely for Natural Language Processing research purposes.

Related papers

One Language, Many Gaps: Evaluating Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks [68.33068005789116]
We present the first study aimed at objectively assessing the fairness and robustness of Large Language Models (LLMs) in handling dialects in canonical reasoning tasks. We hire AAVE speakers, including experts with computer science backgrounds, to rewrite seven popular benchmarks, such as HumanEval and GSM8K. Our findings reveal that textbfalmost all of these widely used models show significant brittleness and unfairness to queries in AAVE.
arXiv Detail & Related papers (2024-10-14T18:44:23Z)
Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorub'a is an African language with roughly 47 million speakers. Recent efforts to develop NLP technologies for African languages have focused on their standard dialects. We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z)
The First Swahili Language Scene Text Detection and Recognition Dataset [55.83178123785643]
There is a significant gap in low-resource languages, especially the Swahili Language. Swahili is widely spoken in East African countries but is still an under-explored language in scene text recognition. We propose a comprehensive dataset of Swahili scene text images and evaluate the dataset on different scene text detection and recognition models.
arXiv Detail & Related papers (2024-05-19T03:55:02Z)
Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for Cross-lingual Text-to-SQL Semantic Parsing [70.40401197026925]
In-context learning using large language models has recently shown surprising results for semantic parsing tasks. This work introduces the XRICL framework, which learns to retrieve relevant English exemplars for a given query. We also include global translation exemplars for a target language to facilitate the translation process for large language models.
arXiv Detail & Related papers (2022-10-25T01:33:49Z)
Urdu Speech and Text Based Sentiment Analyzer [1.4630964945453113]
This work presented a new multi-class Urdu dataset based on user evaluations. Our proposed dataset includes 10,000 reviews that have been carefully classified into two categories by human experts: positive, negative. Five different lexicon- and rule-based algorithms including Naivebayes, Stanza, Textblob, Vader, and Flair are employed and the experimental results show that Flair with an accuracy of 70% outperforms other tested algorithms.
arXiv Detail & Related papers (2022-07-19T10:11:22Z)
NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis [5.048355865260207]
We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria. The dataset consists of around 30,000 annotated tweets per language. We release the datasets, trained models, sentiment lexicons, and code to incentivize research on sentiment analysis in under-represented languages.
arXiv Detail & Related papers (2022-01-20T16:28:06Z)
Multilingual transfer of acoustic word embeddings improves when training on languages related to the target zero-resource language [32.170748231414365]
We show that training on even just a single related language gives the largest gain. We also find that adding data from unrelated languages generally doesn't hurt performance.
arXiv Detail & Related papers (2021-06-24T08:37:05Z)
Blow the Dog Whistle: A Chinese Dataset for Cant Understanding with Common Sense and World Knowledge [49.288196234823005]
Cant is important for understanding advertising, comedies and dog-whistle politics. We propose a large and diverse Chinese dataset for creating and understanding cant.
arXiv Detail & Related papers (2021-04-06T17:55:43Z)
The first large scale collection of diverse Hausa language datasets [0.0]
Hausa is considered well-studied and documented language among the sub-Saharan African languages. It is estimated that over 100 million people speak the language. We provide an expansive collection of curated datasets consisting of both formal and informal forms of the language.
arXiv Detail & Related papers (2021-02-13T19:34:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.