L3CubeMahaSent: A Marathi Tweet-based Sentiment Analysis Dataset
- URL: http://arxiv.org/abs/2103.11408v1
- Date: Sun, 21 Mar 2021 14:22:13 GMT
- Title: L3CubeMahaSent: A Marathi Tweet-based Sentiment Analysis Dataset
- Authors: Atharva Kulkarni, Meet Mandhane, Manali Likhitkar, Gayatri Kshirsagar,
Raviraj Joshi
- Abstract summary: This paper presents the first major publicly available Marathi Sentiment Analysis dataset, L3CubeMahaSent.
It is curated using tweets extracted from various Maharashtrian personalities' Twitter accounts.
The dataset consists of about 16,000 distinct tweets classified into three broad classes: positive, negative, and neutral.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sentiment analysis is one of the most fundamental tasks in Natural Language
Processing. Popular languages like English, Arabic, Russian, Mandarin, and also
Indian languages such as Hindi, Bengali, Tamil have seen a significant amount
of work in this area. However, the Marathi language, which is the third most
popular language in India, still lags behind due to the absence of proper
datasets. In this paper, we present the first major publicly available Marathi
Sentiment Analysis Dataset - L3CubeMahaSent. It is curated using tweets
extracted from various Maharashtrian personalities' Twitter accounts. Our
dataset consists of ~16,000 distinct tweets classified in three broad classes
viz. positive, negative, and neutral. We also present the guidelines using
which we annotated the tweets. Finally, we present the statistics of our
dataset and baseline classification results using CNN, LSTM, ULMFiT, and
BERT-based deep learning models.
Related papers
- Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z)
- Breaking Language Barriers: A Question Answering Dataset for Hindi and Marathi [1.03590082373586]
This paper focuses on developing a Question Answering dataset for two such languages: Hindi and Marathi.
Despite Hindi being the 3rd most spoken language worldwide, and Marathi being the 11th most spoken language globally, both languages face limited resources for building efficient Question Answering systems.
We release the largest Question-Answering dataset available for these languages, with each dataset containing 28,000 samples.
arXiv Detail & Related papers (2023-08-19T00:39:21Z)
- AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages [45.88640066767242]
Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity among all continents.
Yet, there is little NLP research conducted on African languages. Crucial to enabling such research is the availability of high-quality annotated datasets.
In this paper, we introduce AfriSenti, a sentiment analysis benchmark that contains a total of >110,000 tweets in 14 African languages.
arXiv Detail & Related papers (2023-02-17T15:40:12Z)
- MultiSpider: Towards Benchmarking Multilingual Text-to-SQL Semantic Parsing [48.216386761482525]
We present MultiSpider, the largest multilingual text-to-SQL dataset, which covers seven languages (English, German, French, Spanish, Japanese, Chinese, and Vietnamese).
Experimental results under three typical settings (zero-shot, monolingual and multilingual) reveal a 6.1% absolute drop in accuracy in non-English languages.
We also propose a simple augmentation framework, SAVe (Schema-Augmentation-with-Verification), which boosts the overall performance by about 1.8% and closes the 29.5% performance gap across languages.
arXiv Detail & Related papers (2022-12-27T13:58:30Z)
- L3Cube-MahaNLP: Marathi Natural Language Processing Datasets, Models, and Library [1.14219428942199]
Despite being the third most popular language in India, the Marathi language lacks useful NLP resources.
With L3Cube-MahaNLP, we aim to build resources and a library for Marathi natural language processing.
We present datasets and transformer models for supervised tasks like sentiment analysis, named entity recognition, and hate speech detection.
arXiv Detail & Related papers (2022-05-29T17:51:00Z)
- Mono vs Multilingual BERT for Hate Speech Detection and Text Classification: A Case Study in Marathi [0.966840768820136]
We focus on the Marathi language and evaluate the models on the datasets for hate speech detection, sentiment analysis and simple text classification in Marathi.
We use standard multilingual models such as mBERT, indicBERT and xlm-RoBERTa and compare with MahaBERT, MahaALBERT and MahaRoBERTa, the monolingual models for Marathi.
We show that monolingual MahaBERT based models provide rich representations as compared to sentence embeddings from multi-lingual counterparts.
arXiv Detail & Related papers (2022-04-19T05:07:58Z)
- L3Cube-MahaHate: A Tweet-based Marathi Hate Speech Detection Dataset and BERT models [0.7874708385247353]
In India, Marathi is one of the most popular languages used by a wide audience.
In this work, we present L3Cube-MahaHate, the first major Hate Speech dataset in Marathi.
arXiv Detail & Related papers (2022-03-25T17:00:33Z)
- NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis [5.048355865260207]
We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria.
The dataset consists of around 30,000 annotated tweets per language.
We release the datasets, trained models, sentiment lexicons, and code to incentivize research on sentiment analysis in under-represented languages.
arXiv Detail & Related papers (2022-01-20T16:28:06Z)
- Challenge Dataset of Cognates and False Friend Pairs from Indian Languages [54.6340870873525]
Cognates are present in multiple variants of the same text across different languages.
In this paper, we describe the creation of two cognate datasets for twelve Indian languages.
arXiv Detail & Related papers (2021-12-17T14:23:43Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as the informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study assesses existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- A Multilingual Parallel Corpora Collection Effort for Indian Languages [43.62422999765863]
We present sentence aligned parallel corpora across 10 Indian languages - Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, Punjabi, and English.
The corpora are compiled from online sources which have content shared across languages.
arXiv Detail & Related papers (2020-07-15T14:00:18Z)