COVID-19-related Nepali Tweets Classification in a Low Resource Setting
- URL: http://arxiv.org/abs/2210.05425v1
- Date: Tue, 11 Oct 2022 13:08:37 GMT
- Title: COVID-19-related Nepali Tweets Classification in a Low Resource Setting
- Authors: Rabin Adhikari, Safal Thapaliya, Nirajan Basnet, Samip Poudel, Aman Shakya, Bishesh Khanal
- Abstract summary: We identify the eight most common COVID-19 discussion topics among the Twitter community using the Nepali language.
We compare the performance of two state-of-the-art multi-lingual language models for Nepali tweet classification.
- Score: 0.15658704610960567
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Billions of people across the globe have been using social media platforms in
their local languages to voice their opinions about the various topics related
to the COVID-19 pandemic. Several organizations, including the World Health
Organization, have developed automated social media analysis tools that
classify COVID-19-related tweets into various topics. However, these tools that
help combat the pandemic are limited to very few languages, leaving several
countries unable to benefit from them. While multi-lingual or low-resource
language-specific tools are being developed, their coverage still needs to
expand to languages such as Nepali. In this paper, we identify the eight
most common COVID-19 discussion topics among the Twitter community using the
Nepali language, set up an online platform to automatically gather Nepali
tweets containing the COVID-19-related keywords, classify the tweets into the
eight topics, and visualize the results over time in a web-based
dashboard. We compare the performance of two state-of-the-art multi-lingual
language models for Nepali tweet classification, one generic (mBERT) and the
other specific to the Nepali language family (MuRIL). Our results show that the
models' relative performance depends on the data size, with MuRIL doing better
for a larger dataset. The annotated data, models, and the web-based dashboard
are open-sourced at https://github.com/naamiinepal/covid-tweet-classification.
Related papers
- SenWave: A Fine-Grained Multi-Language Sentiment Analysis Dataset Sourced from COVID-19 Tweets [42.98177831933239]
SenWave is a novel fine-grained multi-language sentiment analysis dataset specifically designed for analyzing COVID-19 tweets.
The dataset comprises 10,000 annotated tweets each in English and Arabic, along with 30,000 translated tweets in Spanish, French, and Italian, derived from English tweets.
Our study provides an in-depth analysis of the evolving emotional landscape across languages, countries, and topics, revealing significant insights over time.
arXiv Detail & Related papers (2025-10-09T13:38:05Z)
- Development of Pre-Trained Transformer-based Models for the Nepali Language [0.0]
The Nepali language, spoken by approximately 32 million people worldwide, remains significantly underrepresented in this domain.
We have collected 27.5 GB of Nepali text data, approximately 2.4x larger than any previously available Nepali language corpus.
Our models outperformed the existing best model by 2 points on the Nep-gLUE benchmark, scoring 95.60, and also outperformed existing models on text generation tasks.
arXiv Detail & Related papers (2024-11-24T06:38:24Z)
- Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z)
- Paramanu: A Family of Novel Efficient Generative Foundation Language Models for Indian Languages [3.9018931027384056]
We present "Paramanu", a family of novel language models (LM) for Indian languages.
It covers 10 languages (Assamese, Bangla, Hindi, Konkani, Maithili, Marathi, Odia, Sanskrit, Tamil, Telugu) across 5 scripts.
The models are pretrained on a single GPU with a context size of 1024 and vary in size from 13.29 million (M) to 367.5 M parameters.
arXiv Detail & Related papers (2024-01-31T17:58:10Z)
- Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages [62.730361829175415]
MIRACL is a multilingual dataset we have built for the WSDM 2023 Cup challenge.
It focuses on ad hoc retrieval across 18 different languages.
Our goal is to spur research that will improve retrieval across a continuum of languages.
arXiv Detail & Related papers (2022-10-18T16:47:18Z)
- Multi-channel CNN to classify nepali covid-19 related tweets using hybrid features [1.713291434132985]
We represent each tweet by combining both syntactic and semantic information, called hybrid features.
We design a novel multi-channel convolutional neural network (MCNN), which ensembles multiple CNNs.
We evaluate the efficacy of both the proposed feature extraction method and the MCNN model for classifying tweets on the NepCOV19Tweets dataset.
arXiv Detail & Related papers (2022-03-19T09:55:05Z)
- A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
- Towards Building ASR Systems for the Next Billion Users [15.867823754118422]
We make contributions towards building ASR systems for low resource languages from the Indian subcontinent.
First, we curate 17,000 hours of raw speech data for 40 Indian languages.
Using this raw speech data we pretrain several variants of wav2vec style models for 40 Indian languages.
arXiv Detail & Related papers (2021-11-06T19:34:33Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as their informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study assesses existing language models in distinguishing the sentiment expressed in tweets by using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- No Rumours Please! A Multi-Indic-Lingual Approach for COVID Fake-Tweet Detection [4.411285005377513]
We propose an approach to detect fake news about COVID-19 early on from social media, such as tweets, for multiple Indic languages besides English.
To expand our approach to multiple Indic languages, we resort to an mBERT-based model fine-tuned on a dataset created in Hindi and Bengali.
Our approach reaches around 89% F-Score in fake tweet detection, which surpasses the state-of-the-art (SOTA) results.
arXiv Detail & Related papers (2020-10-14T09:37:51Z)
- TICO-19: the Translation Initiative for Covid-19 [112.5601530395345]
The Translation Initiative for COvid-19 (TICO-19) has made test and development data available to AI and MT researchers in 35 different languages.
The same data is translated into all of the languages represented, meaning that testing or development can be done for any pairing of languages in the set.
arXiv Detail & Related papers (2020-07-03T16:26:17Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified, with over 11,000 speakers and over 60 accents.
CoVoST is released under a CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)