Robust Sentiment Analysis for Low Resource languages Using Data
Augmentation Approaches: A Case Study in Marathi
- URL: http://arxiv.org/abs/2310.00734v1
- Date: Sun, 1 Oct 2023 17:09:31 GMT
- Title: Robust Sentiment Analysis for Low Resource languages Using Data
Augmentation Approaches: A Case Study in Marathi
- Authors: Aabha Pingle, Aditya Vyawahare, Isha Joshi, Rahul Tangsali, Geetanjali
Kale, Raviraj Joshi
- Abstract summary: Sentiment analysis plays a crucial role in understanding the sentiment expressed in text data.
There exists a significant gap in research efforts for sentiment analysis in low-resource languages.
We present an exhaustive study of data augmentation approaches for the low-resource Indic language Marathi.
- Score: 0.9553673944187253
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sentiment analysis plays a crucial role in understanding the sentiment
expressed in text data. While sentiment analysis research has been extensively
conducted in English and other Western languages, there exists a significant
gap in research efforts for sentiment analysis in low-resource languages.
Limited resources, including datasets and NLP research, hinder the progress in
this area. In this work, we present an exhaustive study of data augmentation
approaches for the low-resource Indic language Marathi. Although
domain-specific datasets for sentiment analysis in Marathi exist, they often
fall short when applied to generalized and variable-length inputs. To address
this challenge, this research paper proposes four data augmentation techniques
for sentiment analysis in Marathi. The paper focuses on augmenting existing
datasets to compensate for the lack of sufficient resources. The primary
objective is to enhance sentiment analysis model performance in both in-domain
and cross-domain scenarios by leveraging data augmentation strategies. The
proposed data augmentation approaches yielded significant improvements in
cross-domain accuracy. The augmentation methods include paraphrasing,
back-translation; BERT-based random token replacement, named entity
replacement, and pseudo-label generation; GPT-based text and label generation.
Furthermore, these techniques can be extended to other low-resource languages
and for general text classification tasks.
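For concreteness, the sketch below illustrates one of the listed techniques, BERT-based random token replacement. It assumes the Hugging Face transformers library and the public bert-base-multilingual-cased checkpoint (a Marathi-specific masked language model could be substituted); the masking rate, candidate filtering, and model choice are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch of BERT-based random token replacement augmentation.
# Assumptions: the Hugging Face transformers library and the public
# bert-base-multilingual-cased checkpoint; the paper's exact model,
# masking rate, and filtering steps may differ.
import random
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")
MASK = fill_mask.tokenizer.mask_token  # "[MASK]" for BERT-style models

def augment(sentence: str, replace_prob: float = 0.15, top_k: int = 5) -> str:
    """Randomly mask words and replace them with masked-LM predictions."""
    words = sentence.split()
    augmented = list(words)
    for i, word in enumerate(words):
        if random.random() > replace_prob:
            continue
        masked = " ".join(words[:i] + [MASK] + words[i + 1:])
        predictions = fill_mask(masked, top_k=top_k)
        # Keep only predictions that actually change the word.
        choices = [p["token_str"] for p in predictions if p["token_str"] != word]
        if choices:
            augmented[i] = random.choice(choices)
    return " ".join(augmented)

# Labels are carried over unchanged from the source sentences:
# augmented_data = [(augment(text), label) for text, label in train_data]
```

The other augmentation families follow the same recipe of producing variants of labelled sentences and carrying the labels over (or, for pseudo-label generation, assigning labels to new unlabelled text).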
Related papers
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- When a Language Question Is at Stake. A Revisited Approach to Label Sensitive Content [0.0]
The article revisits an approach to pseudo-labelling sensitive data, using Ukrainian tweets covering the Russian-Ukrainian war as an example.
We provide a fundamental statistical analysis of the obtained data, evaluation of models used for pseudo-labelling, and set further guidelines on how the scientists can leverage the corpus.
arXiv Detail & Related papers (2023-11-17T13:35:10Z)
- Language Model Decoding as Direct Metrics Optimization [87.68281625776282]
Current decoding methods struggle to generate texts that align with human texts across different aspects.
In this work, we frame decoding from a language model as an optimization problem with the goal of strictly matching the expected performance with human texts.
We prove that this induced distribution is guaranteed to improve the perplexity on human texts, which suggests a better approximation to the underlying distribution of human texts.
arXiv Detail & Related papers (2023-10-02T09:35:27Z)
- AugGPT: Leveraging ChatGPT for Text Data Augmentation [59.76140039943385]
We propose a text data augmentation approach based on ChatGPT (named AugGPT).
AugGPT rephrases each sentence in the training samples into multiple conceptually similar but semantically different samples.
Experiment results on few-shot learning text classification tasks show the superior performance of the proposed AugGPT approach.
arXiv Detail & Related papers (2023-02-25T06:58:16Z)
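As a rough illustration of the rephrasing idea described in this entry (not the authors' released code), the snippet below prompts a chat model for paraphrases of each labelled sentence; the prompt wording, model name, and use of the openai Python client are assumptions.

```python
# Illustrative sketch of ChatGPT-style paraphrase augmentation (the AugGPT idea).
# Assumptions: the official openai Python client and a generic chat model name;
# the paper's actual prompts, model, and filtering may differ.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rephrase(sentence: str, n: int = 3, model: str = "gpt-3.5-turbo") -> list[str]:
    """Ask the chat model for n paraphrases that keep the meaning and sentiment."""
    prompt = (
        f"Rephrase the following sentence in {n} different ways, keeping the "
        f"original meaning and sentiment. Return one paraphrase per line.\n\n{sentence}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    lines = response.choices[0].message.content.strip().splitlines()
    return [line.strip() for line in lines if line.strip()][:n]

# Each paraphrase inherits the label of its source sentence:
# augmented = [(p, label) for text, label in train_data for p in rephrase(text)]
```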
- Cross-lingual Argument Mining in the Medical Domain [6.0158981171030685]
We show how to perform Argument Mining (AM) in medical texts for which no annotated data is available.
Our work shows that automatically translating and projecting annotations (data-transfer) from English to a given target language is an effective way to generate annotated data.
We also show how the automatically generated data in Spanish can also be used to improve results in the original English monolingual setting.
arXiv Detail & Related papers (2023-01-25T11:21:12Z)
- Research Trends and Applications of Data Augmentation Algorithms [77.34726150561087]
We identify the main areas of application of data augmentation algorithms, the types of algorithms used, significant research trends, their progression over time and research gaps in data augmentation literature.
We expect readers to understand the potential of data augmentation, as well as identify future research directions and open questions within data augmentation research.
arXiv Detail & Related papers (2022-07-18T11:38:32Z)
- A Dataset and BERT-based Models for Targeted Sentiment Analysis on Turkish Texts [0.0]
We present an annotated Turkish dataset suitable for targeted sentiment analysis.
We propose BERT-based models with different architectures to accomplish the task of targeted sentiment analysis.
arXiv Detail & Related papers (2022-05-09T10:57:39Z)
- To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP [0.0]
We investigate three categories of text augmentation methodologies which perform changes on the syntax.
We compare them on part-of-speech tagging, dependency parsing and semantic role labeling for a diverse set of language families.
Our results suggest that the augmentation techniques can further improve over strong baselines based on mBERT.
arXiv Detail & Related papers (2021-11-18T10:52:48Z)
- AUGVIC: Exploiting BiText Vicinity for Low-Resource NMT [9.797319790710711]
AUGVIC is a novel data augmentation framework for low-resource NMT.
It exploits the vicinal samples of the given bitext without using any extra monolingual data explicitly.
We show that AUGVIC helps to attenuate the discrepancies between relevant and distant-domain monolingual data in traditional back-translation.
arXiv Detail & Related papers (2021-06-09T15:29:18Z)
- FDMT: A Benchmark Dataset for Fine-grained Domain Adaptation in Machine Translation [53.87731008029645]
We present a real-world fine-grained domain adaptation task in machine translation (FDMT).
The FDMT dataset consists of four sub-domains of information technology: autonomous vehicles, AI education, real-time networks, and smart phones.
We make quantitative experiments and deep analyses in this new setting, which benchmarks the fine-grained domain adaptation task.
arXiv Detail & Related papers (2020-12-31T17:15:09Z)
- Named Entity Recognition for Social Media Texts with Semantic Augmentation [70.44281443975554]
Existing approaches for named entity recognition suffer from data sparsity problems when conducted on short and informal texts.
We propose a neural-based approach to NER for social media texts where both local (from running text) and augmented semantics are taken into account.
arXiv Detail & Related papers (2020-10-29T10:06:46Z)