Performance of Data Augmentation Methods for Brazilian Portuguese Text
Classification
- URL: http://arxiv.org/abs/2304.02785v1
- Date: Wed, 5 Apr 2023 23:13:37 GMT
- Title: Performance of Data Augmentation Methods for Brazilian Portuguese Text
Classification
- Authors: Marcellus Amadeus and Paulo Branco
- Abstract summary: In this work, we took advantage of different existing data augmentation methods to analyze their performance on text classification problems using Brazilian Portuguese corpora.
Our analysis shows putative improvements from some of these techniques; however, it also suggests that language bias and the scarcity of non-English text data warrant further exploration.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Improving machine learning performance while increasing model generalization
has been a constantly pursued goal of AI researchers. Data augmentation
techniques are often used toward this end, and most of their
evaluation is done using English corpora. In this work, we took advantage of
different existing data augmentation methods to analyze their performance when
applied to text classification problems using Brazilian Portuguese corpora. As
a result, our analysis shows putative improvements from some of these
techniques; however, it also suggests that language bias and the scarcity of
non-English text data warrant further exploration.
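The paper does not include code, but the class of techniques it evaluates can be illustrated with a minimal, language-agnostic sketch of two EDA-style operations (random swap and random deletion) applied to a Brazilian Portuguese sentence. The function names and parameters below are illustrative, not taken from the paper:

```python
import random

def random_swap(tokens, n_swaps, rng):
    # Swap two randomly chosen token positions n_swaps times (EDA-style).
    tokens = list(tokens)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p, rng):
    # Drop each token independently with probability p; keep at least one.
    kept = [t for t in tokens if rng.random() > p]
    return kept or [rng.choice(tokens)]

def augment(sentence, n_aug=4, p_del=0.1, seed=0):
    # Produce n_aug augmented variants of a single training sentence,
    # alternating between the two operations.
    rng = random.Random(seed)
    tokens = sentence.split()
    variants = []
    for k in range(n_aug):
        if k % 2 == 0:
            variants.append(" ".join(random_swap(tokens, 1, rng)))
        else:
            variants.append(" ".join(random_deletion(tokens, p_del, rng)))
    return variants

print(augment("o filme foi muito bom e divertido"))
```

Because these operations only rearrange or drop existing tokens, they need no external linguistic resources, which is part of why such methods are attractive for non-English, lower-resource corpora.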
Related papers
- Evaluating the Effectiveness of Data Augmentation for Emotion Classification in Low-Resource Settings [1.387446067205368]
We evaluated the effectiveness of different data augmentation techniques for a multi-label emotion classification task using a low-resource dataset.
Back Translation outperformed autoencoder-based approaches, and generating multiple examples per training instance led to further performance improvement.
arXiv Detail & Related papers (2024-06-07T18:13:27Z)
- On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation [71.72465617754553]
We generate "low-level" sentences that convey object-centric, three-dimensional spatial relationships, incorporate them as additional language priors and evaluate their downstream impact on depth estimation.
Our key finding is that current language-guided depth estimators perform optimally only with scene-level descriptions.
Despite leveraging additional data, these methods are not robust to directed adversarial attacks and decline in performance with an increase in distribution shift.
arXiv Detail & Related papers (2024-04-12T15:35:20Z)
- Improving Sampling Methods for Fine-tuning SentenceBERT in Text Streams [49.3179290313959]
This study explores the efficacy of seven text sampling methods designed to selectively fine-tune language models.
We precisely assess the impact of these methods on fine-tuning the SBERT model using four different loss functions.
Our findings indicate that Softmax loss and Batch All Triplets loss are particularly effective for text stream classification.
arXiv Detail & Related papers (2024-03-18T23:41:52Z)
- American Sign Language Video to Text Translation [0.0]
Sign language to text is a crucial technology that can break down communication barriers for individuals with hearing difficulties.
We evaluate models using BLEU and rBLEU metrics to ensure translation quality.
arXiv Detail & Related papers (2024-02-11T17:46:33Z)
- Language Model Decoding as Direct Metrics Optimization [87.68281625776282]
Current decoding methods struggle to generate texts that align with human texts across different aspects.
In this work, we frame decoding from a language model as an optimization problem with the goal of strictly matching the expected performance with human texts.
We prove that this induced distribution is guaranteed to improve the perplexity on human texts, which suggests a better approximation to the underlying distribution of human texts.
arXiv Detail & Related papers (2023-10-02T09:35:27Z)
- Robust Sentiment Analysis for Low Resource Languages Using Data Augmentation Approaches: A Case Study in Marathi [0.9553673944187253]
Sentiment analysis plays a crucial role in understanding the sentiment expressed in text data.
There exists a significant gap in research efforts for sentiment analysis in low-resource languages.
We present an exhaustive study of data augmentation approaches for the low-resource Indic language Marathi.
arXiv Detail & Related papers (2023-10-01T17:09:31Z)
- Sentiment Analysis on Brazilian Portuguese User Reviews [0.0]
This work analyzes the predictive performance of a range of document embedding strategies, assuming the polarity as the system outcome.
This analysis includes five sentiment analysis datasets in Brazilian Portuguese, unified in a single dataset, and a reference partitioning in training, testing, and validation sets, both made publicly available through a digital repository.
arXiv Detail & Related papers (2021-12-10T11:18:26Z)
- On the Language Coverage Bias for Neural Machine Translation [81.81456880770762]
Language coverage bias is important for neural machine translation (NMT) because the target-original training data is not well exploited in current practice.
By carefully designing experiments, we provide comprehensive analyses of the language coverage bias in the training data.
We propose two simple and effective approaches to alleviate the language coverage bias problem.
arXiv Detail & Related papers (2021-06-07T01:55:34Z)
- Improving Cross-Lingual Reading Comprehension with Self-Training [62.73937175625953]
Current state-of-the-art models even surpass human performance on several benchmarks.
Previous works have revealed the abilities of pre-trained multilingual models for zero-shot cross-lingual reading comprehension.
This paper further utilizes unlabeled data to improve performance.
arXiv Detail & Related papers (2021-05-08T08:04:30Z)
- Dynamic Data Selection and Weighting for Iterative Back-Translation [116.14378571769045]
We propose a curriculum learning strategy for iterative back-translation models.
We evaluate our models on domain adaptation, low-resource, and high-resource MT settings.
Experimental results demonstrate that our methods achieve improvements of up to 1.8 BLEU points over competitive baselines.
arXiv Detail & Related papers (2020-04-07T19:49:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.