Establishing Baselines for Text Classification in Low-Resource Languages
- URL: http://arxiv.org/abs/2005.02068v1
- Date: Tue, 5 May 2020 11:17:07 GMT
- Title: Establishing Baselines for Text Classification in Low-Resource Languages
- Authors: Jan Christian Blaise Cruz and Charibeth Cheng
- Abstract summary: We introduce two previously unreleased datasets as benchmark datasets for text classification.
Second, we pretrain better BERT and DistilBERT models for use within the Filipino setting.
Third, we introduce a simple degradation test that benchmarks a model's resistance to performance degradation as the number of training samples is reduced.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: While transformer-based finetuning techniques have proven effective in tasks
that involve low-resource, low-data environments, a lack of properly
established baselines and benchmark datasets makes it hard to compare different
approaches aimed at tackling the low-resource setting. In this work,
we provide three contributions. First, we introduce two previously unreleased
datasets as benchmark datasets for text classification and low-resource
multilabel text classification for the low-resource language Filipino. Second,
we pretrain better BERT and DistilBERT models for use within the Filipino
setting. Third, we introduce a simple degradation test that benchmarks a
model's resistance to performance degradation as the number of training samples
is reduced. We analyze our pretrained models' degradation speeds and look
towards the use of this method for comparing models aimed at operating within
the low-resource setting. We release all our models and datasets for the
research community to use.
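The degradation test described above can be sketched as follows. This is a minimal illustration only: the synthetic two-class data, the nearest-centroid classifier, and the subset fractions are assumptions standing in for the paper's actual Filipino benchmark datasets and BERT/DistilBERT models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data standing in for a text-classification task.
n = 1000
X = np.vstack([rng.normal(0.0, 1.0, (n, 5)), rng.normal(1.5, 1.0, (n, 5))])
y = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])
perm = rng.permutation(2 * n)
X, y = X[perm], y[perm]
X_tr, y_tr, X_te, y_te = X[:1500], y[:1500], X[1500:], y[1500:]

def nearest_centroid_accuracy(X_train, y_train, X_test, y_test):
    """Tiny stand-in classifier: label each point by its nearest class centroid."""
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
    dists = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float((dists.argmin(axis=1) == y_test).mean())

def degradation_curve(fractions, seed=1):
    """Retrain on progressively smaller training subsets and record test accuracy."""
    sub_rng = np.random.default_rng(seed)
    curve = {}
    for frac in fractions:
        size = max(4, int(len(X_tr) * frac))
        idx = sub_rng.choice(len(X_tr), size=size, replace=False)
        curve[frac] = nearest_centroid_accuracy(X_tr[idx], y_tr[idx], X_te, y_te)
    return curve

curve = degradation_curve([1.0, 0.5, 0.1, 0.01])
for frac, acc in sorted(curve.items(), reverse=True):
    print(f"{frac:>5.0%} of training data -> accuracy {acc:.3f}")
```

Comparing how quickly accuracy falls off between models, rather than the absolute scores, is the quantity of interest: a slower drop indicates greater resistance to performance degradation in the low-resource setting.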
Related papers
- Improving Sampling Methods for Fine-tuning SentenceBERT in Text Streams [49.3179290313959]
This study explores the efficacy of seven text sampling methods designed to selectively fine-tune language models.
We precisely assess the impact of these methods on fine-tuning the SBERT model using four different loss functions.
Our findings indicate that Softmax loss and Batch All Triplets loss are particularly effective for text stream classification.
arXiv Detail & Related papers (2024-03-18T23:41:52Z)
- MoSECroT: Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer [50.40191599304911]
We introduce MoSECroT Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer.
In this paper, we present the first framework that leverages relative representations to construct a common space for the embeddings of a source language PLM and the static word embeddings of a target language.
We show that although our proposed framework is competitive with weak baselines when addressing MoSECroT, it fails to achieve competitive results compared with some strong baselines.
arXiv Detail & Related papers (2024-01-09T21:09:07Z)
- Strategies for improving low resource speech to text translation relying on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech-to-text translation (ST).
We conducted experiments on both simulated and real low-resource setups, on the language pairs English-Portuguese and Tamasheq-French, respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z)
- High-Resource Methodological Bias in Low-Resource Investigations [27.419604203739052]
We show that downsampling from a high-resource language results in datasets with different properties than the low-resource datasets.
We conclude that naive downsampling of datasets results in a biased view of how well these systems work in a low-resource scenario.
arXiv Detail & Related papers (2022-11-14T17:04:38Z)
- Semi-Supervised Learning Based on Reference Model for Low-resource TTS [32.731900584216724]
We propose a semi-supervised learning method for neural TTS in which labeled target data is limited.
Experimental results show that our proposed semi-supervised learning scheme with limited target data significantly improves voice quality on test data, achieving naturalness and robustness in speech synthesis.
arXiv Detail & Related papers (2022-10-25T07:48:07Z)
- Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study [51.33182775762785]
This paper presents an empirical study to build relation extraction systems in low-resource settings.
We investigate three schemes to evaluate the performance in low-resource settings: (i) different types of prompt-based methods with few-shot labeled data; (ii) diverse balancing methods to address the long-tailed distribution issue; and (iii) data augmentation technologies and self-training to generate more labeled in-domain data.
arXiv Detail & Related papers (2022-10-19T15:46:37Z)
- Variational Information Bottleneck for Effective Low-Resource Fine-Tuning [40.66716433803935]
We propose to use Variational Information Bottleneck (VIB) to suppress irrelevant features when fine-tuning on low-resource target tasks.
We show that our VIB model finds sentence representations that are more robust to biases in natural language inference datasets.
arXiv Detail & Related papers (2021-06-10T03:08:13Z)
- Fine-tuning BERT for Low-Resource Natural Language Understanding via Active Learning [30.5853328612593]
In this work, we explore fine-tuning methods for BERT, a pre-trained Transformer-based language model.
Our experimental results show an advantage in model performance by maximizing the approximate knowledge gain of the model.
We analyze the benefits of freezing layers of the language model during fine-tuning to reduce the number of trainable parameters.
arXiv Detail & Related papers (2020-12-04T08:34:39Z)
- DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks [88.62288327934499]
We propose a novel augmentation method with language models trained on the linearized labeled sentences.
Our method is applicable to both supervised and semi-supervised settings.
arXiv Detail & Related papers (2020-11-03T07:49:15Z)
- BREEDS: Benchmarks for Subpopulation Shift [98.90314444545204]
We develop a methodology for assessing the robustness of models to subpopulation shift.
We leverage the class structure underlying existing datasets to control the data subpopulations that comprise the training and test distributions.
Applying this methodology to the ImageNet dataset, we create a suite of subpopulation shift benchmarks of varying granularity.
arXiv Detail & Related papers (2020-08-11T17:04:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.