Data Augmentation and Hyperparameter Tuning for Low-Resource MFA
- URL: http://arxiv.org/abs/2504.07024v1
- Date: Wed, 09 Apr 2025 16:38:45 GMT
- Title: Data Augmentation and Hyperparameter Tuning for Low-Resource MFA
- Authors: Alessio Tosolini, Claire Bowern
- Abstract summary: We use data augmentation methods to increase corpus size. Unlike text augmentation, audio augmentation does not lead to substantially increased performance. For languages with small to medium amounts of training data, this is a workable alternative to adapting models from high-resource languages.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: A continued issue for those working with computational tools and endangered and under-resourced languages is the lower accuracy of results for languages with smaller amounts of data. We attempt to ameliorate this issue by using data augmentation methods to increase corpus size, comparing augmentation to hyperparameter tuning for multilingual forced alignment. Unlike text augmentation methods, audio augmentation does not lead to substantially increased performance. Hyperparameter tuning, on the other hand, results in substantial improvement without (for this amount of data) infeasible additional training time. For languages with small to medium amounts of training data, this is a workable alternative to adapting models from high-resource languages.
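For a concrete picture of what the audio augmentation compared here might look like, the sketch below writes pitch-shifted and noise-added copies of a recording so that each augmented copy can reuse the original transcript before aligner training. The specific transforms, parameter values, and file layout are illustrative assumptions rather than the authors' exact pipeline; it assumes librosa, soundfile, and numpy are available.

```python
# Minimal sketch of audio augmentation for corpus expansion (assumed transforms:
# pitch shift and additive Gaussian noise; not the paper's exact recipe).
import numpy as np
import librosa
import soundfile as sf

def augment_recording(wav_path, out_prefix, n_steps=2, noise_snr_db=20.0):
    """Write pitch-shifted and noise-added copies of one recording."""
    y, sr = librosa.load(wav_path, sr=None)  # keep the original sample rate

    # Pitch variation: shift by a fixed number of semitones (hypothetical choice).
    y_pitch = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    sf.write(f"{out_prefix}_pitch.wav", y_pitch, sr)

    # Noise addition: Gaussian noise scaled to a target signal-to-noise ratio.
    signal_power = np.mean(y ** 2)
    noise_power = signal_power / (10 ** (noise_snr_db / 10))
    y_noise = y + np.random.normal(0.0, np.sqrt(noise_power), size=y.shape)
    sf.write(f"{out_prefix}_noise.wav", y_noise.astype(np.float32), sr)

# Hypothetical usage: the transcript/label files for the original recording are
# duplicated alongside each augmented wav so the aligner sees a larger corpus.
# augment_recording("corpus/spk1_utt1.wav", "corpus_aug/spk1_utt1")
```

As the abstract reports, this kind of corpus expansion did not substantially improve alignment accuracy, so the sketch only illustrates the baseline that hyperparameter tuning is compared against.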
Related papers
- Train More Parameters But Mind Their Placement: Insights into Language Adaptation with PEFT [0.8702432681310401]
We aim to enhance the generation performance of an LLM by specialising it using unstructured text corpora. We find that increasing the number of trainable parameters leads to better and more robust language adaptation. Although improvements are consistent in 0-shot summarisation, some adapted models struggle with longer context lengths.
arXiv Detail & Related papers (2024-12-17T08:44:00Z) - Training Bilingual LMs with Data Constraints in the Targeted Language [17.623676545426477]
We study how to boost pretrained model performance in a target language with insufficient pretraining data, quantifying the performance gap between training with data in a data-rich auxiliary language and training in the target language.
arXiv Detail & Related papers (2024-11-20T02:27:40Z) - Reduce, Reuse, Recycle: Is Perturbed Data better than Other Language augmentation for Low Resource Self-Supervised Speech Models [48.44820587495038]
Self-supervised representation learning (SSRL) has demonstrated superior performance to supervised models on tasks including phoneme recognition.
Training SSRL models poses a challenge for low-resource languages where sufficient pre-training data may not be available.
We propose using audio augmentation techniques, namely pitch variation, noise addition, accented target-language speech, and other-language speech, to pre-train SSRL models in a low-resource condition, and we evaluate phoneme recognition.
arXiv Detail & Related papers (2023-09-22T10:09:09Z) - Distributional Data Augmentation Methods for Low Resource Language [0.9208007322096533]
Easy data augmentation (EDA) augments the training data by injecting and replacing synonyms and randomly permuting sentences.
One major obstacle with EDA is the need for versatile and complete synonym dictionaries, which cannot be easily found in low-resource languages.
We propose two extensions, easy distributional data augmentation (EDDA) and type-specific similar word replacement (TSSR), which use semantic word context information and part-of-speech tags for word replacement and augmentation; a toy sketch of this style of augmentation appears after this list.
arXiv Detail & Related papers (2023-09-09T19:01:59Z) - Parameter-Efficient Neural Reranking for Cross-Lingual and Multilingual Retrieval [66.69799641522133]
State-of-the-art neural (re)rankers are notoriously data hungry.
Current approaches typically transfer rankers trained on English data to other languages and cross-lingual setups by means of multilingual encoders.
We show that two parameter-efficient approaches to cross-lingual transfer, namely Sparse Fine-Tuning Masks (SFTMs) and Adapters, allow for a more lightweight and more effective zero-shot transfer.
arXiv Detail & Related papers (2022-04-05T15:44:27Z) - Visual Speech Recognition for Multiple Languages in the Wild [64.52593130370757]
We show that designing better VSR models is as important as using larger training sets.
We propose the addition of prediction-based auxiliary tasks to a VSR model.
We show that such a model works across different languages and outperforms all previous methods trained on publicly available datasets by a large margin.
arXiv Detail & Related papers (2022-02-26T07:21:00Z) - Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z) - Learning from Multiple Noisy Augmented Data Sets for Better Cross-Lingual Spoken Language Understanding [69.40915115518523]
Lack of training data presents a grand challenge to scaling out spoken language understanding (SLU) to low-resource languages.
Various data augmentation approaches have been proposed to synthesize training data in low-resource target languages.
In this paper we focus on mitigating noise in augmented data.
arXiv Detail & Related papers (2021-09-03T15:44:15Z) - Demystify Optimization Challenges in Multilingual Transformers [21.245418118851884]
We study optimization challenges from loss landscape and parameter plasticity perspectives.
We find that imbalanced training data causes task interference between high- and low-resource languages.
We propose Curvature Aware Task Scaling (CATS), which improves both optimization and generalization, especially for low-resource languages.
arXiv Detail & Related papers (2021-04-15T17:51:03Z) - Balancing Training for Multilingual Neural Machine Translation [130.54253367251738]
Multilingual machine translation (MT) models can translate to/from multiple languages.
Standard practice is to up-sample less resourced languages to increase representation.
We propose a method that instead automatically learns how to weight training data through a data scorer.
arXiv Detail & Related papers (2020-04-14T18:23:28Z)
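For the EDA-style operations described in the Distributional Data Augmentation entry above, the toy sketch below performs synonym replacement and random sentence permutation. The small synonym dictionary and the probability values are hypothetical stand-ins; EDDA and TSSR replace such static dictionaries with distributional similarity and part-of-speech constraints, which are not reproduced here.

```python
# Toy sketch of EDA-style text augmentation: synonym replacement plus random
# sentence permutation. The synonym dictionary is a hypothetical stand-in for
# the resources that EDDA/TSSR derive from distributional context and POS tags.
import random

SYNONYMS = {"small": ["little", "tiny"], "quick": ["fast", "rapid"]}  # toy dictionary

def synonym_replace(sentence, p=0.3, rng=random):
    """Replace each word that has a dictionary entry with probability p."""
    out = []
    for word in sentence.split():
        if word.lower() in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[word.lower()]))
        else:
            out.append(word)
    return " ".join(out)

def permute_sentences(sentences, rng=random):
    """Return a randomly reordered copy of a list of sentences."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled

corpus = ["the quick fox ran", "a small bird sang"]
augmented = [synonym_replace(s) for s in corpus] + permute_sentences(corpus)
print(augmented)
```

The incompleteness of the toy dictionary is exactly the obstacle the entry names for low-resource languages, which the distributional extensions are designed to address.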
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.