IndiText Boost: Text Augmentation for Low Resource India Languages
- URL: http://arxiv.org/abs/2401.13085v1
- Date: Tue, 23 Jan 2024 20:54:40 GMT
- Title: IndiText Boost: Text Augmentation for Low Resource India Languages
- Authors: Onkar Litake, Niraj Yagnik and Shreyas Labhsetwar
- Abstract summary: We focus on implementing techniques like Easy Data Augmentation, Back Translation, Paraphrasing, Text Generation using LLMs, and Text Expansion using LLMs for text classification on different languages.
To our knowledge, no such work exists for text augmentation on Indian languages.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text augmentation is an important task for low-resource languages
because it helps deal with the problem of data scarcity. Over the years, much
work has been done on data augmentation for the English language. In contrast,
very little work has been done on Indian languages. This is surprising, since
data augmentation exists precisely to deal with data scarcity. In this work, we
focus on implementing techniques like Easy Data Augmentation, Back Translation,
Paraphrasing, Text Generation using LLMs, and Text Expansion using LLMs for
text classification on different languages. We focus on 6 Indian languages,
namely Sindhi, Marathi, Hindi, Gujarati, Telugu, and Sanskrit. To our
knowledge, no such work exists for text augmentation on Indian languages.
We carry out binary as well as multi-class text classification to make our
results more comparable. Surprisingly, basic data augmentation techniques
outperform the LLM-based ones.
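As an illustration of the "basic" techniques named in the abstract, the following is a minimal sketch of two of the four Easy Data Augmentation operations (random swap and random deletion). It is not the authors' implementation: the function names, the default parameters, and the whitespace tokenization are assumptions made for this example. The other two EDA operations (synonym replacement and random insertion) need a synonym resource for the target language and are omitted here.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Return a copy of tokens with n_swaps random pairs of positions exchanged."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, p, rng):
    """Drop each token independently with probability p, keeping at least one."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() >= p]
    # If every token was dropped, fall back to a single random token.
    return kept if kept else [rng.choice(tokens)]


def augment(sentence, n_aug=4, p_delete=0.1, n_swaps=1, seed=0):
    """Produce n_aug variants, alternating between swap and deletion.

    Whitespace tokenization is an assumption of this sketch; it is only a
    rough approximation for the languages discussed in the abstract.
    """
    rng = random.Random(seed)
    tokens = sentence.split()
    variants = []
    for k in range(n_aug):
        if k % 2 == 0:
            new_tokens = random_swap(tokens, n_swaps, rng)
        else:
            new_tokens = random_deletion(tokens, p_delete, rng)
        variants.append(" ".join(new_tokens))
    return variants


if __name__ == "__main__":
    # Hypothetical Hindi example sentence ("This film is very good").
    for variant in augment("यह फिल्म बहुत अच्छी है"):
        print(variant)
```

Each variant keeps the label of the original sentence, so the augmented set can be appended to the training data of a text classifier. The fixed seed only makes the sketch reproducible.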
Related papers
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
- Why Not Transform Chat Large Language Models to Non-English? [57.16587777261422]
The scarcity of non-English data limits the development of non-English large language models (LLMs).
TransLLM divides the transfer problem into some common sub-tasks with the translation chain-of-thought.
Our method, using only single-turn data, outperforms strong baselines and ChatGPT on multi-turn benchmark MT-bench.
arXiv Detail & Related papers (2024-05-22T18:53:25Z)
- IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages [37.79850860981589]
This work introduces an expansive suite of resources specifically designed for the development of Indic LLMs.
Our approach combines highly curated manually verified data, unverified yet valuable data, and synthetic data.
For instruction fine-tuning, we amalgamate existing Indic datasets, translate/transliterate English datasets into Indian languages, and utilize LLaMa2 and Mixtral models.
arXiv Detail & Related papers (2024-03-11T00:46:56Z)
- UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised Fine-tuning Dataset [69.33424532827608]
Open-source large language models (LLMs) have gained significant strength across diverse fields.
In this work, we construct an open-source multilingual supervised fine-tuning dataset.
The resulting UltraLink dataset comprises approximately 1 million samples across five languages.
arXiv Detail & Related papers (2024-02-07T05:05:53Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Breaking Language Barriers: A Question Answering Dataset for Hindi and Marathi [1.03590082373586]
This paper focuses on developing a Question Answering dataset for two such languages: Hindi and Marathi.
Despite Hindi being the 3rd most spoken language worldwide, and Marathi being the 11th most spoken language globally, both languages face limited resources for building efficient Question Answering systems.
We release the largest Question-Answering dataset available for these languages, with each dataset containing 28,000 samples.
arXiv Detail & Related papers (2023-08-19T00:39:21Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- Transfer Learning for Scene Text Recognition in Indian Languages [27.609596088151644]
We investigate the power of transfer learning for all the layers of deep scene text recognition networks from English to two common Indian languages.
We show that the transfer of English models to simple synthetic datasets of Indian languages is not practical.
We set new benchmarks for scene-text recognition on Hindi, Telugu, and Malayalam datasets from IIIT-ILST and Bangla dataset from MLT-17.
arXiv Detail & Related papers (2022-01-10T06:14:49Z)
- Hate and Offensive Speech Detection in Hindi and Marathi [0.0]
Hate and offensive speech detection still faces a challenge due to the inadequate availability of data.
In this work, we consider hate and offensive speech detection in Hindi and Marathi texts.
We explore different deep learning architectures like CNN, LSTM, and variations of BERT like multilingual BERT, IndicBERT, and monolingual RoBERTa.
We show that the transformer-based models perform the best and even the basic models along with FastText embeddings give a competitive performance.
arXiv Detail & Related papers (2021-10-23T11:57:36Z)
- Cross-lingual Offensive Language Identification for Low Resource Languages: The Case of Marathi [2.4737119633827174]
MOLD is the first dataset of its kind compiled for Marathi, opening a new domain for research in low-resource Indo-Aryan languages.
We present results from several machine learning experiments on this dataset, including zero-shot and other transfer learning experiments on state-of-the-art cross-lingual transformers.
arXiv Detail & Related papers (2021-09-08T11:29:44Z)
- Facebook AI's WMT20 News Translation Task Submission [69.92594751788403]
This paper describes Facebook AI's submission to WMT20 shared news translation task.
We focus on the low resource setting and participate in two language pairs, Tamil -> English and Inuktitut -> English.
We approach the low resource problem using two main strategies, leveraging all available data and adapting the system to the target news domain.
arXiv Detail & Related papers (2020-11-16T21:49:00Z)