Improving Indonesian Text Classification Using Multilingual Language Model
- URL: http://arxiv.org/abs/2009.05713v1
- Date: Sat, 12 Sep 2020 03:16:25 GMT
- Title: Improving Indonesian Text Classification Using Multilingual Language Model
- Authors: Ilham Firdausi Putra (1), Ayu Purwarianti (1 and 2) ((1) Institut Teknologi Bandung, (2) U-CoE AI-VLB)
- Abstract summary: This paper investigates the effect of combining English and Indonesian data on building Indonesian text classification models.
The experiments showed that adding English data improves performance, especially when the amount of Indonesian data is small.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Compared to English, the amount of labeled data for Indonesian text classification tasks is very small. Recently developed multilingual language models have shown their ability to create effective multilingual representations. This paper investigates the effect of combining English and Indonesian data on building Indonesian text classification models (e.g., sentiment analysis and hate speech detection) using multilingual language models. Using the feature-based approach, we observe performance across various Indonesian data sizes and amounts of added English data. The experiments showed that adding English data improves performance, especially when the amount of Indonesian data is small. Using the fine-tuning approach, we further showed the effectiveness of utilizing English data to build Indonesian text classification models.
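The paper itself does not ship code, but the two setups it describes can be sketched. First, the feature-based approach: a frozen multilingual encoder supplies sentence embeddings, and a lightweight classifier is trained on a mix of Indonesian and English examples. The sketch below is a minimal illustration assuming multilingual BERT (bert-base-multilingual-cased) via the HuggingFace transformers library and scikit-learn; the toy sentences and labels are hypothetical, not the paper's datasets.

```python
# Feature-based approach (sketch): freeze a multilingual encoder and train a
# lightweight classifier on pooled sentence embeddings. The tiny mixed
# Indonesian + English dataset below is illustrative, not the paper's data.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")
encoder.eval()

def embed(texts):
    """Mean-pool the last hidden states into one fixed-size vector per text."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state
    mask = enc["attention_mask"].unsqueeze(-1).float()
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Hypothetical mixed training set: Indonesian examples plus added English ones.
train_texts = [
    "Filmnya bagus sekali",       # Indonesian, positive
    "Pelayanannya mengecewakan",  # Indonesian, negative
    "The movie was fantastic",    # English, positive
    "The service was terrible",   # English, negative
]
train_labels = [1, 0, 1, 0]

clf = LogisticRegression(max_iter=1000).fit(embed(train_texts), train_labels)
print(clf.predict(embed(["Saya suka film ini"])))  # expect the positive class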
Related papers
- Towards Building an End-to-End Multilingual Automatic Lyrics Transcription Model [14.39119862985503]
We aim to create a multilingual ALT system with available datasets.
Inspired by architectures that have been proven effective for English ALT, we adapt these techniques to the multilingual scenario.
We evaluate the performance of the multilingual model in comparison to its monolingual counterparts.
arXiv Detail & Related papers (2024-06-25T15:02:32Z) - Multilingual Diversity Improves Vision-Language Representations [66.41030381363244]
Pre-training on a multilingually diverse dataset outperforms using English-only or English-dominated datasets on ImageNet.
On a geographically diverse task like GeoDE, we also observe improvements across all regions, with the biggest gain coming from Africa.
arXiv Detail & Related papers (2024-05-27T08:08:51Z) - Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages [55.963648108438555]
Large language models (LLMs) show remarkable human-like capability in various domains and languages.
We introduce Cendol, a collection of Indonesian LLMs encompassing both decoder-only and encoder-decoder architectures.
We highlight Cendol's effectiveness across a diverse array of tasks, attaining a 20% improvement, and demonstrate its capability to generalize.
arXiv Detail & Related papers (2024-04-09T09:04:30Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z) - Leveraging Language Identification to Enhance Code-Mixed Text Classification [0.7340017786387767]
Existing deep-learning models do not take advantage of the implicit language information in code-mixed text.
Our study aims to improve the performance of BERT-based models on low-resource code-mixed Hindi-English datasets.
arXiv Detail & Related papers (2023-06-08T06:43:10Z) - Improving Polish to English Neural Machine Translation with Transfer Learning: Effects of Data Volume and Language Similarity [2.4674086273775035]
We investigate the impact of data volume and the use of similar languages on transfer learning in a machine translation task.
We fine-tune the mBART model for a Polish-English translation task using the OPUS-100 dataset (a minimal fine-tuning sketch appears after this list).
Our experiments show that combining related languages with larger amounts of data outperforms models trained on related languages or on larger amounts of data alone.
arXiv Detail & Related papers (2023-06-01T13:34:21Z) - Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that: (1) multilingual models trained with more data outperform monolingual ones, but, when the amount of data is kept fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z) - Multilingual Neural Semantic Parsing for Low-Resourced Languages [1.6244541005112747]
We introduce a new multilingual semantic parsing dataset in English, Italian and Japanese.
We show that joint multilingual training with pretrained encoders substantially outperforms our baselines on the TOP dataset.
We find that a semantic parser trained only on English data achieves a zero-shot performance of 44.9% exact-match accuracy on Italian sentences.
arXiv Detail & Related papers (2021-06-07T09:53:02Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work presents a comparison of a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
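As a concrete reference point for the Polish-English transfer-learning entry above, here is a minimal sketch of one fine-tuning step of mBART on a parallel sentence pair. The entry's summary names only mBART and OPUS-100; the checkpoint choice (facebook/mbart-large-50, whose language codes include pl_PL and en_XX), the toy sentence pair, and the learning rate are assumptions for illustration.

```python
# Sketch: one fine-tuning step of mBART on a Polish -> English sentence pair.
# Checkpoint, language codes, and hyperparameters are illustrative assumptions.
import torch
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50", src_lang="pl_PL", tgt_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# A hypothetical parallel pair standing in for an OPUS-100 training batch.
batch = tokenizer(["Kot siedzi na macie."],
                  text_target=["The cat sits on the mat."],
                  padding=True, return_tensors="pt")

model.train()
loss = model(**batch).loss  # cross-entropy against the English target tokens
loss.backward()
optimizer.step()
print(float(loss))
```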