NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural
- URL: http://arxiv.org/abs/2403.01817v1
- Date: Mon, 4 Mar 2024 08:05:34 GMT
- Title: NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural
- Authors: Wilson Wongso, David Samuel Setiawan, Steven Limcorn, Ananto Joyoadikusumo
- Abstract summary: NusaBERT builds upon IndoBERT by incorporating vocabulary expansion and leveraging a diverse multilingual corpus that includes regional languages and dialects.
Through rigorous evaluation across a range of benchmarks, NusaBERT demonstrates state-of-the-art performance in tasks involving multiple languages of Indonesia.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Indonesia's linguistic landscape is remarkably diverse, encompassing over 700
languages and dialects, making it one of the world's most linguistically rich
nations. This diversity, coupled with the widespread practice of code-switching
and the presence of low-resource regional languages, presents unique challenges
for modern pre-trained language models. In response to these challenges, we
developed NusaBERT, building upon IndoBERT by incorporating vocabulary
expansion and leveraging a diverse multilingual corpus that includes regional
languages and dialects. Through rigorous evaluation across a range of
benchmarks, NusaBERT demonstrates state-of-the-art performance in tasks
involving multiple languages of Indonesia, paving the way for future natural
language understanding research for under-represented languages.
Related papers
- Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages [55.36534539177367]
This paper introduces Pangea, a multilingual multimodal large language model (MLLM) trained on a diverse 6M instruction dataset spanning 39 languages.
Pangea significantly outperforms existing open-source models in multilingual settings and diverse cultural contexts.
We fully open-source our data, code, and trained checkpoints, to facilitate the development of inclusive and robust multilingual MLLMs.
arXiv Detail & Related papers (2024-10-21T16:19:41Z)
- Multilingual Text Representation [3.4447129363520337]
Modern NLP breakthroughs include large multilingual models capable of performing tasks across more than 100 languages.
State-of-the-art language models have come a long way, starting from the simple one-hot representation of words.
We discuss how the full potential of language democratization could be obtained, reaching beyond the known limits.
arXiv Detail & Related papers (2023-09-02T14:21:22Z)
- Lexical Diversity in Kinship Across Languages and Dialects [6.80465507148218]
We introduce a method to enrich computational lexicons with content relating to linguistic diversity.
The method is verified through two large-scale case studies on kinship terminology.
arXiv Detail & Related papers (2023-08-24T19:49:30Z)
- Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
- One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia [60.87739250251769]
We provide an overview of the current state of NLP research for Indonesia's 700+ languages.
We highlight challenges in Indonesian NLP and how these affect the performance of current NLP systems.
arXiv Detail & Related papers (2022-03-24T22:07:22Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and call each group a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- To What Degree Can Language Borders Be Blurred In BERT-based Multilingual Spoken Language Understanding? [7.245261469258502]
We show that although a BERT-based multilingual Spoken Language Understanding (SLU) model works well even on distant language groups, there is still a gap to the ideal multilingual performance.
We propose a novel BERT-based adversarial model architecture to learn language-shared and language-specific representations for multilingual SLU.
arXiv Detail & Related papers (2020-11-10T09:59:24Z)
- CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot Cross-Lingual NLP [68.2650714613869]
We propose a data augmentation framework to generate multi-lingual code-switching data to fine-tune mBERT.
Compared with existing work, our method does not rely on bilingual sentences for training and requires only one training process for multiple target languages.
arXiv Detail & Related papers (2020-06-11T13:15:59Z)
- Finding Universal Grammatical Relations in Multilingual BERT [47.74015366712623]
We show that subspaces of mBERT representations recover syntactic tree distances in languages other than English.
We present an unsupervised analysis method that provides evidence that mBERT learns representations of syntactic dependency labels.
arXiv Detail & Related papers (2020-05-09T20:46:02Z)
- The State and Fate of Linguistic Diversity and Inclusion in the NLP World [12.936270946393483]
Language technologies contribute to promoting multilingualism and linguistic diversity around the world.
Only a very small number of the over 7000 languages of the world are represented in the rapidly evolving language technologies and applications.
arXiv Detail & Related papers (2020-04-20T07:19:22Z)