Related papers: Tokenizations for Austronesian Language Models: study on languages in Indonesia Archipelago

Tokenizations for Austronesian Language Models: study on languages in Indonesia Archipelago

URL: http://arxiv.org/abs/2602.06998v1
Date: Wed, 28 Jan 2026 14:02:19 GMT
Title: Tokenizations for Austronesian Language Models: study on languages in Indonesia Archipelago
Authors: Andhika Bernard Lumbantobing, Hokky Situngkir,
Abstract summary: This study aimed to develop a syllable-based tokenization framework adopting principles from traditional Indonesian scripts (aksara) for regional languages of Indonesia.<n> Evaluation was conducted on the NusaX dataset comprising 1,000 parallel translation samples across 10 regional languages, Indonesian, and English.<n>Results demonstrated that syllable-based tokenization yielded consistent TPC values across all regional languages, whereas GPT-2 exhibited an inverse pattern with the lowest TPC for English.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Tokenization constitutes a fundamental stage in Large Language Model (LLM) processing; however, subword-based tokenization methods optimized on English-dominant corpora may produce token fragmentation misaligned with the linguistic structures of Austronesian languages. This study aimed to develop a syllable-based tokenization framework adopting principles from traditional Indonesian scripts (aksara) for regional languages of Indonesia. A syllabic segmentation procedure was constructed based on the logic of abugida writing systems and implemented with a vocabulary of 2,843 tokens extracted from the Indonesian dictionary (KBBI). Evaluation was conducted on the NusaX dataset comprising 1,000 parallel translation samples across 10 regional languages, Indonesian, and English. Analysis employed Token per Character (TPC) ratio and sequence alignment using the Smith-Waterman algorithm. Results demonstrated that syllable-based tokenization yielded consistent TPC values across all regional languages, whereas GPT-2 exhibited an inverse pattern with the lowest TPC for English. Syllable-based tokenization consistently produced higher token sequence similarity scores, with an average increase of approximately 21% compared to GPT-2. These findings confirm that the syllable-based approach more effectively preserves phonological and morphological patterns across related Austronesian languages, offering a linguistically principled foundation for multilingual LLM development.

Related papers

Syllabic Agglutinative Tokenizations for Indonesian LLM: A Study from Gasing Literacy Learning System [0.0]
This paper presents a novel syllable-based tokenization approach for Indonesian large language models.<n>We develop a tokenization framework that segments Indonesian text at syllable boundaries before applying byte-pair encoding.<n>Our approach first identifies high-frequency syllables through rule-based segmentation, then constructs a compact vocabulary of 3,500 tokens.
arXiv Detail & Related papers (2026-01-14T17:47:24Z)
Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency [6.943451388015595]
Tokenization disparities pose a significant barrier to achieving equitable access to artificial intelligence.<n>This study conducts a large-scale cross-linguistic evaluation of tokenization efficiency in over 200 languages.
arXiv Detail & Related papers (2025-10-14T11:14:38Z)
Tokens with Meaning: A Hybrid Tokenization Approach for NLP [0.2826977330147589]
Tokenization plays a pivotal role in natural language processing (NLP)<n>We introduce a hybrid tokenization framework that combines rule-based morphological analysis with statistical subword segmentation.<n>The method uses phono normalization, root-affix, and a novel algorithm that balances morpheme preservation with vocabulary efficiency.
arXiv Detail & Related papers (2025-08-19T22:17:42Z)
Doğal Dil İşlemede Tokenizasyon Standartları ve Ölçümü: Türkçe Üzerinden Büyük Dil Modellerinin Karşılaştırmalı Analizi [0.29687381456163997]
This study introduces a novel evaluation framework addressing tokenization challenges specific to morphologically-rich and low-resource languages such as Turkish.<n>We assessed tokenizers based on vocabulary size, token count, processing time, language-specific token percentages (%TR), and token purity (%Pure)<n>Our analysis reveals that language-specific token percentages exhibit a stronger correlation with downstream performance (e.g., MMLU scores) than token purity.
arXiv Detail & Related papers (2025-08-18T16:26:42Z)
Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark [0.29687381456163997]
Tokenization is a fundamental preprocessing step in NLP, directly impacting large language models' ability to capture syntactic, morphosyntactic, and semantic structures.<n>This paper introduces a novel framework for evaluating tokenization strategies, addressing challenges in morphologically rich and low-resource languages.
arXiv Detail & Related papers (2025-02-10T21:47:49Z)
Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling [50.62091603179394]
Whisper, one of the most advanced ASR models, handles 99 languages effectively.<n>However, Whisper struggles with unseen languages, those not included in its pre-training.<n>We propose methods that exploit these relationships to enhance ASR performance on unseen languages.
arXiv Detail & Related papers (2024-12-21T04:05:43Z)
Cross-domain Chinese Sentence Pattern Parsing [67.1381983012038]
Sentence Pattern Structure (SPS) parsing is a syntactic analysis method primarily employed in language teaching. Existing SPSs rely heavily on textbook corpora for training, lacking cross-domain capability. This paper proposes an innovative approach leveraging large language models (LLMs) within a self-training framework.
arXiv Detail & Related papers (2024-02-26T05:30:48Z)
VECO 2.0: Cross-lingual Language Model Pre-training with Multi-granularity Contrastive Learning [56.47303426167584]
We propose a cross-lingual pre-trained model VECO2.0 based on contrastive learning with multi-granularity alignments. Specifically, the sequence-to-sequence alignment is induced to maximize the similarity of the parallel pairs and minimize the non-parallel pairs. token-to-token alignment is integrated to bridge the gap between synonymous tokens excavated via the thesaurus dictionary from the other unpaired tokens in a bilingual instance.
arXiv Detail & Related papers (2023-04-17T12:23:41Z)
CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual Labeled Sequence Translation [113.99145386490639]
Cross-lingual NER can transfer knowledge between languages via aligned cross-lingual representations or machine translation results. We propose a Cross-lingual Entity Projection framework (CROP) to enable zero-shot cross-lingual NER. We adopt a multilingual labeled sequence translation model to project the tagged sequence back to the target language and label the target raw sentence.
arXiv Detail & Related papers (2022-10-13T13:32:36Z)
Improving Mandarin End-to-End Speech Recognition with Word N-gram Language Model [57.92200214957124]
External language models (LMs) are used to improve the recognition performance of end-to-end (E2E) automatic speech recognition (ASR) systems. We propose a novel decoding algorithm where a word-level lattice is constructed on-the-fly to consider all possible word sequences. Our method consistently outperforms subword-level LMs, including N-gram LM and neural network LM.
arXiv Detail & Related papers (2022-01-06T10:04:56Z)
A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space. We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance. We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.