A Comprehensive Benchmark of Language Models on Unicode and Romanized Sinhala
- URL: http://arxiv.org/abs/2601.14958v1
- Date: Wed, 21 Jan 2026 12:58:46 GMT
- Title: A Comprehensive Benchmark of Language Models on Unicode and Romanized Sinhala
- Authors: Minuri Rajapakse, Ruvan Weerasinghe
- Abstract summary: This paper presents a benchmark of modern Language Models (LMs) on a diverse corpus of Unicode and Romanized Sinhala. We evaluate open-source models using perplexity, a measure of how well a model predicts a text, and leading closed-source models via a qualitative analysis of sentence completion.
- Score: 0.2864713389096699
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The performance of Language Models (LMs) on lower-resource, morphologically rich languages like Sinhala remains under-explored, particularly for Romanized Sinhala, which is prevalent in digital communication. This paper presents a comprehensive benchmark of modern LMs on a diverse corpus of Unicode and Romanized Sinhala. We evaluate open-source models using perplexity, a measure of how well a model predicts a text, and leading closed-source models via a qualitative analysis of sentence completion. Our findings reveal that Mistral-Nemo-Base-2407 achieves the strongest predictive performance on Unicode text, while Mistral-7B-v0.3 is strongest on Romanized text. The results also highlight the strong all-around performance of Llama-3.1-8B across both scripts. Furthermore, a significant performance disparity exists among closed-source models: Gemini-1.5-pro and DeepSeek excel at Unicode generation, whereas Claude-3.5-Sonnet is superior at handling Romanized text. These results provide an essential guide for practitioners selecting models for Sinhala-specific applications and highlight the critical role of training data in handling script variations.
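As a concrete illustration of the perplexity metric used for the open-source evaluation, the sketch below scores sample texts with the Hugging Face transformers library. This is a minimal sketch, not the paper's evaluation pipeline: the Sinhala example sentences are placeholders, and the model name is simply one of the open models cited in the abstract.

```python
# Minimal perplexity sketch (illustrative only, not the paper's pipeline).
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-v0.3"  # one of the open models benchmarked

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
model.eval()

def perplexity(text: str) -> float:
    """Return exp(mean negative log-likelihood) of `text` under the model."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy
        # loss over the sequence; perplexity is its exponential.
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

# Placeholder sentences: Unicode Sinhala and a romanized rendering.
print(perplexity("මම පොත කියවමි"))        # Unicode script
print(perplexity("mama potha kiyawami"))  # Romanized
```

Lower perplexity means the model assigns higher probability to the text, which is the basis on which the open-source models are ranked for each script.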
Related papers
- Towards Open Foundation Language Model and Corpus for Macedonian: A Low-Resource Language [4.276396344868335]
We create resources to facilitate the adoption of Large Language Models (LLMs). We collect the largest Macedonian corpus to date, consisting of 40GB of textual data and totaling 3.5B words. We train domestic-yak, a state-of-the-art 8B-parameter model, on our curated datasets and evaluate it against eight baseline models.
arXiv Detail & Related papers (2025-06-11T09:46:58Z)
- Subasa -- Adapting Language Models for Low-resourced Offensive Language Detection in Sinhala [9.298909305675595]
We introduce four models, including "Subasa-XLM-R", which incorporates an intermediate Pre-Finetuning step using Masked Rationale Prediction. We evaluate our models on the SOLD benchmark dataset for Sinhala offensive language detection.
arXiv Detail & Related papers (2025-04-02T23:46:49Z)
- LLMic: Romanian Foundation Language Model [76.09455151754062]
We present LLMic, a foundation language model designed specifically for the Romanian language. We show that fine-tuning LLMic for language translation after the initial pretraining phase outperforms existing solutions in English-to-Romanian translation tasks.
arXiv Detail & Related papers (2025-01-13T22:14:45Z)
- TEncDM: Understanding the Properties of the Diffusion Model in the Space of Language Model Encodings [35.18238858796925]
TEncDM is a novel approach to diffusion modeling that operates in the space of pre-trained language model encodings. In our approach, we also employ a transformer-based decoder, specifically designed to incorporate context in the token prediction process.
arXiv Detail & Related papers (2024-02-29T12:25:45Z)
- GlórIA - A Generative and Open Large Language Model for Portuguese [4.782288068552145]
We introduce GlórIA, a robust European Portuguese decoder LLM.
To pre-train GlórIA, we assembled a comprehensive PT-PT text corpus comprising 35 billion tokens from various sources.
Evaluation shows that GlórIA significantly outperforms existing open PT decoder models in language modeling.
arXiv Detail & Related papers (2024-02-20T12:36:40Z)
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
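Distribution-level precision and recall of this kind are commonly estimated with k-nearest-neighbour support sets over text embeddings, in the style of Kynkäänniemi et al. (2019). The sketch below illustrates that general idea only and is not necessarily the paper's exact estimator; the toy Gaussian "embeddings" stand in for real text embeddings.

```python
# Sketch of embedding-based precision/recall for generated text
# (k-NN manifold estimation; illustrative, not the paper's estimator).
import numpy as np

def knn_radii(points: np.ndarray, k: int = 3) -> np.ndarray:
    """Distance from each point to its k-th nearest neighbour (excluding itself)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, k]  # column 0 is the point itself (distance 0)

def coverage(queries: np.ndarray, support: np.ndarray, radii: np.ndarray) -> float:
    """Fraction of `queries` landing inside some support point's k-NN ball."""
    d = np.linalg.norm(queries[:, None, :] - support[None, :, :], axis=-1)
    return float((d <= radii[None, :]).any(axis=1).mean())

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 8))       # embeddings of reference texts (toy data)
generated = rng.normal(size=(200, 8))  # embeddings of model generations (toy data)

precision = coverage(generated, real, knn_radii(real))    # quality
recall = coverage(real, generated, knn_radii(generated))  # diversity
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under this framing, precision measures how much of the generated text falls within the support of the reference distribution (quality), while recall measures how much of the reference distribution the generations cover (diversity).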
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research [139.69207791947738]
Dolma is a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials.
We document Dolma, including its design principles, details about its construction, and a summary of its contents.
We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices.
arXiv Detail & Related papers (2024-01-31T20:29:50Z)
- Text Summarization Using Large Language Models: A Comparative Study of MPT-7b-instruct, Falcon-7b-instruct, and OpenAI Chat-GPT Models [0.0]
Leveraging Large Language Models (LLMs) has shown remarkable promise in enhancing summarization techniques.
This paper explores text summarization with a diverse set of LLMs, including the MPT-7b-instruct, Falcon-7b-instruct, and OpenAI ChatGPT text-davinci-003 models.
arXiv Detail & Related papers (2023-10-16T14:33:02Z)
- An Open Dataset and Model for Language Identification [84.15194457400253]
We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages.
We make both the model and the dataset available to the research community.
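Macro-averaged F1, as reported here, weights every language equally regardless of how often it occurs. A minimal sketch with scikit-learn (the labels below are invented toy data, not the paper's):

```python
# Toy illustration of macro-averaged F1 and per-class false positive rate
# for language identification; the labels are invented, not the paper's data.
from sklearn.metrics import confusion_matrix, f1_score

y_true = ["sin", "eng", "sin", "tam", "eng", "sin"]
y_pred = ["sin", "eng", "eng", "tam", "eng", "sin"]

# Macro-average: compute F1 per language, then take the unweighted mean,
# so rare languages count as much as frequent ones.
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))

# False positive rate per language: FP / (FP + TN).
labels = sorted(set(y_true))
cm = confusion_matrix(y_true, y_pred, labels=labels)  # rows: true, cols: pred
for i, lang in enumerate(labels):
    fp = cm[:, i].sum() - cm[i, i]
    tn = cm.sum() - cm[:, i].sum() - cm[i, :].sum() + cm[i, i]
    print(lang, "FPR:", fp / (fp + tn))
```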
arXiv Detail & Related papers (2023-05-23T08:43:42Z)
- mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z)
- ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Spotting [121.11880210592497]
We argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) a language model with noisy input.
We propose ABINet++, an autonomous, bidirectional and iterative language model for scene text spotting.
arXiv Detail & Related papers (2022-11-19T03:50:33Z)
- Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training [86.91380874390778]
We present Generation-Augmented Pre-training (GAP), which jointly learns representations of natural language utterances and table schemas by leveraging generation models to generate pre-training data.
Experimental results show that neural semantic parsers that leverage GAP obtain new state-of-the-art results on both the SPIDER and CRITERIA-TO-SQL benchmarks.
arXiv Detail & Related papers (2020-12-18T15:53:50Z)
- The birth of Romanian BERT [1.377045689881944]
This paper introduces Romanian BERT, the first purely Romanian transformer-based language model, pretrained on a large text corpus.
We discuss corpus composition and cleaning, the model training process, as well as an extensive evaluation of the model on various Romanian datasets.
arXiv Detail & Related papers (2020-09-18T09:30:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.