GlotScript: A Resource and Tool for Low Resource Writing System Identification
- URL: http://arxiv.org/abs/2309.13320v2
- Date: Wed, 27 Mar 2024 14:57:29 GMT
- Title: GlotScript: A Resource and Tool for Low Resource Writing System Identification
- Authors: Amir Hossein Kargaran, François Yvon, Hinrich Schütze
- Abstract summary: GlotScript is an open resource for low resource writing system identification.
GlotScript-R provides attested writing systems for more than 7,000 languages.
GlotScript-T is a writing system identification tool that covers all 161 Unicode 15.0 scripts.
- Score: 53.56700754408902
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present GlotScript, an open resource and tool for low resource writing system identification. GlotScript-R is a resource that provides the attested writing systems for more than 7,000 languages. It is compiled by aggregating information from existing writing system resources. GlotScript-T is a writing system identification tool that covers all 161 Unicode 15.0 scripts. For an input text, it returns its script distribution where scripts are identified by ISO 15924 codes. We also present two use cases for GlotScript. First, we demonstrate that GlotScript can help clean multilingual corpora such as mC4 and OSCAR. Second, we analyze the tokenization of a number of language models such as GPT-4 using GlotScript and provide insights into the coverage of low resource scripts and languages by each language model. We hope that GlotScript will become a useful resource for work on low resource languages in the NLP community. GlotScript-R and GlotScript-T are available at https://github.com/cisnlp/GlotScript.
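The core operation of GlotScript-T described in the abstract is mapping an input text to a script distribution keyed by ISO 15924 codes. The snippet below is a minimal sketch of that idea only, under stated assumptions: it hard-codes a handful of well-known Unicode blocks rather than all 161 Unicode 15.0 scripts, and the range table and function name are illustrative, not GlotScript's actual API (the real tool is available from the repository linked above).

from collections import Counter

# Illustrative (start, end, ISO 15924 code) ranges for a few well-known
# Unicode blocks; GlotScript-T itself covers all 161 Unicode 15.0 scripts.
SCRIPT_RANGES = [
    (0x0041, 0x024F, "Latn"),   # Basic Latin + Latin-1 + Latin Extended-A/B
    (0x0370, 0x03FF, "Grek"),   # Greek and Coptic
    (0x0400, 0x04FF, "Cyrl"),   # Cyrillic
    (0x0600, 0x06FF, "Arab"),   # Arabic
    (0x0900, 0x097F, "Deva"),   # Devanagari
    (0x4E00, 0x9FFF, "Hani"),   # CJK Unified Ideographs
]

def script_distribution(text):
    """Share of each recognized script among the alphabetic characters of text."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        cp = ord(ch)
        for start, end, code in SCRIPT_RANGES:
            if start <= cp <= end:
                counts[code] += 1
                break
    total = sum(counts.values())
    return {code: n / total for code, n in counts.most_common()} if total else {}

if __name__ == "__main__":
    print(script_distribution("Hello Привет नमस्ते"))
    # -> shares for 'Cyrl', 'Latn' and 'Deva' that sum to 1.0

For the corpus-cleaning use case mentioned in the abstract, such a distribution can be combined with the attested writing systems recorded in GlotScript-R: a line in mC4 or OSCAR whose dominant script is not attested for its language label is a likely candidate for removal.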
Related papers
- Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts [50.40191599304911]
We investigate whether transliteration is also effective in improving LLMs' performance for low-resource languages written in non-Latin scripts.
We propose three prompt templates, where the target-language text is represented in (1) its original script, (2) Latin script, or (3) both.
Our findings show that the effectiveness of transliteration varies by task type and model size.
arXiv Detail & Related papers (2024-07-02T14:51:20Z)
- Script-Agnostic Language Identification [21.19710835737713]
Many modern languages, such as Konkani, Kashmiri, and Punjabi, are synchronically written in several scripts.
We propose learning script-agnostic representations using several different experimental strategies.
We find that word-level script randomization and exposure to a language written in multiple scripts are extremely valuable for downstream script-agnostic language identification (a minimal sketch of word-level script randomization follows this entry).
arXiv Detail & Related papers (2024-06-25T19:23:42Z)
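The "word-level script randomization" mentioned above can be pictured as a data-augmentation step in which each word is independently either kept in its original script or replaced by a transliteration, so that a language identifier cannot rely on the script alone. The sketch below illustrates that general idea using the third-party unidecode romanizer; it is not the paper's exact procedure, and the function name and probability parameter are illustrative.

import random
from unidecode import unidecode  # third-party: pip install Unidecode

def randomize_scripts(sentence, p_romanize=0.5, seed=0):
    """Independently romanize each word with probability p_romanize."""
    rng = random.Random(seed)
    words = sentence.split()
    return " ".join(unidecode(w) if rng.random() < p_romanize else w for w in words)

if __name__ == "__main__":
    print(randomize_scripts("मेरा नाम क्या है", p_romanize=0.5, seed=0))
    # -> a mix of original-script and romanized (Latin-script) words

Any transliteration scheme that maps between the scripts a language is actually written in would serve the same purpose as the romanizer used here.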
- TransliCo: A Contrastive Learning Framework to Address the Script Barrier in Multilingual Pretrained Language Models [50.40191599304911]
We propose TransliCo to fine-tune an mPLM by contrasting sentences in its training data and their transliterations in a unified script.
We show that Furina, the resulting model, outperforms the original Glot500-m on various zero-shot cross-lingual transfer tasks.
arXiv Detail & Related papers (2024-01-12T15:12:48Z)
- GlotLID: Language Identification for Low-Resource Languages [51.38634652914054]
GlotLID-M is an LID model that satisfies the desiderata of wide coverage, reliability and efficiency.
It identifies 1665 languages, a large increase in coverage compared to prior work.
arXiv Detail & Related papers (2023-10-24T23:45:57Z)
- Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper [96.43501666278316]
This paper proposes a powerful Visual Speech Recognition (VSR) method for multiple languages.
We employ a Whisper model which can conduct both language identification and audio-based speech recognition.
By comparing VSR models trained on the automatic labels with models trained on human-annotated labels, we show that the automatic labels achieve similar VSR performance.
arXiv Detail & Related papers (2023-09-15T16:53:01Z)
- GAS-Net: Generative Artistic Style Neural Networks for Fonts [8.569974263629218]
This project aims to develop a few-shot cross-lingual font generator based on AGIS-Net.
Our approaches include redesigning the encoder and the loss function.
We validate our method on multiple languages and datasets.
arXiv Detail & Related papers (2022-12-06T11:23:16Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Handwritten Script Identification from Text Lines [38.1188690493442]
We propose a robust method for identifying scripts from handwritten documents at the text-line level.
The recognition is based upon features extracted using the Chain Code Histogram (CCH) and the Discrete Fourier Transform (DFT); a generic sketch of these two feature types follows this entry.
The proposed method is evaluated on 800 handwritten text lines written in seven Indic scripts, namely Gujarati, Kannada, Malayalam, Oriya, Tamil, Telugu, and Urdu, along with the Roman script.
arXiv Detail & Related papers (2020-09-16T02:43:24Z)
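For the entry above, a Chain Code Histogram summarizes a character or text-line contour by the distribution of its 8 Freeman step directions, and the DFT of that histogram gives a compact descriptor whose magnitudes are invariant to circular shifts of the histogram. The sketch below is a generic illustration of these two feature types under the assumption that a contour is given as an ordered list of (x, y) points; it is not the paper's exact feature-extraction pipeline, and all names are illustrative.

import numpy as np

# Freeman chain code: index 0..7 for the 8 neighbour directions (dx, dy).
DIRECTIONS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]
DIR_INDEX = {d: i for i, d in enumerate(DIRECTIONS)}

def chain_code_histogram(contour):
    """Normalized histogram of chain-code directions along a closed contour."""
    hist = np.zeros(8)
    for (x0, y0), (x1, y1) in zip(contour, contour[1:] + contour[:1]):
        step = (int(np.sign(x1 - x0)), int(np.sign(y1 - y0)))
        if step in DIR_INDEX:           # ignore zero-length steps
            hist[DIR_INDEX[step]] += 1
    total = hist.sum()
    return hist / total if total else hist

def dft_features(hist, n_coeffs=4):
    """Magnitudes of the first few DFT coefficients of the histogram."""
    return np.abs(np.fft.fft(hist))[:n_coeffs]

if __name__ == "__main__":
    square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
    h = chain_code_histogram(square)
    print(h, dft_features(h))

Feeding such per-line feature vectors to an off-the-shelf classifier is one plausible way to build a line-level script identifier; the classifier and preprocessing used in the paper may differ.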
This list is automatically generated from the titles and abstracts of the papers on this site.