Bilingual Word Level Language Identification for Omotic Languages
- URL: http://arxiv.org/abs/2509.07998v1
- Date: Fri, 05 Sep 2025 23:36:26 GMT
- Title: Bilingual Word Level Language Identification for Omotic Languages
- Authors: Mesay Gemeda Yigezu, Girma Yohannis Bade, Atnafu Lambebo Tonja, Olga Kolesnikova, Grigori Sidorov, Alexander Gelbukh
- Abstract summary: This paper presents Bilingual Language Identification (BLID) for languages spoken in the southern part of Ethiopia, namely Wolaita and Gofa. To overcome the challenge posed by similarities between the two languages, we experimented with several approaches. The combination of a BERT-based pretrained language model and an LSTM performed best, with an F1 score of 0.72 on the test set.
- Score: 44.04646981451376
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Language identification is the task of determining the language(s) of a given text. In many real-world scenarios, text may contain more than one language, particularly in multilingual communities. Bilingual Language Identification (BLID) is the task of identifying and distinguishing between two languages in a given text. This paper presents BLID for languages spoken in the southern part of Ethiopia, namely Wolaita and Gofa. The similarities and differences between words in the two languages make the identification task challenging. To overcome this challenge, we experimented with several approaches. The combination of a BERT-based pretrained language model and an LSTM performed best, with an F1 score of 0.72 on the test set. The work can help tackle unwanted social media content and provides a foundation for further research in this area.
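The abstract describes the best-performing system only at a high level: a BERT-based pretrained encoder combined with an LSTM, evaluated at the word level. The sketch below illustrates one plausible reading of that combination, in which BERT's contextual subword embeddings feed a BiLSTM token classifier over two language labels. The checkpoint name, hidden sizes, label set, and example input are illustrative assumptions, not the authors' released configuration.

```python
# Minimal sketch (not the authors' code) of a BERT + LSTM word-level
# language identifier for a Wolaita/Gofa label set.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

LABELS = ["wolaita", "gofa"]  # assumed binary word-level label set


class BertLstmLid(nn.Module):
    def __init__(self,
                 encoder_name: str = "bert-base-multilingual-cased",  # assumed checkpoint
                 lstm_hidden: int = 256):                             # assumed hidden size
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.lstm = nn.LSTM(self.encoder.config.hidden_size, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, len(LABELS))

    def forward(self, input_ids, attention_mask):
        # Contextual subword representations from the pretrained encoder.
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # BiLSTM re-encodes the sequence before per-token classification.
        lstm_out, _ = self.lstm(hidden)
        return self.classifier(lstm_out)  # (batch, seq_len, num_labels)


tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertLstmLid()
# Placeholder input; real data would be mixed Wolaita/Gofa text.
batch = tokenizer(["placeholder mixed-language sentence"], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
predictions = logits.argmax(dim=-1)  # one language id per subword token
```

Because BERT tokenizes into subwords, these per-token predictions would still need to be aggregated back to whole words (for example, by taking the label of each word's first subword) to produce word-level language tags.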
Related papers
- ILID: Native Script Language Identification for Indian Languages [0.0]
The core challenge of language identification lies in distinguishing languages in noisy, short, and code-mixed environments. We release a dataset of 250K sentences covering 23 languages, including English and all 22 official Indian languages, labeled with their language identifiers. Our models outperform state-of-the-art pre-trained transformer models on the language identification task.
arXiv Detail & Related papers (2025-07-16T01:39:32Z) - BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages [93.92804151830744]
We present BRIGHTER, a collection of multi-labeled, emotion-annotated datasets in 28 different languages. We highlight the challenges related to the data collection and annotation processes. We show that the BRIGHTER datasets represent a meaningful step towards addressing the gap in text-based emotion recognition.
arXiv Detail & Related papers (2025-02-17T15:39:50Z) - How does a Multilingual LM Handle Multiple Languages? [0.0]
This study critically examines capabilities in multilingual understanding, semantic representation, and cross-lingual knowledge transfer. It assesses semantic similarity by analyzing multilingual word embeddings for consistency using cosine similarity. It examines BLOOM-1.7B and Qwen2 through Named Entity Recognition and sentence similarity tasks to understand their linguistic structures.
arXiv Detail & Related papers (2025-02-06T18:08:14Z) - Decomposed Prompting: Probing Multilingual Linguistic Structure Knowledge in Large Language Models [54.58989938395976]
We introduce a decomposed prompting approach for sequence labeling tasks. We test our method on the Universal Dependencies part-of-speech tagging dataset for 38 languages.
arXiv Detail & Related papers (2024-02-28T15:15:39Z) - Cross-Lingual Ability of Multilingual Masked Language Models: A Study of Language Structure [54.01613740115601]
We study three language properties: constituent order, composition and word co-occurrence.
Our main conclusion is that the contribution of constituent order and word co-occurrence is limited, while composition is more crucial to the success of cross-lingual transfer.
arXiv Detail & Related papers (2022-03-16T07:09:35Z) - AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z) - A multilabel approach to morphosyntactic probing [3.0013352260516744]
We show that multilingual BERT renders many morphosyntactic features easily and simultaneously extractable.
We evaluate the probes on six "held-out" languages in a zero-shot transfer setting.
arXiv Detail & Related papers (2021-04-17T06:24:04Z) - To What Degree Can Language Borders Be Blurred In BERT-based
Multilingual Spoken Language Understanding? [7.245261469258502]
We show that although a BERT-based multilingual Spoken Language Understanding (SLU) model performs well even on distant language groups, there is still a gap to ideal multilingual performance.
We propose a novel BERT-based adversarial model architecture to learn language-shared and language-specific representations for multilingual SLU.
arXiv Detail & Related papers (2020-11-10T09:59:24Z) - A Study of Cross-Lingual Ability and Language-specific Information in
Multilingual BERT [60.9051207862378]
Multilingual BERT works remarkably well on cross-lingual transfer tasks.
Data size and context window size are crucial factors for transferability.
There is a computationally cheap but effective approach to improve the cross-lingual ability of multilingual BERT.
arXiv Detail & Related papers (2020-04-20T11:13:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.