chDzDT: Word-level morphology-aware language model for Algerian social media text
- URL: http://arxiv.org/abs/2509.01772v1
- Date: Mon, 01 Sep 2025 21:09:55 GMT
- Title: chDzDT: Word-level morphology-aware language model for Algerian social media text
- Authors: Abdelkrime Aries
- Abstract summary: chDzDT is a character-level pre-trained language model tailored for Algerian morphology. It is trained on isolated words without depending on token boundaries or standardized orthography. It covers multiple scripts and linguistic varieties, resulting in a substantial pre-training workload.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Pre-trained language models (PLMs) have substantially advanced natural language processing by providing context-sensitive text representations. However, the Algerian dialect remains under-represented, with few dedicated models available. Processing this dialect is challenging due to its complex morphology, frequent code-switching, multiple scripts, and strong lexical influences from other languages. These characteristics complicate tokenization and reduce the effectiveness of conventional word- or subword-level approaches. To address this gap, we introduce chDzDT, a character-level pre-trained language model tailored for Algerian morphology. Unlike conventional PLMs that rely on token sequences, chDzDT is trained on isolated words. This design allows the model to encode morphological patterns robustly, without depending on token boundaries or standardized orthography. The training corpus draws from diverse sources, including YouTube comments, French, English, and Berber Wikipedia, as well as the Tatoeba project. It covers multiple scripts and linguistic varieties, resulting in a substantial pre-training workload. Our contributions are threefold: (i) a detailed morphological analysis of Algerian dialect using YouTube comments; (ii) the construction of a multilingual Algerian lexicon dataset; and (iii) the development and extensive evaluation of a character-level PLM as a morphology-focused encoder for downstream tasks. The proposed approach demonstrates the potential of character-level modeling for morphologically rich, low-resource dialects and lays a foundation for more inclusive and adaptable NLP systems.
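To make the isolated-word, character-level design concrete, here is a minimal sketch of a word encoder in that spirit. The class name, dimensions, code-point hashing, and mean pooling are all illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of a character-level word encoder in the spirit of chDzDT.
# The class name, dimensions, and character hashing are illustrative
# assumptions; the paper's actual architecture may differ.
import torch
import torch.nn as nn

class CharWordEncoder(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 128, max_len: int = 32):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, max_len) integer codes for one word per row;
        # each word is encoded in isolation, so no token boundaries are needed.
        pos = torch.arange(char_ids.size(1), device=char_ids.device)
        h = self.char_emb(char_ids) + self.pos_emb(pos)
        h = self.encoder(h, src_key_padding_mask=char_ids.eq(0))
        return h.mean(dim=1)  # mean-pool to one vector per word (pads included; fine for a sketch)

def word_to_ids(word: str, max_len: int = 32, vocab_size: int = 1024) -> torch.Tensor:
    # Hash raw Unicode code points into a small id space, so Arabic-script,
    # Latin-script (Arabizi), and Tifinagh spellings share one model.
    ids = [(ord(c) % (vocab_size - 1)) + 1 for c in word[:max_len]]
    ids += [0] * (max_len - len(ids))
    return torch.tensor([ids])

encoder = CharWordEncoder(vocab_size=1024)
vec = encoder(word_to_ids("mli7"))  # an Arabizi spelling of "good"
print(vec.shape)  # torch.Size([1, 128])
```

Because the encoder only ever sees characters, no shared subword vocabulary is needed across scripts or spelling conventions.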
Related papers
- Beyond Subtokens: A Rich Character Embedding for Low-resource and Morphologically Complex Languages [5.338837380875301]
Tokenization- and subword-based models such as word2vec, BERT, and the GPT family are the state of the art in natural language processing. We propose to compute word vectors directly from character strings, integrating both semantic and syntactic information. This approach has the potential to improve performance for both large context-based language models like BERT and small models like word2vec in under-resourced and morphologically rich languages.
arXiv Detail & Related papers (2026-02-24T21:16:08Z)
- Modelling the Morphology of Verbal Paradigms: A Case Study in the Tokenization of Turkish and Hebrew [1.0857263744676489]
We investigate how transformer models represent complex verb paradigms in Turkish and Modern Hebrew. We show that for Turkish, both monolingual and multilingual models succeed, either when tokenization is atomic or when it breaks words into small subword units. For Hebrew, by contrast, monolingual and multilingual models diverge.
arXiv Detail & Related papers (2026-02-05T13:31:21Z)
- On The Landscape of Spoken Language Models: A Comprehensive Survey [144.11278973534203]
Spoken language models (SLMs) act as universal speech processing systems. Work in this area is very diverse, with a range of terminology and evaluation settings.
arXiv Detail & Related papers (2025-04-11T13:40:53Z)
- Resource-Aware Arabic LLM Creation: Model Adaptation, Integration, and Multi-Domain Testing [0.0]
This paper presents a novel approach to fine-tuning the Qwen2-1.5B model for Arabic language processing using Quantized Low-Rank Adaptation (QLoRA) on a system with only 4GB VRAM. We detail the process of adapting this large language model to the Arabic domain, using diverse datasets including Bactrian, OpenAssistant, and Wikipedia Arabic corpora. Experimental results over 10,000 training steps show significant performance improvements, with the final loss converging to 0.1083.
arXiv Detail & Related papers (2024-12-23T13:08:48Z)
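As a rough illustration of the QLoRA recipe this entry describes, the sketch below loads the base model in 4-bit precision and attaches trainable low-rank adapters with Hugging Face transformers and peft. The model id is the public Qwen2-1.5B checkpoint; the LoRA hyperparameters and target modules are assumptions, not the paper's reported configuration.

```python
# Minimal QLoRA sketch: a 4-bit quantized base model plus trainable LoRA
# adapters, which is what makes fine-tuning feasible on ~4GB of VRAM.
# Hyperparameters are illustrative, not the paper's exact settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-1.5B", quantization_config=bnb_config
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-1.5B")

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # a common choice of attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the low-rank adapters train
```

With the base weights frozen in 4-bit, only the small adapter matrices need gradients and optimizer state, which is the source of the memory savings.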
- Tokenization and Morphology in Multilingual Language Models: A Comparative Analysis of mT5 and ByT5 [4.779196219827507]
We capture the impact of tokenization by contrasting two multilingual language models: mT5 and ByT5.
Probing the morphological knowledge encoded in these models on four tasks and 17 languages, our analyses show that the models learn the morphological systems of some languages better than others.
arXiv Detail & Related papers (2024-10-15T14:14:19Z)
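The subword-versus-byte contrast at the heart of this comparison is easy to see directly. Below is a small illustrative snippet using the public checkpoints; the example word is ours, not from the paper.

```python
# Illustrative contrast between mT5's learned subword tokenizer and ByT5's
# byte-level tokenizer on a single morphologically complex word.
from transformers import AutoTokenizer

mt5 = AutoTokenizer.from_pretrained("google/mt5-small")
byt5 = AutoTokenizer.from_pretrained("google/byt5-small")

word = "evlerimizden"  # Turkish: "from our houses" (ev-ler-imiz-den)
print(mt5.tokenize(word))   # a handful of learned subword pieces
print(byt5.tokenize(word))  # one token per UTF-8 byte, no learned segmentation
```

Whether the learned segmentation happens to align with morpheme boundaries is exactly what morphological probing of the two models can reveal.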
- Explicit Morphological Knowledge Improves Pre-training of Language Models for Hebrew [19.4968960182412]
We investigate the hypothesis that incorporating explicit morphological knowledge in the pre-training phase can improve the performance of PLMs for morphologically rich languages.
We propose various morphologically driven tokenization methods enabling the model to leverage morphological cues beyond raw text.
Our experiments show that morphologically driven tokenization yields improved results compared to standard language-agnostic tokenization.
arXiv Detail & Related papers (2023-11-01T17:02:49Z)
- Post-hoc analysis of Arabic transformer models [20.741730718486032]
We probe how linguistic information is encoded in transformer models trained on different Arabic dialects.
We perform a layer and neuron analysis on the models using morphological tagging tasks for different dialects of Arabic and a dialectal identification task.
arXiv Detail & Related papers (2022-10-18T16:53:51Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
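For intuition, a minimal sketch of an LSTM language model over phoneme-like units follows; the unit inventory, sizes, and training step are invented for illustration and are not the authors' model.

```python
# Minimal sketch of an LSTM language model over phoneme-like units,
# in the spirit of the sub-word speech LM above. The unit inventory
# and dimensions are invented for illustration.
import torch
import torch.nn as nn

class UnitLM(nn.Module):
    def __init__(self, n_units: int, d: int = 64):
        super().__init__()
        self.emb = nn.Embedding(n_units, d)
        self.lstm = nn.LSTM(d, d, batch_first=True)
        self.out = nn.Linear(d, n_units)

    def forward(self, units: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(self.emb(units))
        return self.out(h)  # next-unit logits at every position

lm = UnitLM(n_units=50)              # e.g. a ~50-phoneme inventory
seq = torch.randint(0, 50, (1, 12))  # one toy phoneme sequence
logits = lm(seq[:, :-1])             # predict each next unit
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 50), seq[:, 1:].reshape(-1)
)
print(loss.item())
```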
- Revisiting Language Encoding in Learning Multilingual Representations [70.01772581545103]
We propose a new approach called Cross-lingual Language Projection (XLP) to replace language embeddings.
XLP projects the word embeddings into a language-specific semantic space, and the projected embeddings are then fed into the Transformer model.
Experiments show that XLP significantly boosts model performance on extensive multilingual benchmark datasets.
arXiv Detail & Related papers (2021-02-16T18:47:10Z)
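A toy rendering of the projection idea may help: instead of adding a language embedding, each language gets its own projection applied to the word embeddings before the Transformer. The shapes and indexing below are assumptions for illustration, not the authors' code.

```python
# Toy sketch of language-specific projection of word embeddings before
# the Transformer, in the spirit of XLP. Shapes and the way languages
# are indexed are assumptions for illustration.
import torch
import torch.nn as nn

n_langs, vocab, d = 3, 1000, 32
word_emb = nn.Embedding(vocab, d)
# One projection matrix per language (initialized to identity here),
# instead of an added language embedding vector.
lang_proj = nn.Parameter(torch.stack([torch.eye(d) for _ in range(n_langs)]))

tokens = torch.randint(0, vocab, (2, 5))  # (batch, seq)
lang_ids = torch.tensor([0, 2])           # language of each sequence

x = word_emb(tokens)                         # (batch, seq, d)
P = lang_proj[lang_ids]                      # (batch, d, d)
x_proj = torch.einsum("bij,bsj->bsi", P, x)  # project into language space
# x_proj would then be fed into a standard Transformer encoder
print(x_proj.shape)  # torch.Size([2, 5, 32])
```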
- What does it mean to be language-agnostic? Probing multilingual sentence encoders for typological properties [17.404220737977738]
We propose methods for probing sentence representations from state-of-the-art multilingual encoders.
Our results show interesting differences in encoding linguistic variation associated with different pretraining strategies.
arXiv Detail & Related papers (2020-09-27T15:00:52Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed afterwards, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
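A crude sketch of the compositional-output idea: the output embedding of a word is composed from its characters rather than looked up in a fixed table, so the output layer's size no longer depends on the training vocabulary. Everything below (names, sizes, the GRU composer) is an illustrative assumption.

```python
# Crude sketch of a compositional output layer: the output embedding of a
# word is composed from its characters instead of being stored in a fixed
# vocabulary table, so model size no longer grows with the vocabulary.
import torch
import torch.nn as nn

d, n_chars = 32, 128
char_emb = nn.Embedding(n_chars, d)
composer = nn.GRU(d, d, batch_first=True)

def output_embedding(word: str) -> torch.Tensor:
    """Compose an output embedding for any word from its characters."""
    ids = torch.tensor([[ord(c) % n_chars for c in word]])
    _, h = composer(char_emb(ids))
    return h[-1, 0]  # final hidden state, shape (d,)

# Score candidate next words against a hidden state, with no fixed vocab:
hidden = torch.randn(d)
candidates = ["walk", "walked", "walking", "rewalked"]  # even unseen forms
scores = torch.stack([output_embedding(w) @ hidden for w in candidates])
print(scores.softmax(dim=0))
```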
- A Simple Joint Model for Improved Contextual Neural Lemmatization [60.802451210656805]
We present a simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages.
Our paper describes the model in addition to training and decoding procedures.
arXiv Detail & Related papers (2019-04-04T02:03:19Z)
- Cross-lingual, Character-Level Neural Morphological Tagging [57.0020906265213]
We train character-level recurrent neural taggers to predict morphological tags for high-resource and low-resource languages together. Learning joint character representations across multiple related languages successfully enables knowledge transfer from the high-resource languages to the low-resource ones, improving accuracy by up to 30% over a monolingual model.
arXiv Detail & Related papers (2017-08-30T08:14:34Z)
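A minimal sketch of a character-level recurrent tagger with representations shared across languages follows; the BiLSTM, tagset size, and mixed-batch training note are illustrative assumptions rather than the paper's exact setup.

```python
# Minimal sketch of a character-level BiLSTM morphological tagger whose
# character representations are shared across related languages. The
# tagset, sizes, and sharing scheme are illustrative assumptions.
import torch
import torch.nn as nn

class CharTagger(nn.Module):
    def __init__(self, n_chars: int, n_tags: int, d: int = 64):
        super().__init__()
        self.emb = nn.Embedding(n_chars, d)  # shared across languages
        self.rnn = nn.LSTM(d, d, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * d, n_tags)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(self.emb(char_ids))
        return self.out(h.mean(dim=1))  # one morphological tag per word

tagger = CharTagger(n_chars=256, n_tags=40)
# Training on mixed batches from a high- and a low-resource language lets
# the shared character embeddings transfer knowledge between them.
batch = torch.randint(0, 256, (8, 12))  # 8 words, 12 characters each
print(tagger(batch).shape)  # torch.Size([8, 40])
```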
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.