Two CFG Nahuatl for automatic corpora expansion
- URL: http://arxiv.org/abs/2512.14239v1
- Date: Tue, 16 Dec 2025 09:49:31 GMT
- Title: Two CFG Nahuatl for automatic corpora expansion
- Authors: Juan-José Guzmán-Landa, Juan-Manuel Torres-Moreno, Miguel Figueroa-Saavedra, Ligia Quintana-Torres, Graham Ranger, Martha-Lorena Avendaño-Garrido
- Abstract summary: This article introduces two Context-Free Grammars (CFG) for Nawatl Corpora expansion. Nawatl is an Amerindian language (it is a National Language of Mexico) of the $π$-language type. The goal is to produce a substantial number of syntactically valid artificial Nawatl sentences.
- Score: 0.22577070341971636
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The aim of this article is to introduce two Context-Free Grammars (CFG) for Nawatl Corpora expansion. Nawatl is an Amerindian language (it is a National Language of Mexico) of the $π$-language type, i.e. a language with few digital resources. For this reason, the corpora available for training Large Language Models (LLMs) are virtually non-existent, which poses a significant challenge. The goal is to produce a substantial number of syntactically valid artificial Nawatl sentences and thereby to expand the corpora for the purpose of learning non-contextual embeddings. To this end, we introduce two new Nawatl CFGs and use them in generative mode. Using these grammars, it is possible to expand the Nawatl corpus significantly, then to use it to learn embeddings and to evaluate their relevance in a sentence semantic similarity task. The results show an improvement over those obtained using only the original corpus without artificial expansion, and also demonstrate that economic embeddings often perform better than some LLMs.
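The abstract describes using a CFG "in generative mode" to produce artificial sentences. The following is a minimal sketch of that general idea, not the authors' grammars: the rules, the symbol names (`S`, `NP`, `VP`, `det`, `noun`, `verb`) and the placeholder terminals (`d1`, `n1`, `v1`, ...) are all assumptions made for illustration, and no real Nawatl vocabulary or syntax is implied.

```python
import itertools

# Toy grammar with placeholder terminals. This is a hypothetical example,
# NOT one of the Nawatl grammars introduced in the paper.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["det", "noun"], ["noun"]],
    "VP": [["verb"], ["verb", "NP"]],
    "det": [["d1"]],
    "noun": [["n1"], ["n2"]],
    "verb": [["v1"], ["v2"]],
}


def expand(symbol, grammar, depth=0, max_depth=6):
    """Yield every terminal sequence derivable from `symbol`.

    Symbols without a rule are treated as terminals. `max_depth` bounds
    the derivation depth so that recursive grammars still terminate.
    """
    if symbol not in grammar:
        yield [symbol]
        return
    if depth >= max_depth:
        return
    for production in grammar[symbol]:
        # Expand each right-hand-side symbol, then combine the alternatives
        # with a Cartesian product to enumerate every derivation.
        parts = [list(expand(s, grammar, depth + 1, max_depth)) for s in production]
        for combo in itertools.product(*parts):
            yield [token for part in combo for token in part]


def generate_sentences(grammar, start="S", max_depth=6):
    """Return the sorted set of distinct sentences derivable from `start`."""
    return sorted({" ".join(tokens) for tokens in expand(start, grammar, max_depth=max_depth)})


sentences = generate_sentences(GRAMMAR)
```

Exhaustive enumeration is only practical for small grammars; with a realistic lexicon one would more likely sample productions at random, which is a design choice this sketch does not cover.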
Related papers
- IASC: Interactive Agentic System for ConLangs [4.567171631759881]
We present a system that uses LLMs as a tool in the development of Constructed Languages. The system creates a target phonology for the language using an agentic approach. A lexicon is constructed using the phonological model and the set of morphemes. The system can also translate further sentences into the target language.
arXiv Detail & Related papers (2025-10-08T22:27:45Z)
- A First Context-Free Grammar Applied to Nawatl Corpora Augmentation [0.21498988090998952]
We introduce a context-free grammar (CFG) for the Nawatl language. Nawatl is an Amerindian language with few digital resources. We show that a grammar enables us to expand a corpus in Nawatl significantly.
arXiv Detail & Related papers (2025-10-06T15:46:54Z)
- $π$-yalli: un nouveau corpus pour le nahuatl [0.8247755416642547]
The NAHU$2$ project is a Franco-Mexican collaboration aimed at building the $π$-YALLI corpus adapted to machine learning. The $π$-YALLI corpus will be used to develop computer resources for the Nahuatl language.
arXiv Detail & Related papers (2024-12-20T12:03:10Z)
- Predictability and Causality in Spanish and English Natural Language Generation [6.817247544942709]
This paper compares causal and non-causal language modeling for English and Spanish.
According to this experiment, Spanish is more predictable than English given a non-causal context.
These insights support further research in NLG in Spanish using bidirectional transformer language models.
arXiv Detail & Related papers (2024-08-26T14:09:28Z)
- Towards Effective Disambiguation for Machine Translation with Large Language Models [65.80775710657672]
We study the capabilities of large language models to translate "ambiguous sentences".
Experiments show that our methods can match or outperform state-of-the-art systems such as DeepL and NLLB in four out of five language directions.
arXiv Detail & Related papers (2023-09-20T22:22:52Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Turkish Native Language Identification V2 [1.7802147489386628]
This paper presents the first application of Native Language Identification (NLI) for the Turkish language. We analyze a corpus of texts written by native speakers of Albanian, Arabic and Persian. Our models achieve promising results, and we analyze the most predictive features to reveal L1-specific transfer effects.
arXiv Detail & Related papers (2023-07-27T13:28:31Z)
- CLSE: Corpus of Linguistically Significant Entities [58.29901964387952]
We release a Corpus of Linguistically Significant Entities (CLSE) annotated by experts.
CLSE covers 74 different semantic types to support various applications from airline ticketing to video games.
We create a linguistically representative NLG evaluation benchmark in three languages: French, Marathi, and Russian.
arXiv Detail & Related papers (2022-11-04T12:56:12Z)
- Linking Emergent and Natural Languages via Corpus Transfer [98.98724497178247]
We propose a novel way to establish a link by corpus transfer between emergent languages and natural languages.
Our approach showcases non-trivial transfer benefits for two different tasks -- language modeling and image captioning.
We also introduce a novel metric to predict the transferability of an emergent language by translating emergent messages to natural language captions grounded on the same images.
arXiv Detail & Related papers (2022-03-24T21:24:54Z)
- Improving the Lexical Ability of Pretrained Language Models for Unsupervised Neural Machine Translation [127.81351683335143]
Cross-lingual pretraining requires models to align the lexical- and high-level representations of the two languages.
Previous research has shown that this is because the representations are not sufficiently aligned.
In this paper, we enhance the bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings.
arXiv Detail & Related papers (2021-03-18T21:17:58Z)
- The Return of Lexical Dependencies: Neural Lexicalized PCFGs [103.41187595153652]
We present novel neural models of lexicalized PCFGs which allow us to overcome sparsity problems.
Experiments demonstrate that this unified framework yields stronger results on both representations than are achieved with either formalism alone.
arXiv Detail & Related papers (2020-07-29T22:12:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.