AfriVEC: Word Embedding Models for African Languages. Case Study of Fon
and Nobiin
- URL: http://arxiv.org/abs/2103.05132v1
- Date: Mon, 8 Mar 2021 22:58:20 GMT
- Title: AfriVEC: Word Embedding Models for African Languages. Case Study of Fon
and Nobiin
- Authors: Bonaventure F. P. Dossou and Mohammed Sabry
- Abstract summary: We build Word2Vec and Poincaré word embedding models for Fon and Nobiin.
Our main contribution is to spark more interest in creating word embedding models tailored to African Languages.
- Score: 0.015863809575305417
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: From Word2Vec to GloVe, word embedding models have played key roles in the
current state-of-the-art results achieved in Natural Language Processing.
Designed to give significant and unique vectorized representations of words and
entities, those models have proven to efficiently extract similarities and
establish relationships reflecting semantic and contextual meaning among words
and entities. African Languages, representing more than 31% of the world's
spoken languages, have recently been the subject of growing research. However, to
the best of our knowledge, there are currently few to no word embedding
models for the words and entities of those languages, and none for the languages under
study in this paper. After describing the functionalities of GloVe, Word2Vec, and
Poincaré embeddings, we build Word2Vec and Poincaré word embedding
models for Fon and Nobiin, which show promising results. We test the
applicability of transfer learning between these models as a starting point for
African Languages to jointly mitigate the scarcity of their
resources, and attempt to provide linguistic and social interpretations of our
results. Our main contribution is to spark more interest in creating ready-to-use word
embedding models tailored to African Languages that can
significantly improve the performance of Natural Language Processing
downstream tasks on them. The official repository and implementation are at
https://github.com/bonaventuredossou/afrivec
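
The abstract does not detail the training pipeline, but as a rough illustration of the kind of setup involved, the sketch below trains Word2Vec and Poincaré embeddings with gensim 4.x. The corpus file name, the tokens, and the hierarchy relation pairs are hypothetical placeholders, not the authors' actual data or code.

```python
# A minimal sketch, assuming gensim 4.x; paths, tokens, and relation pairs
# below are hypothetical placeholders, not the authors' data or code.
from gensim.models import Word2Vec
from gensim.models.poincare import PoincareModel

# Hypothetical tokenized Fon corpus: one sentence per line, space-separated tokens.
with open("fon_corpus.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f if line.strip()]

# Skip-gram Word2Vec; small vectors and min_count=1 suit a low-resource corpus.
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1, epochs=30)
w2v.save("fon_word2vec.model")
print(w2v.wv.most_similar(sentences[0][0]))  # nearest neighbours of a corpus token

# Poincaré embeddings are trained on hierarchical (child, parent) pairs rather
# than raw sentences; the hierarchy below is purely illustrative.
relations = [
    ("leaf_a", "branch_x"), ("leaf_b", "branch_x"),
    ("leaf_c", "branch_y"), ("leaf_d", "branch_y"),
    ("branch_x", "root"), ("branch_y", "root"),
]
poincare = PoincareModel(relations, size=50, negative=2)
poincare.train(epochs=50)
print(poincare.kv.distance("leaf_a", "root"))  # hyperbolic distance in the embedding
poincare.save("fon_poincare.model")
```

The key design difference the sketch highlights is that Word2Vec learns from co-occurrence in running text, while Poincaré embeddings require explicit hierarchical relations, so the two models need differently prepared inputs.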
Related papers
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Tokenization Impacts Multilingual Language Modeling: Assessing
Vocabulary Allocation and Overlap Across Languages [3.716965622352967]
We propose new criteria to evaluate the quality of lexical representation and vocabulary overlap observed in sub-word tokenizers.
Our findings show that vocabulary overlap across languages can actually be detrimental to certain downstream tasks.
arXiv Detail & Related papers (2023-05-26T18:06:49Z) - AfroLM: A Self-Active Learning-based Multilingual Pretrained Language
Model for 23 African Languages [0.021987601456703476]
We present AfroLM, a multilingual language model pretrained from scratch on 23 African languages.
AfroLM is pretrained on a dataset 14x smaller than existing baselines.
It is able to generalize well across various domains.
arXiv Detail & Related papers (2022-11-07T02:15:25Z) - MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity
Recognition [55.95128479289923]
African languages are spoken by over a billion people, but are underrepresented in NLP research and development.
We create the largest human-annotated NER dataset for 20 African languages.
We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
arXiv Detail & Related papers (2022-10-22T08:53:14Z) - Transparency Helps Reveal When Language Models Learn Meaning [71.96920839263457]
Our systematic experiments with synthetic data reveal that, with languages where all expressions have context-independent denotations, both autoregressive and masked language models learn to emulate semantic relations between expressions.
Turning to natural language, our experiments with a specific phenomenon, referential opacity, add to the growing body of evidence that current language models do not represent natural language semantics well.
arXiv Detail & Related papers (2022-10-14T02:35:19Z) - Low-Resource Language Modelling of South African Languages [6.805575417034369]
We evaluate the performance of open-vocabulary language models on low-resource South African languages.
We evaluate different variants of n-gram models, feedforward neural networks, recurrent neural networks (RNNs) and Transformers on small-scale datasets.
Overall, well-regularized RNNs give the best performance across two isiZulu datasets and one Sepedi dataset.
arXiv Detail & Related papers (2021-04-01T21:27:27Z) - OkwuGbé: End-to-End Speech Recognition for Fon and Igbo [0.015863809575305417]
We present a state-of-the-art ASR model for Fon, as well as benchmark ASR model results for Igbo.
We conduct a comprehensive linguistic analysis of each language and describe the creation of end-to-end, deep neural network-based speech recognition models for both languages.
arXiv Detail & Related papers (2021-03-13T18:02:44Z) - Revisiting Language Encoding in Learning Multilingual Representations [70.01772581545103]
We propose a new approach called Cross-lingual Language Projection (XLP) to replace language embedding.
XLP projects the word embeddings into language-specific semantic space, and then the projected embeddings will be fed into the Transformer model.
Experiments show that XLP can freely and significantly boost the model performance on extensive multilingual benchmark datasets.
arXiv Detail & Related papers (2021-02-16T18:47:10Z) - Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z) - Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)