Co-occurrences using Fasttext embeddings for word similarity tasks in
Urdu
- URL: http://arxiv.org/abs/2102.10957v1
- Date: Mon, 22 Feb 2021 12:56:26 GMT
- Title: Co-occurrences using Fasttext embeddings for word similarity tasks in
Urdu
- Authors: Usama Khalid, Aizaz Hussain, Muhammad Umair Arshad, Waseem Shahzad and
Mirza Omer Beg
- Abstract summary: This paper builds a corpus for Urdu by scraping and integrating data from various sources.
We modify fastText embeddings and N-gram models to enable training them on our built corpus.
We have used these trained embeddings for a word similarity task and compared the results with existing techniques.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Urdu is a widely spoken language in South Asia. Although a
considerable body of literature exists in Urdu, the available data is still not
enough to process the language with NLP techniques. Very efficient language
models exist for English, a high-resource language, but Urdu and other
under-resourced languages have been neglected for a long time. To create
efficient language models for these languages, we must first have good word
embedding models. For Urdu, the only word embeddings available are those
trained and developed using the skip-gram model. In this paper, we build a
corpus for Urdu by scraping and integrating data from various sources and
compile a vocabulary for the Urdu language. We also modify fastText embeddings
and N-gram models to enable training them on our built corpus. We use these
trained embeddings for a word similarity task and compare the results with
existing techniques.
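The paper's code is not reproduced here, but the core fastText idea it builds on can be sketched in a few lines: a word is represented as the sum of vectors for its character n-grams (with `<` and `>` boundary markers), so rare or unseen Urdu words still receive useful embeddings from shared subwords, and word similarity is scored with cosine similarity. All function names and the toy vectors below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of fastText's subword representation and cosine word
# similarity. Pure stdlib; vectors here are toy inputs, not trained weights.
import math

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, with the <...> boundary markers fastText uses."""
    w = f"<{word}>"
    return [w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def subword_vector(word, gram_vectors, dim=8):
    """Sum the vectors of a word's n-grams; n-grams absent from the table add zero."""
    vec = [0.0] * dim
    for g in char_ngrams(word):
        for j, x in enumerate(gram_vectors.get(g, [0.0] * dim)):
            vec[j] += x
    return vec

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

On a word-similarity benchmark, one would score each word pair with this cosine and correlate the scores with human judgments (typically via Spearman's rho), which is the usual form of the comparison the abstract describes.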
Related papers
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Urdu Morphology, Orthography and Lexicon Extraction [0.0]
This paper describes an implementation of the Urdu language as a software API.
We deal with orthography, morphology and the extraction of the lexicon.
arXiv Detail & Related papers (2022-04-06T20:14:01Z)
- Multilingual Text Classification for Dravidian Languages [4.264592074410622]
We propose a multilingual text classification framework for the Dravidian languages.
The framework uses the pre-trained LaBSE model as its base.
Since the base model cannot fully recognize and exploit correlations among languages, we further propose a language-specific representation module.
arXiv Detail & Related papers (2021-12-03T04:26:49Z)
- Constrained Language Models Yield Few-Shot Semantic Parsers [73.50960967598654]
We explore the use of large pretrained language models as few-shot semantic parsers.
The goal in semantic parsing is to generate a structured meaning representation given a natural language input.
We use language models to paraphrase inputs into a controlled sublanguage resembling English that can be automatically mapped to a target meaning representation.
arXiv Detail & Related papers (2021-04-18T08:13:06Z)
- Bilingual Language Modeling, A transfer learning technique for Roman Urdu [0.0]
We show how code-switching property of languages may be used to perform cross-lingual transfer learning from a corresponding high resource language.
We also show how this transfer learning technique termed Bilingual Language Modeling can be used to produce better performing models for Roman Urdu.
arXiv Detail & Related papers (2021-02-22T12:56:37Z)
- HinFlair: pre-trained contextual string embeddings for pos tagging and text classification in the Hindi language [0.0]
HinFlair is a language representation model (contextual string embeddings) pre-trained on a large monolingual Hindi corpus.
Results show that HinFlair outperforms previous state-of-the-art publicly available pre-trained embeddings for downstream tasks like text classification and pos tagging.
arXiv Detail & Related papers (2021-01-18T09:23:35Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
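The paper above compares neural and character language models; as a hedged illustration of the "nearly zero training examples" setting it describes, the sketch below ranks lexicon entries by Levenshtein edit distance to a misspelled token. This is only a simplified zero-training baseline that such interactive systems commonly start from, not the paper's models; the function names are invented for illustration.

```python
# Toy spelling-correction baseline: pick the lexicon word with the smallest
# Levenshtein edit distance to the input token. No training data needed.

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))          # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to empty string
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution (0 if equal)
        prev = cur
    return prev[-1]

def correct(token, lexicon):
    """Return the lexicon word closest to `token`; ties break alphabetically."""
    return min(sorted(lexicon), key=lambda w: edit_distance(token, w))
```

In the interactive scenario the paper studies, each user correction would be added to the lexicon (or training pool), so the model improves as more data is collected.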
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Efficient Urdu Caption Generation using Attention based LSTM [0.0]
Urdu is the national language of Pakistan and is also widely spoken and understood across the India-Pakistan subcontinent.
We develop an attention-based deep learning model using techniques of sequence modeling specialized for the Urdu language.
We evaluate our proposed technique on this dataset and show that it can achieve a BLEU score of 0.83 in the Urdu language.
arXiv Detail & Related papers (2020-08-02T17:22:33Z)
- Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation [73.65237422910738]
We present an easy and efficient method to extend existing sentence embedding models to new languages.
This makes it possible to create multilingual versions of previously monolingual models.
arXiv Detail & Related papers (2020-04-21T08:20:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.