"A Passage to India": Pre-trained Word Embeddings for Indian Languages
- URL: http://arxiv.org/abs/2112.13800v1
- Date: Mon, 27 Dec 2021 17:31:04 GMT
- Title: "A Passage to India": Pre-trained Word Embeddings for Indian Languages
- Authors: Kumar Saurav, Kumar Saunack, Diptesh Kanojia, Pushpak Bhattacharyya
- Abstract summary: We use various existing approaches to create multiple word embeddings for 14 Indian languages.
We place these embeddings for all these languages in a single repository.
We release a total of 436 models using 8 different approaches.
- Score: 30.607474624873014
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dense word vectors, or 'word embeddings', which encode semantic properties of
words, have now become integral to NLP tasks like Machine Translation (MT),
Question Answering (QA), Word Sense Disambiguation (WSD), and Information
Retrieval (IR). In this paper, we use various existing approaches to create
multiple word embeddings for 14 Indian languages. We place these embeddings for
all these languages, viz., Assamese, Bengali, Gujarati, Hindi, Kannada,
Konkani, Malayalam, Marathi, Nepali, Odia, Punjabi, Sanskrit, Tamil, and
Telugu in a single repository. Relatively newer approaches that emphasize
catering to context (BERT, ELMo, etc.) have shown significant improvements, but
require a large amount of resources to generate usable models. We release
pre-trained embeddings generated using both contextual and non-contextual
approaches. We also use MUSE and XLM to train cross-lingual embeddings for all
pairs of the aforementioned languages. To show the efficacy of our embeddings,
we evaluate our embedding models on XPOS, UPOS and NER tasks for all these
languages. We release a total of 436 models using 8 different approaches. We
hope they are useful for resource-constrained Indian language NLP. The
title of this paper refers to the famous novel 'A Passage to India' by E.M.
Forster, first published in 1924.
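As a hedged illustration of how the released non-contextual embeddings could be consumed, the sketch below loads a word2vec/fastText-style vector file with gensim and queries nearest neighbours; the file name "hi.vec" is a hypothetical placeholder, not the repository's actual naming.

```python
# A minimal sketch of loading one of the released non-contextual embedding
# models and querying nearest neighbours. "hi.vec" is a hypothetical
# placeholder; the repository's actual file naming may differ.
from gensim.models import KeyedVectors

# Word2vec/fastText text format: one "word dim1 dim2 ..." line per word.
vectors = KeyedVectors.load_word2vec_format("hi.vec", binary=False)

# Semantically similar Hindi words should cluster together in vector space.
print(vectors.most_similar("भारत", topn=5))
```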
Related papers
- Dictionary Insertion Prompting for Multilingual Reasoning on Multilingual Large Language Models [52.00446751692225]
We present a novel and simple yet effective method called Dictionary Insertion Prompting (DIP).
When providing a non-English prompt, DIP looks up a word dictionary and inserts words' English counterparts into the prompt for LLMs.
This enables better translation into English and better English reasoning steps, which leads to noticeably better results.
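A toy sketch of the dictionary-insertion idea summarized above; the mini Hindi-English dictionary and the gloss format are illustrative assumptions, not the paper's exact procedure.

```python
# A toy sketch of Dictionary Insertion Prompting (DIP): look up each word of
# a non-English prompt in a bilingual dictionary and insert its English
# counterpart in-line. The tiny dictionary here is illustrative only.
hi_en_dict = {"भारत": "India", "क्या": "what", "है": "is"}

def dip(prompt: str, dictionary: dict) -> str:
    """Insert English glosses after each word the dictionary covers."""
    out = []
    for word in prompt.split():
        gloss = dictionary.get(word)
        out.append(f"{word} ({gloss})" if gloss else word)
    return " ".join(out)

print(dip("भारत क्या है", hi_en_dict))
# -> "भारत (India) क्या (what) है (is)"
```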
arXiv Detail & Related papers (2024-11-02T05:10:50Z)
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
Transfer performance is often hindered when a low-resource target language is written in a different script from the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
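A minimal sketch of the alignment intuition, assuming a stand-in encoder and an MSE objective in place of the paper's actual loss: the same sentence in its original script and in transliterated form is pulled together in representation space.

```python
# A hedged sketch of the post-pretraining alignment idea: encode a sentence
# in its original script and in a transliterated form, then pull the two
# representations together. nn.EmbeddingBag stands in for a multilingual
# encoder; the MSE loss is an illustrative stand-in for the paper's objective.
import torch
import torch.nn as nn

encoder = nn.EmbeddingBag(1000, 64)  # stand-in for a multilingual encoder

orig = torch.randint(0, 1000, (1, 6))      # token ids, original script
translit = torch.randint(0, 1000, (1, 6))  # token ids, transliterated text

# Align the two views of the same sentence in representation space.
loss = nn.functional.mse_loss(encoder(orig), encoder(translit))
loss.backward()
```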
arXiv Detail & Related papers (2024-06-28T08:59:24Z)
- Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages [6.7638050195383075]
We analyze the challenges and propose techniques that can be tailored for Multilingual Named Entity Recognition for Indian languages.
We present human-annotated named entity corpora of 40K sentences for 4 Indian languages from two of the major Indian language families.
We achieve comparable performance on completely unseen benchmark datasets for Indian languages, which affirms the usability of our model.
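A hedged skeleton of such fine-tuning with Hugging Face Transformers; the base model, label set, and hyperparameters here are generic assumptions rather than the paper's setup.

```python
# A hedged skeleton for fine-tuning a multilingual encoder on Indian-language
# NER. The base model, labels, and hyperparameters are illustrative.
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(labels))

def finetune(train_dataset):
    """Launch fine-tuning once a tokenized, label-aligned dataset is supplied."""
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="ner-out", num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=train_dataset,
    )
    trainer.train()
```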
arXiv Detail & Related papers (2024-05-08T05:54:54Z)
- Paramanu: A Family of Novel Efficient Generative Foundation Language Models for Indian Languages [3.9018931027384056]
We present "Paramanu", a family of novel language models (LM) for Indian languages.
It covers 10 languages (Assamese, Bangla, Hindi, Konkani, Maithili, Marathi, Odia, Sanskrit, Tamil, Telugu) across 5 scripts.
The models are pretrained on a single GPU with a context size of 1024 and range in size from 13.29 million (M) to 367.5M parameters.
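A back-of-the-envelope sketch of where parameter counts in that range come from; the formula ignores biases, layer norms, and positional embeddings, and the example configurations are assumptions, not Paramanu's actual architecture.

```python
# A rough parameter count for a decoder-only transformer, illustrating how
# model sizes like those quoted above arise. Ignores biases, layer norms,
# positional embeddings, and any tied output head.
def transformer_params(vocab: int, d_model: int, n_layers: int, d_ff: int) -> int:
    embed = vocab * d_model       # token embedding matrix
    attn = 4 * d_model * d_model  # Q, K, V, and output projections
    ffn = 2 * d_model * d_ff      # two feed-forward matrices
    return embed + n_layers * (attn + ffn)

# Two illustrative configs land in the same ballpark as the quoted range:
print(f"{transformer_params(32_000, 256, 8, 1024) / 1e6:.1f}M")    # ~14.5M
print(f"{transformer_params(32_000, 1024, 24, 4096) / 1e6:.1f}M")  # ~334.8M
```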
arXiv Detail & Related papers (2024-01-31T17:58:10Z)
- IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages [37.758476568195256]
India has a rich linguistic landscape with languages from 4 major language families spoken by over a billion people.
22 of these languages are listed in the Constitution of India (referred to as scheduled languages).
arXiv Detail & Related papers (2023-05-25T17:57:43Z)
- Machine Translation by Projecting Text into the Same Phonetic-Orthographic Space Using a Common Encoding [3.0422770070015295]
We propose an approach based on common multilingual Latin-based encodings (WX notation) that take advantage of language similarity.
We validate the proposed approach with experiments on similar language pairs.
We also obtain an improvement of up to 1 BLEU point on distant and zero-shot language pairs.
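A toy illustration of the common-encoding idea, assuming tiny made-up character tables rather than the real WX scheme: once two scripts map into one Latin-based space, lexically similar words become identical strings.

```python
# A toy sketch of projecting two scripts into one common Latin-based encoding,
# in the spirit of the WX-notation approach above. The tables are tiny
# illustrative fragments, not the actual WX scheme.
DEVANAGARI = {"क": "k", "म": "m", "ल": "l"}
GURMUKHI = {"ਕ": "k", "ਮ": "m", "ਲ": "l"}

def encode(text: str, table: dict) -> str:
    return "".join(table.get(ch, ch) for ch in text)

# The same word in Hindi (Devanagari) and Punjabi (Gurmukhi) scripts
# collapses to one shared representation an MT model can exploit:
print(encode("कमल", DEVANAGARI), encode("ਕਮਲ", GURMUKHI))  # "kml" "kml"
```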
arXiv Detail & Related papers (2023-05-21T06:46:33Z)
- L3Cube-MahaNLP: Marathi Natural Language Processing Datasets, Models, and Library [1.14219428942199]
Despite being the third most widely spoken language in India, Marathi lacks useful NLP resources.
With L3Cube-MahaNLP, we aim to build resources and a library for Marathi natural language processing.
We present datasets and transformer models for supervised tasks like sentiment analysis, named entity recognition, and hate speech detection.
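A hedged sketch of how such a model might be queried via the Hugging Face pipeline API; the model id below is a hypothetical placeholder, so consult the L3Cube-MahaNLP repository for actual names.

```python
# A hedged sketch of querying a released Marathi transformer model through
# the Hugging Face pipeline API. The model id is a hypothetical placeholder;
# check the L3Cube-MahaNLP repository for the actual identifiers.
from transformers import pipeline

sentiment = pipeline("text-classification",
                     model="l3cube-pune/marathi-sentiment-model")  # placeholder id
print(sentiment("हा चित्रपट खूप छान आहे"))  # Marathi: "This movie is very nice"
```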
arXiv Detail & Related papers (2022-05-29T17:51:00Z)
- Challenge Dataset of Cognates and False Friend Pairs from Indian Languages [54.6340870873525]
Cognates are words that share a common origin and appear in similar forms across different languages.
In this paper, we describe the creation of two cognate datasets for twelve Indian languages.
arXiv Detail & Related papers (2021-12-17T14:23:43Z)
- Harnessing Cross-lingual Features to Improve Cognate Detection for Low-resource Languages [50.82410844837726]
We demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages.
We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages.
We observe an improvement of up to 18 percentage points in F-score for cognate detection.
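A minimal sketch of the core signal behind such detection, assuming made-up vectors: once two languages share one embedding space (e.g. via MUSE), candidate cognates can be scored by cosine similarity against a tuned threshold.

```python
# A minimal sketch of cognate detection with cross-lingual word embeddings:
# in a shared vector space, true cognates should have high cosine similarity.
# The vectors below are made up for illustration.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

hindi_vec = np.array([0.8, 0.1, 0.3])    # e.g. "आग" (fire) in Hindi
marathi_vec = np.array([0.7, 0.2, 0.3])  # e.g. "आग" (fire) in Marathi

# A simple thresholded decision; the threshold would be tuned on dev data.
print("cognate" if cosine(hindi_vec, marathi_vec) > 0.9 else "not cognate")
```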
arXiv Detail & Related papers (2021-12-16T11:17:58Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
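One common data-efficient step in this vein, sketched under the assumption (not the paper's exact method) that adaptation begins by extending the vocabulary and embedding matrix for the unseen script:

```python
# A hedged sketch of one adaptation step for unseen scripts: extend the
# tokenizer's vocabulary with new-script tokens and resize the embedding
# matrix so only the new rows need substantial training.
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Illustrative tokens from a script mBERT covers poorly (here, Odia).
num_added = tokenizer.add_tokens(["ଓଡ଼ିଆ", "ଭାଷା"])
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; new vocab size {len(tokenizer)}")
```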
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
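A minimal PyTorch sketch of that architecture family, with dimensions and the training objective simplified as assumptions; the encoder's hidden states serve as the contextualised embeddings.

```python
# A minimal sketch of an LSTM encoder-decoder whose encoder states double as
# contextualised word embeddings. Dimensions and the joint translation/
# reconstruction training loop are simplified assumptions.
import torch
import torch.nn as nn

class Seq2SeqEmbedder(nn.Module):
    def __init__(self, vocab: int, d_emb: int = 128, d_hid: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_emb)
        self.encoder = nn.LSTM(d_emb, d_hid, batch_first=True)
        self.decoder = nn.LSTM(d_emb, d_hid, batch_first=True)
        self.out = nn.Linear(d_hid, vocab)

    def forward(self, src, tgt):
        enc_states, (h, c) = self.encoder(self.embed(src))
        dec_states, _ = self.decoder(self.embed(tgt), (h, c))
        # enc_states are the contextualised embeddings; logits drive training
        return enc_states, self.out(dec_states)

model = Seq2SeqEmbedder(vocab=8000)
src = torch.randint(0, 8000, (2, 7))  # a toy batch of source sentences
ctx_emb, logits = model(src, src)     # reconstruction pass on the same batch
print(ctx_emb.shape)                  # torch.Size([2, 7, 256])
```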
arXiv Detail & Related papers (2020-10-27T22:24:01Z)