Challenge Dataset of Cognates and False Friend Pairs from Indian
Languages
- URL: http://arxiv.org/abs/2112.09526v1
- Date: Fri, 17 Dec 2021 14:23:43 GMT
- Title: Challenge Dataset of Cognates and False Friend Pairs from Indian
Languages
- Authors: Diptesh Kanojia, Pushpak Bhattacharyya, Malhar Kulkarni, Gholamreza
Haffari
- Abstract summary: Cognates are present in multiple variants of the same text across different languages.
In this paper, we describe the creation of two cognate datasets for twelve Indian languages.
- Score: 54.6340870873525
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cognates are present in multiple variants of the same text across different
languages (e.g., "hund" in German and "hound" in English language mean "dog").
They pose a challenge to various Natural Language Processing (NLP) applications
such as Machine Translation, Cross-lingual Sense Disambiguation, Computational
Phylogenetics, and Information Retrieval. A possible solution to address this
challenge is to identify cognates across language pairs. In this paper, we
describe the creation of two cognate datasets for twelve Indian languages,
namely Sanskrit, Hindi, Assamese, Oriya, Kannada, Gujarati, Tamil, Telugu,
Punjabi, Bengali, Marathi, and Malayalam. We digitize the cognate data from an
Indian language cognate dictionary and utilize linked Indian language Wordnets
to generate cognate sets. Additionally, we use the Wordnet data to create a
False Friends' dataset for eleven language pairs. We also evaluate the efficacy
of our dataset using previously available baseline cognate detection
approaches. We also perform a manual evaluation with the help of lexicographers
and release the curated gold-standard dataset with this paper.
Related papers
- Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages [6.7638050195383075]
We analyze the challenges and propose techniques that can be tailored for Multilingual Named Entity Recognition for Indian languages.
We present a human annotated named entity corpora of 40K sentences for 4 Indian languages from two of the major Indian language families.
We achieve comparable performance on completely unseen benchmark datasets for Indian languages which affirms the usability of our model.
arXiv Detail & Related papers (2024-05-08T05:54:54Z) - Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z) - Utilizing Wordnets for Cognate Detection among Indian Languages [50.83320088758705]
We detect cognate word pairs among ten Indian languages with Hindi.
We use deep learning methodologies to predict whether a word pair is cognate or not.
We report improved performance of up to 26%.
arXiv Detail & Related papers (2021-12-30T16:46:28Z) - Harnessing Cross-lingual Features to Improve Cognate Detection for
Low-resource Languages [50.82410844837726]
We demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages.
We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages.
We observe an improvement of up to 18% points, in terms of F-score, for cognate detection.
arXiv Detail & Related papers (2021-12-16T11:17:58Z) - Cross-lingual Offensive Language Identification for Low Resource
Languages: The Case of Marathi [2.4737119633827174]
MOLD is the first dataset of its kind compiled for Marathi, opening a new domain for research in low-resource Indo-Aryan languages.
We present results from several machine learning experiments on this dataset, including zero-short and other transfer learning experiments on state-of-the-art cross-lingual transformers.
arXiv Detail & Related papers (2021-09-08T11:29:44Z) - Offensive Language Identification in Low-resourced Code-mixed Dravidian
languages using Pseudo-labeling [0.16252563723817934]
We classify codemixed social media comments/posts in the Dravidian languages of Tamil, Kannada, and Malayalam.
A custom dataset is constructed by transliterating all the code-mixed texts into the respective Dravidian language.
We fine-tune several recent pretrained language models on the newly constructed dataset.
arXiv Detail & Related papers (2021-08-27T08:43:08Z) - Phoneme Recognition through Fine Tuning of Phonetic Representations: a
Case Study on Luhya Language Varieties [77.2347265289855]
We focus on phoneme recognition using Allosaurus, a method for multilingual recognition based on phonetic annotation.
To evaluate in a challenging real-world scenario, we curate phone recognition datasets for Bukusu and Saamia, two varieties of the Luhya language cluster of western Kenya and eastern Uganda.
We find that fine-tuning of Allosaurus, even with just 100 utterances, leads to significant improvements in phone error rates.
arXiv Detail & Related papers (2021-04-04T15:07:55Z) - Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.