Harnessing Cross-lingual Features to Improve Cognate Detection for
Low-resource Languages
- URL: http://arxiv.org/abs/2112.08789v1
- Date: Thu, 16 Dec 2021 11:17:58 GMT
- Title: Harnessing Cross-lingual Features to Improve Cognate Detection for
Low-resource Languages
- Authors: Diptesh Kanojia, Raj Dabre, Shubham Dewangan, Pushpak Bhattacharyya,
Gholamreza Haffari, Malhar Kulkarni
- Abstract summary: We demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages.
We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages.
We observe an improvement of up to 18 percentage points in F-score for cognate detection.
- Score: 50.82410844837726
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cognates are variants of the same lexical form across different languages;
for example 'fonema' in Spanish and 'phoneme' in English are cognates, both of
which mean 'a unit of sound'. The task of automatic detection of cognates among
any two languages can help downstream NLP tasks such as Cross-lingual
Information Retrieval, Computational Phylogenetics, and Machine Translation. In
this paper, we demonstrate the use of cross-lingual word embeddings for
detecting cognates among fourteen Indian Languages. Our approach introduces the
use of context from a knowledge graph to generate improved feature
representations for cognate detection. We then evaluate the impact of our
cognate detection mechanism on neural machine translation (NMT), as a
downstream task. We evaluate our methods to detect cognates on a challenging
dataset of twelve Indian languages, namely, Sanskrit, Hindi, Assamese, Oriya,
Kannada, Gujarati, Tamil, Telugu, Punjabi, Bengali, Marathi, and Malayalam.
Additionally, we create evaluation datasets for two more Indian languages,
Konkani and Nepali. We observe an improvement of up to 18 percentage points in
F-score for cognate detection. Furthermore, we observe that cognates extracted
using our method help improve NMT quality by up to 2.76 BLEU. We also release
our code, newly constructed datasets and cross-lingual models publicly.
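
To make the idea in the abstract concrete, here is a minimal sketch of scoring a candidate word pair with one orthographic feature and one embedding feature. It is not the authors' method: the feature set, the averaging rule, the 0.5 threshold, and the three-dimensional vectors are all assumptions chosen for illustration, and the knowledge-graph context described in the abstract is not modelled.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def orthographic_similarity(a: str, b: str) -> float:
    """Edit distance normalised to [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cognate_features(w1, w2, emb1, emb2):
    """Feature vector: [orthographic similarity, embedding similarity]."""
    return [orthographic_similarity(w1, w2), cosine(emb1, emb2)]


def is_cognate(features, threshold=0.5) -> bool:
    """Hypothetical decision rule: mean of the features against a threshold."""
    return sum(features) / len(features) >= threshold


# Word pair from the abstract; the vectors are made-up stand-ins for
# embeddings that would come from a shared cross-lingual space.
emb_fonema = [0.9, 0.1, 0.3]
emb_phoneme = [0.8, 0.2, 0.4]
features = cognate_features("fonema", "phoneme", emb_fonema, emb_phoneme)
print(features, is_cognate(features))
```

In practice a trained classifier would replace the fixed threshold, and the embedding feature would come from vectors aligned across the two languages.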
Related papers
- Weakly-supervised Deep Cognate Detection Framework for Low-Resourced
Languages Using Morphological Knowledge of Closely-Related Languages [1.7622337807395716]
Exploiting cognates for transfer learning in under-resourced languages is an exciting opportunity for language understanding tasks.
Previous approaches mainly focused on supervised cognate detection tasks based on orthographic, phonetic or state-of-the-art contextual language models.
This paper proposes a novel language-agnostic weakly-supervised deep cognate detection framework for under-resourced languages.
(2023-11-09T05:46:41Z)
- Machine Translation by Projecting Text into the Same
Phonetic-Orthographic Space Using a Common Encoding [3.0422770070015295]
We propose an approach based on common multilingual Latin-based encodings (WX notation) that take advantage of language similarity.
We verify the proposed approach by demonstrating experiments on similar language pairs.
We also obtain an improvement of up to 1 BLEU point on distant and zero-shot language pairs.
(2023-05-21T06:46:33Z)
- Utilizing Wordnets for Cognate Detection among Indian Languages [50.83320088758705]
We detect cognate word pairs among ten Indian languages with Hindi.
We use deep learning methodologies to predict whether a word pair is cognate or not.
We report improved performance of up to 26%.
(2021-12-30T16:46:28Z)
- Challenge Dataset of Cognates and False Friend Pairs from Indian
Languages [54.6340870873525]
Cognates are present in multiple variants of the same text across different languages.
In this paper, we describe the creation of two cognate datasets for twelve Indian languages.
(2021-12-17T14:23:43Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared
Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
(2021-09-13T21:05:37Z)
- Cross-lingual Offensive Language Identification for Low Resource
Languages: The Case of Marathi [2.4737119633827174]
MOLD is the first dataset of its kind compiled for Marathi, opening a new domain for research in low-resource Indo-Aryan languages.
We present results from several machine learning experiments on this dataset, including zero-shot and other transfer learning experiments on state-of-the-art cross-lingual transformers.
(2021-09-08T11:29:44Z)
- Phoneme Recognition through Fine Tuning of Phonetic Representations: a
Case Study on Luhya Language Varieties [77.2347265289855]
We focus on phoneme recognition using Allosaurus, a method for multilingual recognition based on phonetic annotation.
To evaluate in a challenging real-world scenario, we curate phone recognition datasets for Bukusu and Saamia, two varieties of the Luhya language cluster of western Kenya and eastern Uganda.
We find that fine-tuning of Allosaurus, even with just 100 utterances, leads to significant improvements in phone error rates.
(2021-04-04T15:07:55Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with
Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
(2020-04-30T16:25:39Z)