Utilizing Wordnets for Cognate Detection among Indian Languages
- URL: http://arxiv.org/abs/2112.15124v1
- Date: Thu, 30 Dec 2021 16:46:28 GMT
- Title: Utilizing Wordnets for Cognate Detection among Indian Languages
- Authors: Diptesh Kanojia, Kevin Patel, Pushpak Bhattacharyya, Malhar Kulkarni,
Gholamreza Haffari
- Abstract summary: We detect cognate word pairs among ten Indian languages with Hindi.
We use deep learning methodologies to predict whether a word pair is cognate or not.
We report improved performance of up to 26%.
- Score: 50.83320088758705
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic Cognate Detection (ACD) is a challenging task which has been
utilized to help NLP applications like Machine Translation, Information
Retrieval and Computational Phylogenetics. Unidentified cognate pairs can pose
a challenge to these applications and result in a degradation of performance.
In this paper, we detect cognate word pairs among ten Indian languages with
Hindi and use deep learning methodologies to predict whether a word pair is
cognate or not. We identify IndoWordnet as a potential resource to detect
cognate word pairs based on orthographic similarity-based methods and train
neural network models using the data obtained from it. We identify parallel
corpora as another potential resource and perform the same experiments for
them. We also validate the contribution of Wordnets through further
experimentation and report improved performance of up to 26%. We discuss the
nuances of cognate detection among closely related Indian languages and release
the lists of detected cognates as a dataset. We also observe the behaviour of,
to an extent, unrelated Indian language pairs and release the lists of detected
cognates among them as well.
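As a rough illustration of what an orthographic similarity-based check can look like, the sketch below scores a word pair by normalized edit distance. It is not the authors' method. The abstract does not name the measure or cut-off used, so the function names, the normalization, and the 0.6 threshold are all assumptions made for this example. It also assumes both words have already been transliterated into a common script, since the languages involved use different scripts.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, counted over Unicode code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def orthographic_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus edit distance over the longer length."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def is_candidate_cognate(a: str, b: str, threshold: float = 0.6) -> bool:
    """Flag a pair as a candidate cognate (the threshold is an assumed value)."""
    return orthographic_similarity(a, b) >= threshold
```

Pairs flagged this way are only candidates. Per the abstract, the paper goes further and trains neural network models on the data obtained from IndoWordnet and from parallel corpora.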
Related papers
- Automated Cognate Detection as a Supervised Link Prediction Task with
Cognate Transformer [4.609569810881602]
Identification of cognates across related languages is one of the primary problems in historical linguistics.
We present a transformer-based architecture inspired by computational biology for the task of automated cognate detection.
arXiv Detail & Related papers (2024-02-05T11:47:36Z)
- Weakly-supervised Deep Cognate Detection Framework for Low-Resourced
Languages Using Morphological Knowledge of Closely-Related Languages [1.7622337807395716]
Exploiting cognates for transfer learning in under-resourced languages is an exciting opportunity for language understanding tasks.
Previous approaches mainly focused on supervised cognate detection tasks based on orthographic, phonetic or state-of-the-art contextual language models.
This paper proposes a novel language-agnostic weakly-supervised deep cognate detection framework for under-resourced languages.
arXiv Detail & Related papers (2023-11-09T05:46:41Z)
- Challenge Dataset of Cognates and False Friend Pairs from Indian
Languages [54.6340870873525]
Cognates are present in multiple variants of the same text across different languages.
In this paper, we describe the creation of two cognate datasets for twelve Indian languages.
arXiv Detail & Related papers (2021-12-17T14:23:43Z)
- Harnessing Cross-lingual Features to Improve Cognate Detection for
Low-resource Languages [50.82410844837726]
We demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages.
We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages.
We observe an improvement of up to 18% points, in terms of F-score, for cognate detection.
arXiv Detail & Related papers (2021-12-16T11:17:58Z)
- Linguistic Classification using Instance-Based Learning [0.0]
We take a contrarian approach and question the tree-based model that is rather restrictive.
For example, the affinity that Sanskrit independently has with languages across Indo-European languages is better illustrated using a network model.
We can say the same about the inter-relationships between languages in India, which are better discovered than assumed.
arXiv Detail & Related papers (2020-12-02T04:12:10Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and
Event Extraction [107.8262586956778]
We introduce graph convolutional networks (GCNs) with universal dependency parses to learn language-agnostic sentence representations.
GCNs struggle to model words that have long-range dependencies or that are not directly connected in the dependency tree.
We propose to utilize the self-attention mechanism to learn the dependencies between words with different syntactic distances.
arXiv Detail & Related papers (2020-10-06T20:30:35Z)
- On the Importance of Word Order Information in Cross-lingual Sequence
Labeling [80.65425412067464]
Cross-lingual models that fit into the word order of the source language might fail to handle target languages.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)