Weakly-supervised Deep Cognate Detection Framework for Low-Resourced
Languages Using Morphological Knowledge of Closely-Related Languages
- URL: http://arxiv.org/abs/2311.05155v1
- Date: Thu, 9 Nov 2023 05:46:41 GMT
- Title: Weakly-supervised Deep Cognate Detection Framework for Low-Resourced
Languages Using Morphological Knowledge of Closely-Related Languages
- Authors: Koustava Goswami, Priya Rani, Theodorus Fransen, John P. McCrae
- Abstract summary: Exploiting cognates for transfer learning in under-resourced languages is an exciting opportunity for language understanding tasks.
Previous approaches mainly focused on supervised cognate detection tasks based on orthographic, phonetic or state-of-the-art contextual language models.
This paper proposes a novel language-agnostic weakly-supervised deep cognate detection framework for under-resourced languages.
- Score: 1.7622337807395716
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Exploiting cognates for transfer learning in under-resourced languages is an
exciting opportunity for language understanding tasks, including unsupervised
machine translation, named entity recognition and information retrieval.
Previous approaches mainly focused on supervised cognate detection tasks based
on orthographic, phonetic or state-of-the-art contextual language models, which
under-perform for most under-resourced languages. This paper proposes a novel
language-agnostic weakly-supervised deep cognate detection framework for
under-resourced languages using morphological knowledge from closely related
languages. We train an encoder to acquire morphological knowledge of a
language and transfer that knowledge to unsupervised and weakly-supervised
cognate detection for closely related languages, with and without a pivot
language. Because it is unsupervised, the approach removes the need for
hand-crafted annotation of cognates. In experiments on several published
cognate detection datasets across language families, our method achieved
significant improvements, outperforming both state-of-the-art supervised and
unsupervised methods. Since it does not require annotated cognate pairs for
training, our model can be extended to a wide range of languages from any
language family. The code and dataset building scripts can be found at
https://github.com/koustavagoswami/Weakly_supervised-Cognate_Detection
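The sketch below is a rough illustration of the pipeline the abstract describes, not the authors' implementation (see the linked repository for that): a character-level encoder, the component that would carry morphological knowledge after pretraining on a closely related language, embeds each word, and a candidate pair is scored by cosine similarity. The GRU architecture, toy alphabet, and similarity scoring are all assumptions.

```python
# Minimal sketch, assuming a character-level GRU encoder and cosine scoring.
# This is NOT the authors' implementation (see the repository linked above);
# per the abstract, the encoder would first be pretrained to capture the
# morphological knowledge of a closely related language.
import torch
import torch.nn as nn
import torch.nn.functional as F

ALPHABET = "abcdefghijklmnopqrstuvwxyz"                # toy character inventory
CHAR2IDX = {c: i + 1 for i, c in enumerate(ALPHABET)}  # 0 = padding/unknown

class CharEncoder(nn.Module):
    """Maps a word, as a character sequence, to a fixed-size vector."""
    def __init__(self, vocab_size: int, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        _, h = self.gru(self.embed(char_ids))          # h: (1, batch, dim)
        return h.squeeze(0)                            # (batch, dim)

def encode(encoder: CharEncoder, word: str) -> torch.Tensor:
    ids = torch.tensor([[CHAR2IDX.get(c, 0) for c in word.lower()]])
    return encoder(ids)

encoder = CharEncoder(vocab_size=len(ALPHABET) + 1)
# Score a candidate cognate pair; any decision threshold is an assumption.
score = F.cosine_similarity(encode(encoder, "nacht"), encode(encoder, "night"))
print(f"cognate score: {score.item():.3f}")
```

In the weakly-supervised setting the abstract describes, pairs scoring above such a threshold could serve as noisy positives, removing the need for hand-annotated cognates.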
Related papers
- A Computational Model for the Assessment of Mutual Intelligibility Among Closely Related Languages [1.5773159234875098]
Closely related languages show linguistic similarities that allow speakers of one language to understand speakers of another language without having actively learned it.
Mutual intelligibility varies in degree and is typically tested in psycholinguistic experiments.
We propose a computer-assisted method using the Linear Discriminative Learner to approximate the cognitive processes by which humans learn languages.
arXiv Detail & Related papers (2024-02-05T11:32:13Z)
- Overcoming Language Disparity in Online Content Classification with Multimodal Learning [22.73281502531998]
Large language models are now the standard for developing state-of-the-art solutions for text detection and classification tasks.
The development of advanced computational techniques and resources is disproportionately focused on the English language.
We explore the promise of incorporating the information contained in images via multimodal machine learning.
arXiv Detail & Related papers (2022-05-19T17:56:02Z)
- Utilizing Wordnets for Cognate Detection among Indian Languages [50.83320088758705]
We detect cognate word pairs between Hindi and ten other Indian languages.
We use deep learning methodologies to predict whether a word pair is cognate or not.
We report performance improvements of up to 26%.
arXiv Detail & Related papers (2021-12-30T16:46:28Z)
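Since the entry above describes pair classification directly, here is a minimal siamese-style sketch of "predict whether a word pair is cognate": a shared character encoder embeds both words and a small classifier scores the combined representation. The architecture and byte-valued character ids are illustrative assumptions; the wordnet-based information that paper uses is omitted.

```python
# Minimal siamese-style sketch of cognate pair classification, assuming a
# shared character-level GRU encoder. Not the paper's architecture.
import torch
import torch.nn as nn

class PairClassifier(nn.Module):
    def __init__(self, n_chars: int = 128, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(n_chars, dim, padding_idx=0)
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.clf = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def encode(self, word: str) -> torch.Tensor:
        ids = torch.tensor([[min(ord(c), 127) for c in word]])
        _, h = self.gru(self.embed(ids))
        return h.squeeze(0)                        # (1, dim)

    def forward(self, w1: str, w2: str) -> torch.Tensor:
        h1, h2 = self.encode(w1), self.encode(w2)
        # Combine both encodings plus their absolute difference.
        feats = torch.cat([h1, h2, (h1 - h2).abs()], dim=-1)
        return torch.sigmoid(self.clf(feats))      # P(pair is cognate)

model = PairClassifier()                           # untrained, for shape only
print(model("mata", "matha").item())               # train with BCE on labeled pairs
```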
- Cross-Lingual Adaptation for Type Inference [29.234418962960905]
We propose a cross-lingual adaptation framework, PLATO, to transfer a deep learning-based type inference procedure across weakly typed languages.
By leveraging data from strongly typed languages, PLATO reduces the perplexity of the backbone cross-programming-language model.
arXiv Detail & Related papers (2021-07-01T00:20:24Z)
- Improving the Lexical Ability of Pretrained Language Models for Unsupervised Neural Machine Translation [127.81351683335143]
Cross-lingual pretraining requires models to align the lexical and high-level representations of the two languages.
Previous research has shown that performance suffers when these representations are not sufficiently aligned.
In this paper, we enhance bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings.
arXiv Detail & Related papers (2021-03-18T21:17:58Z)
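One standard recipe for obtaining type-level cross-lingual embeddings like those the entry above mentions is orthogonal Procrustes alignment over a small seed dictionary. The sketch below shows that generic technique on synthetic data; it is not necessarily that paper's exact procedure.

```python
# Generic orthogonal Procrustes alignment of two embedding spaces over a seed
# dictionary; a common recipe for type-level cross-lingual embeddings, shown
# here on synthetic data rather than real subword vectors.
import numpy as np

def procrustes_align(src: np.ndarray, tgt: np.ndarray) -> np.ndarray:
    """Return the orthogonal W minimizing ||src @ W - tgt||_F."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

rng = np.random.default_rng(0)
src = rng.normal(size=(100, 50))                       # source subword vectors
rotation = np.linalg.qr(rng.normal(size=(50, 50)))[0]  # hidden ground truth
tgt = src @ rotation                                   # "target-language" space
W = procrustes_align(src, tgt)
print(np.allclose(src @ W, tgt))                       # True: rotation recovered
```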
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
The "vokenizer" is trained on relatively small image captioning datasets and then applied to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
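A hedged sketch of the retrieval step behind the "vokenization" idea above: each token is assigned its most similar image (its "voken"), whose id can then act as an auxiliary supervision target. The random embeddings stand in for a real contextual token encoder and image set.

```python
# Sketch of the retrieval step only: assign each token its most similar image
# ("voken"). Random vectors stand in for trained token and image encoders.
import numpy as np

rng = np.random.default_rng(1)
token_emb = rng.normal(size=(6, 32))    # contextual embeddings for 6 tokens
image_emb = rng.normal(size=(100, 32))  # embeddings for a small image set

def normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

sims = normalize(token_emb) @ normalize(image_emb).T  # cosine similarities
vokens = sims.argmax(axis=1)                          # one image id per token
print(vokens)  # these ids act as auxiliary labels during language pretraining
```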
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
- On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models fitted to the word order of the source language might fail to handle target languages with different word orders.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)
- Cross-lingual, Character-Level Neural Morphological Tagging [57.0020906265213]
We train character-level recurrent neural taggers to predict morphological tags for high-resource and low-resource languages jointly.
Learning joint character representations across multiple related languages successfully enables knowledge transfer from the high-resource languages to the low-resource ones, improving accuracy by up to 30% over a monolingual model.
arXiv Detail & Related papers (2017-08-30T08:14:34Z)
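To make the joint-training idea above concrete, here is an illustrative sketch: a single character-level encoder shared by a high-resource and a related low-resource language, so both draw on the same character representations. The tag set, alphabet, and sizes are assumptions, and the model below is untrained.

```python
# Illustrative sketch of joint character-level tagging: one encoder shared by
# a high-resource and a related low-resource language. Not the paper's setup.
import torch
import torch.nn as nn

TAGS = ["N;SG", "N;PL", "V;PST"]                       # toy morphological tags
CHARS = {c: i + 1 for i, c in enumerate("abcdefghijklmnopqrstuvwxyzäöü")}

class JointCharTagger(nn.Module):
    def __init__(self, n_chars: int, n_tags: int, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(n_chars, dim, padding_idx=0)
        self.gru = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * dim, n_tags)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        _, h = self.gru(self.embed(char_ids))          # h: (2, batch, dim)
        return self.out(torch.cat([h[0], h[1]], dim=-1))

def ids(word: str) -> torch.Tensor:
    return torch.tensor([[CHARS.get(c, 0) for c in word]])

model = JointCharTagger(len(CHARS) + 1, len(TAGS))
# Words from two related languages pass through the same shared encoder,
# which is what enables transfer to the low-resource language.
for word in ("nächte", "nachten"):
    print(word, TAGS[model(ids(word)).argmax(-1).item()])
```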