Indian Language Wordnets and their Linkages with Princeton WordNet
- URL: http://arxiv.org/abs/2201.02977v1
- Date: Sun, 9 Jan 2022 10:12:31 GMT
- Title: Indian Language Wordnets and their Linkages with Princeton WordNet
- Authors: Diptesh Kanojia, Kevin Patel, Pushpak Bhattacharyya
- Abstract summary: We release mappings of 18 Indian language wordnets linked with Princeton WordNet.
We believe that the availability of such resources will have a direct impact on progress in NLP for these languages.
- Score: 38.50911435531732
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Wordnets are rich lexico-semantic resources. Linked wordnets are extensions
of wordnets, which link similar concepts in wordnets of different languages.
Such resources are extremely useful in many Natural Language Processing (NLP)
applications, primarily those based on knowledge-based approaches. In such
approaches, these resources are treated as a gold standard/oracle, so it is
crucial that they hold correct information; they are therefore created by human
experts. However, human experts in multiple languages are hard to come by, and
the community would benefit from the sharing of such manually created
resources. In this paper, we release mappings of 18 Indian language wordnets
linked with Princeton WordNet. We believe that the availability of such
resources will have a direct impact on progress in NLP for these languages.
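A minimal sketch of how such a linkage release could be consumed is given below. It assumes a hypothetical distribution format (a tab-separated file, here named hindi_pwn_linkage.tsv, with one link per line: IndoWordNet synset id, Princeton WordNet offset, and part of speech); the actual release format may differ. The Princeton WordNet side is resolved with NLTK's WordNet interface.

    # Sketch only: load a hypothetical IndoWordNet-to-Princeton-WordNet mapping file
    # and resolve the Princeton side with NLTK (requires nltk.download("wordnet")).
    import csv
    from nltk.corpus import wordnet as wn

    def load_linkage(path):
        """Map IndoWordNet synset ids to Princeton WordNet Synset objects."""
        mapping = {}
        with open(path, encoding="utf-8") as fh:
            for iwn_id, pwn_offset, pwn_pos in csv.reader(fh, delimiter="\t"):
                mapping[int(iwn_id)] = wn.synset_from_pos_and_offset(pwn_pos, int(pwn_offset))
        return mapping

    links = load_linkage("hindi_pwn_linkage.tsv")  # hypothetical file name
    for iwn_id, synset in list(links.items())[:5]:
        print(iwn_id, synset.name(), synset.definition())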
Related papers
- Content-Localization based Neural Machine Translation for Informal Dialectal Arabic: Spanish/French to Levantine/Gulf Arabic [5.2957928879391]
We propose a framework that localizes content from high-resource languages to low-resource languages/dialects using neural machine translation.
We provide the first parallel translation dataset between informal Spanish/French and informal Arabic dialects.
arXiv Detail & Related papers (2023-12-12T01:42:41Z)
- Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages [54.832599498774464]
We propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach.
We build MWEs one language at a time, starting from the resource-rich source and sequentially adding each language in the chain until we reach the target.
We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (5M tokens) and 4 moderately low-resource (50M tokens) target languages.
arXiv Detail & Related papers (2023-11-21T09:59:29Z)
- Contextualising Levels of Language Resourcedness affecting Digital Processing of Text [0.5620321106679633]
We argue that the dichotomous LRL/HRL typology is problematic for all languages.
The characterization is instead based on a typology of contextual features for each category, rather than on counting tools.
arXiv Detail & Related papers (2023-09-29T07:48:24Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Neural Machine Translation for the Indigenous Languages of the Americas: An Introduction [102.13536517783837]
Most languages of the Americas are low-resource, having little parallel or monolingual data, if any.
We discuss recent advances, findings, and open questions arising from the NLP community's increased interest in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z)
- A Survey of Corpora for Germanic Low-Resource Languages and Dialects [18.210880703295253]
This work focuses on low-resource languages, in particular non-standardized ones.
We make our overview of over 80 corpora publicly available to facilitate research.
arXiv Detail & Related papers (2023-04-19T16:45:16Z)
- NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia.
Most languages in Indonesia are categorized as endangered and some are even extinct.
We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z)
- Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation [133.7313847857935]
Our study highlights how NLP methods can be adapted to thousands more languages that are under-served by current technology.
For 19 under-represented languages across 3 tasks, our methods lead to consistent improvements of up to 5 and 15 points with and without extra monolingual text, respectively.
arXiv Detail & Related papers (2022-03-17T16:48:22Z)
- Semi-automatic WordNet Linking using Word Embeddings [33.15250956247636]
Linked wordnets are extensions of wordnets, which link similar concepts in wordnets of different languages.
We propose an approach to link wordnets: given a synset of the source language, it returns a ranked list of potential candidate synsets (a toy embedding-based ranking sketch appears after this list).
Our technique retrieves the correct (winner) synset within the top-10 ranked candidates for 60% of all synsets and 70% of noun synsets.
arXiv Detail & Related papers (2022-01-05T18:15:55Z)
- When Word Embeddings Become Endangered [0.685316573653194]
We present a method for constructing word embeddings for endangered languages using existing word embeddings of different resource-rich languages and translation dictionaries of resource-poor languages.
All our cross-lingual word embeddings and the sentiment analysis model have been released openly via an easy-to-use Python library.
arXiv Detail & Related papers (2021-03-24T15:42:53Z)
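The semi-automatic WordNet linking entry above lends itself to a toy illustration: represent the source synset and each candidate synset as averaged word vectors in a shared cross-lingual space and rank the candidates by cosine similarity. This is a hedged sketch of the general idea, not the authors' implementation; the embedding table and candidate word lists below are made-up placeholders.

    # Toy candidate-synset ranking by embedding similarity (illustrative only).
    import numpy as np

    embed = {  # hypothetical shared cross-lingual embedding space
        "tree": np.array([0.9, 0.1, 0.0]),
        "plant": np.array([0.7, 0.3, 0.1]),
        "woody": np.array([0.8, 0.2, 0.1]),
        "vehicle": np.array([0.0, 0.9, 0.4]),
        "car": np.array([0.1, 0.8, 0.5]),
    }

    def synset_vector(words):
        """Average the vectors of a synset's lemma/gloss words (skip unknown words)."""
        vecs = [embed[w] for w in words if w in embed]
        return np.mean(vecs, axis=0) if vecs else np.zeros(3)

    def cosine(a, b):
        denom = float(np.linalg.norm(a) * np.linalg.norm(b)) or 1.0
        return float(a @ b) / denom

    def rank_candidates(source_words, candidates):
        """Return (candidate id, score) pairs sorted by similarity to the source synset."""
        src = synset_vector(source_words)
        scored = [(cid, cosine(src, synset_vector(ws))) for cid, ws in candidates.items()]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)

    candidates = {"pwn:tree.n.01": ["tree", "woody", "plant"],
                  "pwn:car.n.01": ["car", "vehicle"]}
    print(rank_candidates(["tree", "plant"], candidates))  # the tree synset ranks first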
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.