Using Linguistic Typology to Enrich Multilingual Lexicons: the Case of
Lexical Gaps in Kinship
- URL: http://arxiv.org/abs/2204.05049v1
- Date: Mon, 11 Apr 2022 12:36:26 GMT
- Title: Using Linguistic Typology to Enrich Multilingual Lexicons: the Case of
Lexical Gaps in Kinship
- Authors: Temuulen Khishigsuren, Gábor Bella, Khuyagbaatar Batsuren, Abed
Alhakim Freihat, Nandu Chandran Nair, Amarsanaa Ganbold, Hadi Khalilia,
Yamini Chandrashekar, Fausto Giunchiglia
- Abstract summary: We capture the phenomenon of diversity through the notions of lexical gap and language-specific word.
We publish a lexico-semantic resource consisting of 198 domain concepts, 1,911 words, and 37,370 gaps covering 699 languages.
- Score: 4.970603969125883
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper describes a method to enrich lexical resources with content
relating to linguistic diversity, based on knowledge from the field of lexical
typology. We capture the phenomenon of diversity through the notions of lexical
gap and language-specific word and use a systematic method to infer gaps
semi-automatically on a large scale. As a first result obtained for the domain
of kinship terminology, known to be very diverse throughout the world, we
publish a lexico-semantic resource consisting of 198 domain concepts, 1,911
words, and 37,370 gaps covering 699 languages. We see potential in the use of
resources such as ours for the improvement of a variety of cross-lingual NLP
tasks, which we demonstrate through a downstream application for the evaluation
of machine translation systems.
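To make the notions of lexical gap and language-specific word concrete, here is a minimal, purely illustrative sketch of how a resource of this kind could be represented and queried. The concept names, the toy entries, and the helper functions are assumptions for illustration only and do not reproduce the published resource or its schema; the sibling example follows the standard typological observation that Japanese lexicalizes elder vs. younger brother with distinct basic terms while English does not.

```python
# Toy kinship lexicon: each domain concept maps languages to a word,
# or to None where the concept is a lexical gap in that language.
# Hand-made illustration -- not the published 198-concept resource.
KINSHIP = {
    "elder brother":   {"ja": "ani",    "en": None},       # gap in English
    "younger brother": {"ja": "otouto", "en": None},       # gap in English
    "brother":         {"ja": None,     "en": "brother"},  # gap in Japanese
}

def lexicalization(concept: str, lang: str):
    """Word expressing `concept` in `lang`, or None for a lexical gap."""
    return KINSHIP.get(concept, {}).get(lang)

def gaps(lang: str):
    """Domain concepts that `lang` does not lexicalize with a single word."""
    return [c for c, words in KINSHIP.items() if words.get(lang) is None]

print(gaps("en"))   # ['elder brother', 'younger brother']
print(gaps("ja"))   # ['brother']
```

Gaps of this kind can in principle be inferred at scale from typological patterns of kinship systems rather than listed by hand; the exact inference rules used in the paper are not reproduced here.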
Related papers
- Crowdsourcing Lexical Diversity [7.569845058082537]
This paper proposes a novel crowdsourcing methodology for reducing bias in lexicons.
Crowd workers compare lexemes from two languages, focusing on domains rich in lexical diversity, such as kinship or food.
We validated our method by applying it to two case studies focused on food-related terminology.
arXiv Detail & Related papers (2024-10-30T15:45:09Z)
- LexGen: Domain-aware Multilingual Lexicon Generation [40.97738267067852]
We propose a new model to generate dictionary words for 6 Indian languages in the multi-domain setting.
Our model consists of domain-specific and domain-generic layers that encode information.
We release a new benchmark dataset across 6 Indian languages that span 8 diverse domains.
arXiv Detail & Related papers (2024-05-18T07:02:43Z)
- Lexical Diversity in Kinship Across Languages and Dialects [6.80465507148218]
We introduce a method to enrich computational lexicons with content relating to linguistic diversity.
The method is verified through two large-scale case studies on kinship terminology.
arXiv Detail & Related papers (2023-08-24T19:49:30Z)
- Multilingual Word Sense Disambiguation with Unified Sense Representation [55.3061179361177]
We propose building knowledge-based and supervised Multilingual Word Sense Disambiguation (MWSD) systems.
We build unified sense representations for multiple languages and address the annotation scarcity problem for MWSD by transferring annotations from resource-rich languages to resource-poor ones.
Evaluations on the SemEval-13 and SemEval-15 datasets demonstrate the effectiveness of our methodology.
arXiv Detail & Related papers (2022-10-14T01:24:03Z)
- UAlberta at SemEval 2022 Task 2: Leveraging Glosses and Translations for Multilingual Idiomaticity Detection [4.66831886752751]
We describe the University of Alberta systems for the SemEval-2022 Task 2 on multilingual idiomaticity detection.
Under the assumption that idiomatic expressions are noncompositional, our first method integrates information on the meanings of the individual words of an expression into a binary classifier.
Our second method translates an expression in context, and uses a lexical knowledge base to determine if the translation is literal.
arXiv Detail & Related papers (2022-05-27T16:35:00Z)
- Probing Pretrained Language Models for Lexical Semantics [76.73599166020307]
We present a systematic empirical analysis across six typologically diverse languages and five different lexical tasks.
Our results indicate patterns and best practices that hold universally, but also point to prominent variations across languages and tasks.
arXiv Detail & Related papers (2020-10-12T14:24:01Z)
- Gender Bias in Multilingual Embeddings and Cross-Lingual Transfer [101.58431011820755]
We study gender bias in multilingual embeddings and how it affects transfer learning for NLP applications.
We create a multilingual dataset for bias analysis and propose several ways for quantifying bias in multilingual representations.
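As a rough illustration of the kind of measurement involved (not the specific metrics proposed in that paper), a simple cosine-based association score contrasts a target word's similarity to gendered attribute sets; the attribute word lists and the `embed` lookup below are placeholder assumptions.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def gender_association(word, embed, male=("he", "man", "male"),
                       female=("she", "woman", "female")):
    """Difference between a word's mean cosine similarity to male vs.
    female attribute words; values far from 0 suggest a gendered skew.
    `embed` is any callable mapping a word to its vector (placeholder)."""
    w = embed(word)
    male_sim = np.mean([cosine(w, embed(a)) for a in male])
    female_sim = np.mean([cosine(w, embed(a)) for a in female])
    return male_sim - female_sim

# Hypothetical usage: averaging |gender_association(w, embed)| over a list
# of occupation words in each language gives one coarse per-language score.
```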
arXiv Detail & Related papers (2020-05-02T04:34:37Z)
- Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures [73.06435180872293]
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers.
We show that features from various typological categories can be predicted reliably.
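A minimal sketch of that kind of architecture (byte embeddings, a convolutional layer, a recurrent encoder, and a per-feature classifier head) is shown below in PyTorch; the layer sizes and the single-feature head are arbitrary assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class TypologyPredictor(nn.Module):
    """Byte-level text encoder predicting one categorical WALS-style
    feature. Sizes are illustrative, not the paper's hyperparameters."""
    def __init__(self, n_classes: int, emb: int = 64, hidden: int = 128):
        super().__init__()
        self.byte_emb = nn.Embedding(256, emb)          # one vector per byte value
        self.conv = nn.Conv1d(emb, hidden, kernel_size=5, padding=2)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        x = self.byte_emb(byte_ids)            # (batch, seq, emb)
        x = self.conv(x.transpose(1, 2))       # (batch, hidden, seq)
        x = torch.relu(x).transpose(1, 2)      # (batch, seq, hidden)
        _, h = self.rnn(x)                     # h: (1, batch, hidden)
        return self.head(h.squeeze(0))         # (batch, n_classes)

# Hypothetical usage: raw bytes of a text sample in some language ->
# logits over the value inventory of one typological feature.
ids = torch.tensor([list(b"a short sample text")])
logits = TypologyPredictor(n_classes=6)(ids)
```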
arXiv Detail & Related papers (2020-04-30T21:00:53Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
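A compact sketch of singular vector canonical correlation analysis as commonly implemented (truncated SVD on each view, then CCA, then the mean canonical correlation) is given below; the component counts and the use of scikit-learn are assumptions for illustration, not the authors' code.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.cross_decomposition import CCA

def svcca_score(X: np.ndarray, Y: np.ndarray, k: int = 10) -> float:
    """Mean canonical correlation between two sets of language
    representations (rows = languages, columns = features), each first
    reduced to its top-k singular directions."""
    Xr = TruncatedSVD(n_components=k).fit_transform(X - X.mean(0))
    Yr = TruncatedSVD(n_components=k).fit_transform(Y - Y.mean(0))
    Xc, Yc = CCA(n_components=k).fit_transform(Xr, Yr)
    corrs = [np.corrcoef(Xc[:, i], Yc[:, i])[0, 1] for i in range(k)]
    return float(np.mean(corrs))

# Hypothetical usage: X could hold typology-derived language vectors and
# Y vectors learned by a multilingual NMT model, for the same languages.
```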
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
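As a rough sketch of how a benchmark of this kind is typically consumed (not code from the Multi-SimLex paper), one can compare a model's cosine similarities with the human ratings via Spearman correlation; the `embed` lookup and the source of `pairs` and `gold_scores` are placeholder assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(pairs, gold_scores, embed):
    """Spearman correlation between human similarity ratings and cosine
    similarities produced by an embedding function `embed`."""
    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    predicted = [cosine(embed(w1), embed(w2)) for w1, w2 in pairs]
    rho, _ = spearmanr(predicted, gold_scores)
    return rho

# Hypothetical usage: `pairs` and `gold_scores` would come from one of the
# monolingual or cross-lingual datasets, and `embed` from any word-vector
# model under evaluation.
```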
arXiv Detail & Related papers (2020-03-10T17:17:01Z)
- A Common Semantic Space for Monolingual and Cross-Lingual Meta-Embeddings [10.871587311621974]
This paper presents a new technique for creating monolingual and cross-lingual meta-embeddings.
Existing word vectors are projected to a common semantic space using linear transformations and averaging.
The resulting cross-lingual meta-embeddings also exhibit excellent cross-lingual transfer learning capabilities.
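A brief sketch of the projection-and-averaging idea is given below; fitting an orthogonal Procrustes mapping on anchor word pairs is one common way to realize the linear transformation and is an assumption here, not the paper's exact procedure.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def meta_embed(src_vecs, tgt_vecs, anchor_src, anchor_tgt):
    """Project source-space vectors into the target space with an
    orthogonal map fitted on anchor pairs, then average the projected
    and target vectors to obtain meta-embeddings.

    src_vecs, tgt_vecs: (n, d) arrays whose rows correspond to the same
    entries (identical words for monolingual meta-embeddings, or
    translation pairs cross-lingually).
    anchor_src, anchor_tgt: (m, d) arrays of vectors for seed pairs.
    """
    # Fit W such that anchor_src @ W approximates anchor_tgt (rotation only).
    W, _ = orthogonal_procrustes(anchor_src, anchor_tgt)
    projected = src_vecs @ W
    # Length-normalize before averaging so neither space dominates.
    def l2(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    return (l2(projected) + l2(tgt_vecs)) / 2.0
```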
arXiv Detail & Related papers (2020-01-17T15:42:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.