Language Diversity: Visible to Humans, Exploitable by Machines
- URL: http://arxiv.org/abs/2203.04723v1
- Date: Wed, 9 Mar 2022 14:04:16 GMT
- Title: Language Diversity: Visible to Humans, Exploitable by Machines
- Authors: Gábor Bella, Erdenebileg Byambadorj, Yamini Chandrashekar,
Khuyagbaatar Batsuren, Danish Ashgar Cheema, Fausto Giunchiglia
- Abstract summary: The UKC website lets users explore millions of individual words and their meanings.
The UKC LiveLanguage Catalogue provides access to the underlying lexical data in a computer-processable form.
- Score: 6.0111449785499484
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Universal Knowledge Core (UKC) is a large multilingual lexical database
with a focus on language diversity and covering over a thousand languages. The
aim of the database, as well as its tools and data catalogue, is to make the
somewhat abstract notion of diversity visually understandable for humans and
formally exploitable by machines. The UKC website lets users explore millions
of individual words and their meanings, but also phenomena of cross-lingual
convergence and divergence, such as shared interlingual meanings, lexicon
similarities, cognate clusters, or lexical gaps. The UKC LiveLanguage
Catalogue, in turn, provides access to the underlying lexical data in a
computer-processable form, ready to be reused in cross-lingual applications.
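The lexical gaps mentioned above (concepts lexicalized in one language but lacking a word in another) can be made concrete with a toy data structure. The actual UKC schema and API are not described in this abstract, so the concept IDs, language codes, and layout below are purely illustrative:

```python
# Toy multilingual lexicon keyed by interlingual concept IDs.
# The real UKC schema is not specified here; IDs and layout are hypothetical.
LEXICON = {
    "kin:older_brother": {"mn": ["ах"], "en": []},       # no single English word
    "kin:brother":       {"en": ["brother"], "mn": []},  # Mongolian splits by age
    "nat:water":         {"en": ["water"], "mn": ["ус"]},
}

def lexical_gaps(lexicon, lang_a, lang_b):
    """Concepts lexicalized in lang_a but lacking any word in lang_b."""
    return sorted(
        concept
        for concept, words in lexicon.items()
        if words.get(lang_a) and not words.get(lang_b)
    )

print(lexical_gaps(LEXICON, "mn", "en"))  # → ['kin:older_brother']
```

Keying entries by shared interlingual concepts rather than by translation pairs is what makes cross-lingual comparisons like this a simple lookup.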
Related papers
- Variationist: Exploring Multifaceted Variation and Bias in Written Language Data [3.666781404469562]
Exploring and understanding language data is a fundamental stage in all areas dealing with human language.
Yet, there is currently a lack of a unified, customizable tool to seamlessly inspect and visualize language variation and bias.
In this paper, we introduce Variationist, a highly modular, descriptive, and task-agnostic tool that fills this gap.
arXiv Detail & Related papers (2024-06-25T15:41:07Z)
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, effectively being crosslingual?
This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
- Lexical Diversity in Kinship Across Languages and Dialects [6.80465507148218]
We introduce a method to enrich computational lexicons with content relating to linguistic diversity.
The method is verified through two large-scale case studies on kinship terminology.
arXiv Detail & Related papers (2023-08-24T19:49:30Z)
- Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
Models perform significantly worse on all seven languages than on English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
- Representing Interlingual Meaning in Lexical Databases [5.654039329474587]
We show that existing lexical databases have structural limitations that result in a reduced expressivity on culturally-specific words.
In particular, the lexical meaning space of dominant languages, such as English, is represented more accurately while linguistically or culturally diverse languages are mapped in an approximate manner.
arXiv Detail & Related papers (2023-01-22T17:41:29Z)
- The Geometry of Multilingual Language Model Representations [25.880639246639323]
We assess how multilingual language models maintain a shared multilingual representation space while still encoding language-sensitive information in each language.
The subspace means differ along language-sensitive axes that are relatively stable throughout middle layers, and these axes encode information such as token vocabularies.
We visualize representations projected onto language-sensitive and language-neutral axes, identifying language family and part-of-speech clusters, along with spirals, toruses, and curves representing token position information.
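A common first step for probing the language-sensitive component this abstract describes is to subtract each language's mean embedding, leaving a roughly language-neutral residual. The sketch below illustrates that idea on plain Python lists; it does not reproduce the paper's models or projection method:

```python
from collections import defaultdict

def center_per_language(embeddings):
    """Subtract each language's mean vector from its embeddings.

    `embeddings` is a list of (language, vector) pairs; the per-language
    means approximate the language-sensitive component of a shared space.
    """
    by_lang = defaultdict(list)
    for lang, vec in embeddings:
        by_lang[lang].append(vec)
    means = {
        lang: [sum(col) / len(vecs) for col in zip(*vecs)]
        for lang, vecs in by_lang.items()
    }
    return [
        (lang, [x - m for x, m in zip(vec, means[lang])])
        for lang, vec in embeddings
    ]
```

After centering, differences between the per-language means capture the language-sensitive axes, while the residuals live in the language-neutral part of the space.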
arXiv Detail & Related papers (2022-05-22T23:58:24Z)
- Gender Bias in Multilingual Embeddings and Cross-Lingual Transfer [101.58431011820755]
We study gender bias in multilingual embeddings and how it affects transfer learning for NLP applications.
We create a multilingual dataset for bias analysis and propose several ways for quantifying bias in multilingual representations.
arXiv Detail & Related papers (2020-05-02T04:34:37Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
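Benchmarks of this kind are conventionally scored by ranking correlation between model similarity scores and the human annotations. Spearman's ρ can be sketched in a few lines; the version below assumes no tied ranks and is not tied to any particular model or to Multi-SimLex's official evaluation script:

```python
def ranks(values):
    """1-based ranks, assuming no ties (enough for a sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = float(rank)
    return out

def spearman(gold, pred):
    """Spearman rank correlation between gold and predicted similarity scores."""
    n = len(gold)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(gold), ranks(pred)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```

Because only ranks matter, the metric rewards models that order word pairs by similarity the way annotators do, regardless of the absolute scale of their scores.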
arXiv Detail & Related papers (2020-03-10T17:17:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.