Advancing the Arabic WordNet: Elevating Content Quality
- URL: http://arxiv.org/abs/2403.20215v1
- Date: Fri, 29 Mar 2024 14:54:19 GMT
- Title: Advancing the Arabic WordNet: Elevating Content Quality
- Authors: Abed Alhakim Freihat, Hadi Khalilia, Gábor Bella, Fausto Giunchiglia
- Abstract summary: We introduce a major revision of the Arabic WordNet that addresses multiple dimensions of lexico-semantic resource quality.
We update more than 58% of the synsets of the existing Arabic WordNet by adding missing information and correcting errors.
To address issues of language diversity and untranslatability, we also extend the wordnet structure with new elements: phrasets and lexical gaps.
- Score: 8.438749883590216
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: High-quality WordNets are crucial for achieving high-quality results in NLP applications that rely on such resources. However, the wordnets of most languages suffer from serious issues of correctness and completeness with respect to the words and word meanings they define, such as incorrect lemmas, missing glosses and example sentences, or an inadequate, Western-centric representation of the morphology and the semantics of the language. Previous efforts have largely focused on increasing lexical coverage while ignoring other qualitative aspects. In this paper, we focus on the Arabic language and introduce a major revision of the Arabic WordNet that addresses multiple dimensions of lexico-semantic resource quality. As a result, we updated more than 58% of the synsets of the existing Arabic WordNet by adding missing information and correcting errors. In order to address issues of language diversity and untranslatability, we also extended the wordnet structure by new elements: phrasets and lexical gaps.
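The structural elements named in the abstract (synsets, glosses, example sentences, phrasets, lexical gaps) can be pictured with a minimal sketch. This is not the authors' schema: the class name `Synset`, every field name, and the `missing_fields` check are assumptions made for illustration. It also assumes the usual wordnet-literature reading that a phraset is a set of free word combinations expressing a concept, and that a lexical gap marks a concept with no lexicalized word in the language. The data is placeholder text, not taken from the Arabic WordNet.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Synset:
    """Hypothetical synset record; field names are illustrative assumptions."""
    synset_id: str
    lemmas: List[str] = field(default_factory=list)    # lexicalized words
    gloss: str = ""                                     # definition text
    examples: List[str] = field(default_factory=list)  # example sentences
    phraset: List[str] = field(default_factory=list)   # free word combinations
    is_lexical_gap: bool = False                        # no lexicalized word exists

    def missing_fields(self) -> List[str]:
        """Return the names of fields a quality check would flag as empty."""
        missing = []
        # A lexical gap has no lemmas by definition, so lemmas are only
        # required when the concept is lexicalized in the language.
        if not self.is_lexical_gap and not self.lemmas:
            missing.append("lemmas")
        if not self.gloss:
            missing.append("gloss")
        # Example sentences illustrate a lemma, so they are skipped for gaps.
        if not self.is_lexical_gap and not self.examples:
            missing.append("examples")
        return missing


# Placeholder data, invented for illustration only.
incomplete = Synset(synset_id="demo-001", lemmas=["word"])
gap = Synset(
    synset_id="demo-002",
    gloss="a concept with no single-word equivalent",
    phraset=["descriptive phrase"],
    is_lexical_gap=True,
)
```

Calling `incomplete.missing_fields()` flags the gloss and examples that a revision pass would need to add, while `gap.missing_fields()` returns an empty list because a gap is expected to carry no lemmas.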
Related papers
- Word Sense Disambiguation in Native Spanish: A Comprehensive Lexical Evaluation Resource [2.7775559369441964]
A lexical meaning of a word in context can be determined automatically by Word Sense Disambiguation (WSD) algorithms.
This study introduces a new resource for Spanish WSD.
It includes a sense inventory and a lexical dataset sourced from the Diccionario de la Lengua Española.
arXiv Detail & Related papers (2024-09-30T17:22:33Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages [3.716965622352967]
We propose new criteria to evaluate the quality of lexical representation and vocabulary overlap observed in sub-word tokenizers.
Our findings show that vocabulary overlap across languages can actually be detrimental to certain downstream tasks.
arXiv Detail & Related papers (2023-05-26T18:06:49Z)
- Towards preserving word order importance through Forced Invalidation [80.33036864442182]
We show that pre-trained language models are insensitive to word order.
We propose Forced Invalidation to help preserve the importance of word order.
Our experiments demonstrate that Forced Invalidation significantly improves the sensitivity of the models to word order.
arXiv Detail & Related papers (2023-04-11T13:42:10Z)
- CLSE: Corpus of Linguistically Significant Entities [58.29901964387952]
We release a Corpus of Linguistically Significant Entities (CLSE) annotated by experts.
CLSE covers 74 different semantic types to support various applications from airline ticketing to video games.
We create a linguistically representative NLG evaluation benchmark in three languages: French, Marathi, and Russian.
arXiv Detail & Related papers (2022-11-04T12:56:12Z)
- On the Difficulty of Translating Free-Order Case-Marking Languages [2.9434930072968584]
We investigate whether free-order case-marking languages are more difficult to translate with state-of-the-art Neural Machine Translation (NMT) models.
We find that word order flexibility in the source language only leads to a very small loss of NMT quality.
In medium- and low-resource settings, the overall NMT quality of fixed-order languages remains unmatched.
arXiv Detail & Related papers (2021-07-13T13:09:58Z)
- Speakers Fill Lexical Semantic Gaps with Context [65.08205006886591]
We operationalise the lexical ambiguity of a word as the entropy of meanings it can take.
We find significant correlations between our estimate of ambiguity and the number of synonyms a word has in WordNet.
This suggests that, in the presence of ambiguity, speakers compensate by making contexts more informative.
arXiv Detail & Related papers (2020-10-05T17:19:10Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)
- Multi-Fusion Chinese WordNet (MCW): Compound of Machine Learning and Manual Correction [7.471172518764192]
Five Chinese wordnets have been developed to solve the problems of syntax and semantics.
They include Northeastern University Chinese WordNet (NEW), Sinica Bilingual Ontological WordNet (BOW), Southeast University Chinese WordNet (SEW), Taiwan University Chinese WordNet (CWN), and Chinese Open WordNet (COW).
We decided to build a new Chinese wordnet, the Multi-Fusion Chinese WordNet (MCW), to make up for their shortcomings.
arXiv Detail & Related papers (2020-02-05T12:44:01Z)
- On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models that fit into the word order of the source language might fail to handle target languages.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.