A Benchmark and Scoring Algorithm for Enriching Arabic Synonyms
- URL: http://arxiv.org/abs/2302.02232v1
- Date: Sat, 4 Feb 2023 20:30:32 GMT
- Title: A Benchmark and Scoring Algorithm for Enriching Arabic Synonyms
- Authors: Sana Ghanem, Mustafa Jarrar, Radi Jarrar, Ibrahim Bounhas
- Abstract summary: Given a mono/multilingual synset and a threshold (a fuzzy value [0-1]), our goal is to extract new synonyms above this threshold from existing lexicons.
The dataset consists of 3K candidate synonyms for 500 synsets.
Our evaluations show that the algorithm behaves like a linguist and its fuzzy values are close to those proposed by linguists.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper addresses the task of extending a given synset with additional
synonyms taking into account synonymy strength as a fuzzy value. Given a
mono/multilingual synset and a threshold (a fuzzy value [0-1]), our goal is to
extract new synonyms above this threshold from existing lexicons. We present
twofold contributions: an algorithm and a benchmark dataset. The dataset
consists of 3K candidate synonyms for 500 synsets. Each candidate synonym is
annotated with a fuzzy value by four linguists. The dataset is important for
(i) understanding how much linguists (dis/)agree on synonymy, in addition to
(ii) using the dataset as a baseline to evaluate our algorithm. Our proposed
algorithm extracts synonyms from existing lexicons and computes a fuzzy value
for each candidate. Our evaluations show that the algorithm behaves like a
linguist and its fuzzy values are close to those proposed by linguists (using
RMSE and MAE). The dataset and a demo page are publicly available at
https://portal.sina.birzeit.edu/synonyms.
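The abstract describes two mechanical steps: keeping candidates whose fuzzy value clears a threshold, and comparing the algorithm's fuzzy values with the linguists' values using RMSE and MAE. The following is a minimal sketch of those two steps only, not the authors' scoring algorithm. The function names, the candidate labels, the sample scores, the inclusive threshold comparison, and the choice to average the four linguists' annotations into a single reference value are all assumptions made for illustration.

```python
import math
from statistics import mean


def filter_by_threshold(candidates, threshold):
    """Keep candidates whose fuzzy value reaches the threshold.

    Assumption: the comparison is inclusive (>=); the abstract only says "above".
    """
    return {word: score for word, score in candidates.items() if score >= threshold}


def rmse(predicted, reference):
    """Root mean squared error between two equal-length score sequences."""
    return math.sqrt(mean((p - r) ** 2 for p, r in zip(predicted, reference)))


def mae(predicted, reference):
    """Mean absolute error between two equal-length score sequences."""
    return mean(abs(p - r) for p, r in zip(predicted, reference))


if __name__ == "__main__":
    # Hypothetical fuzzy values produced by some scoring algorithm.
    algorithm_scores = {"candidate_a": 0.9, "candidate_b": 0.5, "candidate_c": 0.2}

    # Hypothetical annotations from four linguists per candidate.
    linguist_scores = {
        "candidate_a": [1.0, 0.8, 0.9, 0.9],
        "candidate_b": [0.6, 0.4, 0.5, 0.7],
        "candidate_c": [0.0, 0.2, 0.1, 0.3],
    }

    # Assumption: the reference value is the mean of the four annotations.
    words = sorted(algorithm_scores)
    predicted = [algorithm_scores[w] for w in words]
    reference = [mean(linguist_scores[w]) for w in words]

    print("kept:", filter_by_threshold(algorithm_scores, threshold=0.5))
    print("RMSE:", rmse(predicted, reference))
    print("MAE:", mae(predicted, reference))
```

RMSE penalizes large individual disagreements more heavily than MAE, so reporting both gives a sense of whether the errors are evenly spread or concentrated in a few candidates.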
Related papers
- Computational Approaches for Integrating out Subjectivity in Cognate Synonym Selection [45.14832807541816]
In the early days of language phylogenetics it was recommended to select one synonym only.
We show that binary character matrices do allow for representing the entire dataset including all synonyms.
We also make available a Python interface for generating all of the above character matrix types for cognate data provided in CLDF format.
arXiv Detail & Related papers (2024-04-30T07:52:26Z)
- Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing such interactions.
This suggests LMs may potentially serve as more useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z)
- Disambiguation of Company names via Deep Recurrent Networks [101.90357454833845]
We propose a Siamese LSTM Network approach to extract -- via supervised learning -- an embedding of company name strings.
We analyse how an Active Learning approach to prioritise the samples to be labelled leads to a more efficient overall learning pipeline.
arXiv Detail & Related papers (2023-03-07T15:07:57Z)
- Text Summarization with Oracle Expectation [88.39032981994535]
Extractive summarization produces summaries by identifying and concatenating the most important sentences in a document.
Most summarization datasets do not come with gold labels indicating whether document sentences are summary-worthy.
We propose a simple yet effective labeling algorithm that creates soft, expectation-based sentence labels.
arXiv Detail & Related papers (2022-09-26T14:10:08Z)
- Always Keep your Target in Mind: Studying Semantics and Improving Performance of Neural Lexical Substitution [124.99894592871385]
We present a large-scale comparative study of lexical substitution methods employing both old and most recent language models.
We show that already competitive results achieved by SOTA LMs/MLMs can be further substantially improved if information about the target word is injected properly.
arXiv Detail & Related papers (2022-06-07T16:16:19Z)
- Semantic-Preserving Adversarial Text Attacks [85.32186121859321]
We propose a Bigram and Unigram based adaptive Semantic Preservation Optimization (BU-SPO) method to examine the vulnerability of deep models.
Our method achieves the highest attack success rates and semantics rates by changing the smallest number of words compared with existing methods.
arXiv Detail & Related papers (2021-08-23T09:05:18Z)
- Interval Probabilistic Fuzzy WordNet [8.396691008449704]
We present an algorithm for constructing the Interval Probabilistic Fuzzy (IPF) synsets in any language.
We constructed and published the IPF synsets of WordNet for the English language.
arXiv Detail & Related papers (2021-04-04T17:28:37Z)
- Extracting Synonyms from Bilingual Dictionaries [1.1470070927586016]
We present our progress in developing a novel algorithm to extract synonyms from bilingual dictionaries.
The idea is to construct a translation graph from translation pairs, then to extract and consolidate cyclic paths to form bilingual sets of synonyms.
The initial evaluation of this algorithm illustrates promising results in extracting Arabic-English bilingual synonyms.
arXiv Detail & Related papers (2020-12-01T16:09:22Z)
- PARADE: A New Dataset for Paraphrase Identification Requiring Computer Science Domain Knowledge [35.66853329610162]
PARADE contains paraphrases that overlap very little at the lexical and syntactic level but are semantically equivalent based on computer science domain knowledge.
Experiments show that both state-of-the-art neural models and non-expert human annotators have poor performance on PARADE.
arXiv Detail & Related papers (2020-10-08T02:01:31Z)
- SynSetExpan: An Iterative Framework for Joint Entity Set Expansion and Synonym Discovery [66.24624547470175]
SynSetExpan is a novel framework that enables two tasks to mutually enhance each other.
We create the first large-scale Synonym-Enhanced Set Expansion dataset via crowdsourcing.
Experiments on the SE2 dataset and previous benchmarks demonstrate the effectiveness of SynSetExpan for both entity set expansion and synonym discovery tasks.
arXiv Detail & Related papers (2020-09-29T07:32:17Z)