A Novel Challenge Set for Hebrew Morphological Disambiguation and
Diacritics Restoration
- URL: http://arxiv.org/abs/2010.02864v1
- Date: Tue, 6 Oct 2020 16:34:03 GMT
- Authors: Avi Shmidman, Joshua Guedalia, Shaltiel Shmidman, Moshe Koppel, Reut
Tsarfaty
- Abstract summary: We offer a challenge set for Hebrew homographs -- the first of its kind.
We show that the current SOTA of Hebrew disambiguation performs poorly on cases of unbalanced ambiguity.
We achieve a new state-of-the-art for all 21 words, improving the overall average F1 score from 0.67 to 0.95.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the primary tasks of morphological parsers is the disambiguation of
homographs. Particularly difficult are cases of unbalanced ambiguity, where one
of the possible analyses is far more frequent than the others. In such cases,
there may not exist sufficient examples of the minority analyses in order to
properly evaluate performance, nor to train effective classifiers. In this
paper we address the issue of unbalanced morphological ambiguities in Hebrew.
We offer a challenge set for Hebrew homographs -- the first of its kind --
containing substantial attestation of each analysis of 21 Hebrew homographs. We
show that the current SOTA of Hebrew disambiguation performs poorly on cases of
unbalanced ambiguity. Leveraging our new dataset, we achieve a new
state-of-the-art for all 21 words, improving the overall average F1 score from
0.67 to 0.95. Our resulting annotated datasets are made publicly available for
further research.
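To illustrate why unbalanced ambiguity is hard to evaluate, here is a minimal sketch in Python. It is not the authors' code or data: the labels `majority` and `minority`, the 95/5 split, and the always-majority baseline are all hypothetical, and the unweighted (macro) average over analyses is an assumption, since the abstract does not say how its "overall average F1" is computed. The sketch shows that a tagger which always picks the frequent analysis reaches 0.95 accuracy while its F1 on the rare analysis is 0, which pulls the macro-averaged F1 down to about 0.49.

```python
def f1_per_class(gold, pred):
    """Return {label: F1} for every label that appears in gold or pred."""
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1, so rare analyses count equally."""
    scores = f1_per_class(gold, pred)
    return sum(scores.values()) / len(scores)


# Hypothetical homograph with two analyses in a 95/5 split (made-up numbers).
gold = ["majority"] * 95 + ["minority"] * 5
# Hypothetical baseline that always predicts the frequent analysis.
pred = ["majority"] * 100

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
per_class = f1_per_class(gold, pred)
macro = macro_f1(gold, pred)

print(f"accuracy      = {accuracy:.2f}")
print(f"F1 (majority) = {per_class['majority']:.2f}")
print(f"F1 (minority) = {per_class['minority']:.2f}")
print(f"macro F1      = {macro:.2f}")
```

With only five minority examples in this toy sample, a single prediction changes the minority F1 substantially, which is the kind of evaluation problem a set with substantial attestation of every analysis is meant to avoid.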
Related papers
- Beyond Coarse-Grained Matching in Video-Text Retrieval [50.799697216533914]
We introduce a new approach for fine-grained evaluation.
Our approach can be applied to existing datasets by automatically generating hard negative test captions.
Experiments on our fine-grained evaluations demonstrate that this approach enhances a model's ability to understand fine-grained differences.
arXiv Detail & Related papers (2024-10-16T09:42:29Z)
- Persian Homograph Disambiguation: Leveraging ParsBERT for Enhanced Sentence Understanding with a Novel Word Disambiguation Dataset [0.0]
We introduce a novel dataset tailored for Persian homograph disambiguation.
Our work encompasses a thorough exploration of various embeddings, evaluated through the cosine similarity method.
We scrutinize the models' performance in terms of Accuracy, Recall, and F1 Score.
arXiv Detail & Related papers (2024-05-24T14:56:36Z)
- Do Pretrained Contextual Language Models Distinguish between Hebrew Homograph Analyses? [12.631897904322676]
We study the extent to which Hebrew homographs can be disambiguated and analyzed using pre-trained language models.
We show that contemporary Hebrew contextualized embeddings outperform non-contextualized embeddings.
We also show that these embeddings are equally effective for homographs of both balanced and skewed distributions.
arXiv Detail & Related papers (2024-05-11T21:50:56Z)
- Revisiting subword tokenization: A case study on affixal negation in large language models [57.75279238091522]
We measure the impact of affixal negation on modern English large language models (LLMs).
We conduct experiments using LLMs with different subword tokenization methods.
We show that models can, on the whole, reliably recognize the meaning of affixal negation.
arXiv Detail & Related papers (2024-04-03T03:14:27Z)
- Understanding and Mitigating Classification Errors Through Interpretable Token Patterns [58.91023283103762]
Characterizing errors in easily interpretable terms gives insight into whether a classifier is prone to making systematic errors.
We propose to discover those patterns of tokens that distinguish correct and erroneous predictions.
We show that our method, Premise, performs well in practice.
arXiv Detail & Related papers (2023-11-18T00:24:26Z)
- Investigating Multilingual Coreference Resolution by Universal Annotations [11.035051211351213]
We study coreference by examining the ground truth data at different linguistic levels.
We perform an error analysis of the most challenging cases that the SotA system fails to resolve.
We extract features from universal morphosyntactic annotations and integrate these features into a baseline system to assess their potential benefits.
arXiv Detail & Related papers (2023-10-26T18:50:04Z)
- We're Afraid Language Models Aren't Modeling Ambiguity [136.8068419824318]
Managing ambiguity is a key part of human language understanding.
We characterize ambiguity in a sentence by its effect on entailment relations with another sentence.
We show that a multilabel NLI model can flag political claims in the wild that are misleading due to ambiguity.
arXiv Detail & Related papers (2023-04-27T17:57:58Z)
- A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes.
We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z)
- Modelling Latent Translations for Cross-Lingual Transfer [47.61502999819699]
We propose a new technique that integrates both steps of the traditional pipeline (translation and classification) into a single model.
We evaluate our novel latent translation-based model on a series of multilingual NLU tasks.
We report gains for both zero-shot and few-shot learning setups, up to 2.7 accuracy points on average.
arXiv Detail & Related papers (2021-07-23T17:11:27Z)
- Inference-only sub-character decomposition improves translation of unseen logographic characters [18.148675498274866]
Neural Machine Translation (NMT) on logographic source languages struggles when translating 'unseen' characters.
We investigate existing ideograph-based sub-character decomposition approaches for Chinese-to-English and Japanese-to-English NMT.
We find that complete sub-character decomposition often harms unseen character translation, and gives inconsistent results generally.
arXiv Detail & Related papers (2020-11-12T17:36:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.