MRL Parsing Without Tears: The Case of Hebrew
- URL: http://arxiv.org/abs/2403.06970v1
- Date: Mon, 11 Mar 2024 17:54:33 GMT
- Title: MRL Parsing Without Tears: The Case of Hebrew
- Authors: Shaltiel Shmidman, Avi Shmidman, Moshe Koppel, Reut Tsarfaty
- Abstract summary: In morphologically rich languages (MRLs), where parsers need to identify multiple lexical units in each token, existing systems suffer in latency and setup complexity.
We present a new "flipped pipeline": decisions are made directly on the whole-token units by expert classifiers, each one dedicated to one specific task.
This blazingly fast approach sets a new SOTA in Hebrew POS tagging and dependency parsing, while also reaching near-SOTA performance on other Hebrew tasks.
- Score: 14.104766026682384
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Syntactic parsing remains a critical tool for relation extraction and
information extraction, especially in resource-scarce languages where LLMs are
lacking. Yet in morphologically rich languages (MRLs), where parsers need to
identify multiple lexical units in each token, existing systems suffer in
latency and setup complexity. Some use a pipeline to peel away the layers:
first segmentation, then morphology tagging, and then syntax parsing; however,
errors in earlier layers are then propagated forward. Others use a joint
architecture to evaluate all permutations at once; while this improves
accuracy, it is notoriously slow. In contrast, and taking Hebrew as a test
case, we present a new "flipped pipeline": decisions are made directly on the
whole-token units by expert classifiers, each one dedicated to one specific
task. The classifiers are independent of one another, and only at the end do we
synthesize their predictions. This blazingly fast approach sets a new SOTA in
Hebrew POS tagging and dependency parsing, while also reaching near-SOTA
performance on other Hebrew NLP tasks. Because our architecture does not rely
on any language-specific resources, it can serve as a model to develop similar
parsers for other MRLs.
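To make the contrast with a traditional segment-then-tag-then-parse pipeline concrete, below is a minimal PyTorch sketch of the flipped-pipeline idea, not the authors' implementation: the encoder, head names, and label-set sizes are assumptions, but the structure mirrors the abstract, with independent per-task expert classifiers reading the same whole-token representations and their outputs combined only at the end.

```python
# A minimal sketch of the "flipped pipeline" idea, NOT the authors' code:
# independent per-task classifiers ("experts") all read the same whole-token
# representations, and their outputs are combined only at the very end.
# The head names and label-set sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class ExpertHead(nn.Module):
    """One classifier dedicated to a single task (e.g. POS, segmentation)."""
    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_labels)

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        return self.proj(token_states).argmax(dim=-1)   # per-token decisions

class FlippedPipeline(nn.Module):
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        # Hypothetical label-set sizes; each head is independent of the others.
        self.heads = nn.ModuleDict({
            "segmentation": ExpertHead(hidden_size, num_labels=16),
            "pos":          ExpertHead(hidden_size, num_labels=50),
            "morphology":   ExpertHead(hidden_size, num_labels=200),
        })

    def forward(self, token_states: torch.Tensor) -> dict:
        # All experts see the same whole-token states; no expert waits on another.
        preds = {task: head(token_states) for task, head in self.heads.items()}
        return self.synthesize(preds)

    @staticmethod
    def synthesize(preds: dict) -> dict:
        # Only here, after all experts have run, are predictions merged
        # into a single analysis per token (placeholder merge).
        return preds

# Usage: states for a 1-sentence batch of 5 whole tokens from some encoder.
states = torch.randn(1, 5, 768)
print({k: v.shape for k, v in FlippedPipeline()(states).items()})
```

Because no expert waits on another's output, early segmentation errors cannot cascade into tagging or parsing, and the heads can be trained and run in parallel, which is where the speed advantage described in the abstract comes from.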
Related papers
- Training Neural Networks as Recognizers of Formal Languages [87.06906286950438]
Formal language theory pertains specifically to recognizers.
It is common to instead use proxy tasks that are similar in only an informal sense.
We correct this mismatch by training and evaluating neural networks directly as binary classifiers of strings.
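As a rough illustration of that setup (not code from the paper), one can train a small network end-to-end as a binary accept/reject classifier for a toy formal language; the language a^n b^n, the LSTM architecture, and the hyperparameters below are all assumptions.

```python
# A minimal sketch (not from the paper) of training a network directly as a
# binary recognizer of a formal language, here the toy language { a^n b^n }.
import random
import torch
import torch.nn as nn

def sample(max_n: int = 8):
    """Return (string, label): label 1 iff the string is in a^n b^n."""
    n = random.randint(1, max_n)
    if random.random() < 0.5:
        return "a" * n + "b" * n, 1
    # Negative example: a random string over {a, b} that is not in the language.
    while True:
        s = "".join(random.choice("ab") for _ in range(2 * n))
        if s != "a" * n + "b" * n:
            return s, 0

VOCAB = {"a": 0, "b": 1}

class Recognizer(nn.Module):
    def __init__(self, dim: int = 32):
        super().__init__()
        self.emb = nn.Embedding(len(VOCAB), dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, 1)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        _, (h, _) = self.rnn(self.emb(ids))
        return self.out(h[-1]).squeeze(-1)      # one accept/reject logit

model = Recognizer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(200):                          # tiny training loop
    s, y = sample()
    ids = torch.tensor([[VOCAB[c] for c in s]])
    loss = loss_fn(model(ids), torch.tensor([float(y)]))
    opt.zero_grad()
    loss.backward()
    opt.step()
```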
arXiv Detail & Related papers (2024-11-11T16:33:25Z) - MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization [81.83460411131931]
In multilingual settings, non-Latin scripts and low-resource languages are usually disadvantaged in terms of language models' utility, efficiency, and cost.
We propose adaptive gradient-based subword tokenization to reduce over-segmentation in multilingual settings.
arXiv Detail & Related papers (2024-07-11T18:59:21Z) - A Truly Joint Neural Architecture for Segmentation and Parsing [15.866519123942457]
Parsing performance for Morphologically Rich Languages (MRLs) is lower than for other languages.
Due to high morphological complexity and ambiguity of the space-delimited input tokens, the linguistic units that act as nodes in the tree are not known in advance.
We introduce a joint neural architecture where a lattice-based representation preserving all morphological ambiguity of the input is provided to an arc-factored model, which then solves the morphological and syntactic parsing tasks at once.
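To illustrate what a lattice that preserves morphological ambiguity might look like (an illustrative sketch, not that paper's code), here is a tiny example for one ambiguous Hebrew token; the node indices, POS tags, and edge layout are assumptions.

```python
# A small illustrative sketch of a morphological lattice for one ambiguous
# Hebrew token: every competing segmentation is kept as a path of edges, and a
# downstream arc-factored parser would score paths jointly rather than
# committing to one segmentation up front.
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    start: int      # lattice node where this morpheme begins
    end: int        # lattice node where it ends
    form: str       # surface form of the morpheme
    pos: str        # candidate part-of-speech tag

# The token "בצל" can be read as the noun "onion" or as "ב" (in) + "צל" (shadow).
lattice = [
    Edge(0, 2, "בצל", "NOUN"),          # single-morpheme analysis
    Edge(0, 1, "ב",   "ADP"),           # prepositional prefix ...
    Edge(1, 2, "צל",  "NOUN"),          # ... followed by a noun
]

def paths(edges, start=0, goal=2):
    """Enumerate all full analyses (paths from the start node to the goal node)."""
    if start == goal:
        yield []
        return
    for e in edges:
        if e.start == start:
            for rest in paths(edges, e.end, goal):
                yield [e] + rest

for p in paths(lattice):
    print(" + ".join(f"{e.form}/{e.pos}" for e in p))
```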
arXiv Detail & Related papers (2024-02-04T16:56:08Z) - Multilingual Sequence-to-Sequence Models for Hebrew NLP [16.010560946005473]
We show that sequence-to-sequence generative architectures are more suitable for morphologically rich languages (MRLs) such as Hebrew.
We demonstrate that by casting tasks in the Hebrew NLP pipeline as text-to-text tasks, we can leverage powerful multilingual, pretrained sequence-to-sequence models such as mT5.
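As a hedged illustration of that text-to-text framing (not the paper's prompt scheme or checkpoints), one could feed mT5 a task-prefixed input and decode the answer as plain text; the "segment:" prefix, the task, and the off-the-shelf google/mt5-small checkpoint below are assumptions, and a fine-tuned model would be needed for useful output.

```python
# A hedged sketch of casting a Hebrew pipeline task as text-to-text with mT5.
# The task prefix and target format below are illustrative assumptions, not the
# prompt scheme used in the paper.
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

# Hypothetical text-to-text framing of morphological segmentation:
# input "segment: <token>" -> output the space-separated morphemes.
inputs = tokenizer("segment: בצל", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```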
arXiv Detail & Related papers (2022-12-19T18:10:23Z) - Zero-Shot Cross-lingual Semantic Parsing [56.95036511882921]
We study cross-lingual semantic parsing as a zero-shot problem without parallel data for 7 test languages.
We propose a multi-task encoder-decoder model to transfer parsing knowledge to additional languages using only English-Logical form paired data.
Our system frames zero-shot parsing as a latent-space alignment problem and finds that pre-trained models can be improved to generate logical forms with minimal cross-lingual transfer penalty.
arXiv Detail & Related papers (2021-04-15T16:08:43Z) - Low-Resource Task-Oriented Semantic Parsing via Intrinsic Modeling [65.51280121472146]
We exploit what we intrinsically know about ontology labels to build efficient semantic parsing models.
Our model is highly efficient, as evaluated on a low-resource benchmark derived from TOPv2.
arXiv Detail & Related papers (2021-04-15T04:01:02Z) - Multilingual Autoregressive Entity Linking [49.35994386221958]
mGENRE is a sequence-to-sequence system for the Multilingual Entity Linking problem.
For a mention in a given language, mGENRE predicts the name of the target entity left-to-right, token-by-token.
We show the efficacy of our approach through extensive evaluation including experiments on three popular MEL benchmarks.
arXiv Detail & Related papers (2021-03-23T13:25:55Z) - Applying Occam's Razor to Transformer-Based Dependency Parsing: What Works, What Doesn't, and What is Really Necessary [9.347252855045125]
We study the choice of pre-trained embeddings and whether additional LSTM layers are needed in graph-based dependency parsers.
We propose a simple but widely applicable architecture and configuration, achieving new state-of-the-art results (in terms of LAS) for 10 out of 12 diverse languages.
arXiv Detail & Related papers (2020-10-23T22:58:26Z) - Don't Parse, Insert: Multilingual Semantic Parsing with Insertion Based Decoding [10.002379593718471]
A successful parse transforms an input utterance to an action that is easily understood by the system.
For complex parsing tasks, state-of-the-art methods are based on autoregressive sequence-to-sequence models that generate the parse directly.
arXiv Detail & Related papers (2020-10-08T01:18:42Z) - FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We further propose a KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
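A minimal sketch of such a self-teaching term, assuming per-example class logits and not claiming to reproduce FILTER's exact formulation, could look like this:

```python
# Illustrative KL-divergence self-teaching loss: soft pseudo-labels predicted on
# the translated text act as a teacher for predictions on the target-language text.
import torch
import torch.nn.functional as F

def self_teaching_loss(target_logits: torch.Tensor,
                       translation_logits: torch.Tensor) -> torch.Tensor:
    # The teacher distribution comes from the translation and is not back-propagated through.
    teacher = F.softmax(translation_logits.detach(), dim=-1)
    student = F.log_softmax(target_logits, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean")

# Usage: class logits for a batch of 4 examples over 3 labels.
loss = self_teaching_loss(torch.randn(4, 3), torch.randn(4, 3))
```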
arXiv Detail & Related papers (2020-09-10T22:42:15Z) - Parsing as Pretraining [13.03764728768944]
We first cast constituent and dependency parsing as sequence tagging.
We then use a single feed-forward layer to directly map word vectors to labels that encode a linearized tree.
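A minimal sketch of that idea (illustrative, not the paper's code or tag set) is a single linear layer over word vectors producing tags that encode head offsets and dependency relations:

```python
# Parsing-as-tagging sketch: one feed-forward layer maps each word vector to a
# label from a tag set that encodes a linearized tree. The tag inventory below
# is a hypothetical "(head offset, relation)" scheme for illustration only.
import torch
import torch.nn as nn

TAGS = ["+1:nsubj", "0:root", "-1:obj", "+2:amod"]

ffn = nn.Linear(768, len(TAGS))       # the "single feed-forward layer"

word_vectors = torch.randn(6, 768)    # 6 words from some pretrained encoder
tag_ids = ffn(word_vectors).argmax(dim=-1)
print([TAGS[i] for i in tag_ids.tolist()])
```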
arXiv Detail & Related papers (2020-02-05T08:43:02Z)