A Second Wave of UD Hebrew Treebanking and Cross-Domain Parsing
- URL: http://arxiv.org/abs/2210.07873v2
- Date: Tue, 18 Oct 2022 14:53:07 GMT
- Title: A Second Wave of UD Hebrew Treebanking and Cross-Domain Parsing
- Authors: Amir Zeldes, Nick Howell, Noam Ordan and Yifat Ben Moshe
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Foundational Hebrew NLP tasks such as segmentation, tagging and parsing have
relied to date on various versions of the Hebrew Treebank (HTB, Sima'an et al.
2001). However, the data in HTB, a single-source newswire corpus, is now over
30 years old, and does not cover many aspects of contemporary Hebrew on the
web. This paper presents a new, freely available UD treebank of Hebrew
stratified from a range of topics selected from Hebrew Wikipedia. In addition
to introducing the corpus and evaluating the quality of its annotations, we
deploy automatic validation tools based on grew (Guillaume, 2021), and conduct
the first cross-domain parsing experiments in Hebrew. We obtain new
state-of-the-art (SOTA) results on UD NLP tasks, using a combination of the
latest language modelling and some incremental improvements to existing
transformer-based approaches. We also release a new version of the UD HTB
matching annotation scheme updates from our new corpus.
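To make the grew-based validation step concrete: such tools query dependency graphs for structures that violate UD constraints. The sketch below is a hedged illustration, not the authors' actual rule set; it checks one representative constraint (no token may govern two bare nsubj dependents, which in grew's query language corresponds roughly to pattern { H -[nsubj]-> S1; H -[nsubj]-> S2; S1 << S2 }) over a CoNLL-U file using only the Python standard library.

```python
# Hedged sketch of a treebank validation check, assuming plain CoNLL-U input.
# Not the paper's actual grew rules; grew matches graph patterns directly,
# while this reimplements one such pattern in pure Python.
import sys
from collections import defaultdict

def read_conllu_sentences(path):
    """Yield each sentence as a list of 10-column token rows."""
    sent = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line:
                if sent:
                    yield sent
                sent = []
            elif not line.startswith("#"):
                cols = line.split("\t")
                # Skip multiword-token ranges ("4-5") and empty nodes ("4.1");
                # ranges are common in Hebrew, where clitics are segmented.
                if cols[0].isdigit():
                    sent.append(cols)
    if sent:
        yield sent

def find_double_subjects(path):
    """Flag any head governing more than one plain nsubj dependent."""
    for i, sent in enumerate(read_conllu_sentences(path), start=1):
        subjects = defaultdict(list)  # head id -> subject forms
        for cols in sent:
            form, head, deprel = cols[1], cols[6], cols[7]
            if deprel == "nsubj":
                subjects[head].append(form)
        for head, forms in subjects.items():
            if len(forms) > 1:
                print(f"sentence {i}: head {head} has multiple nsubj: {forms}")

if __name__ == "__main__":
    find_double_subjects(sys.argv[1])  # e.g. python check_nsubj.py treebank.conllu
```

Matching the label "nsubj" exactly (rather than subtypes such as nsubj:pass) mirrors grew's exact-label matching; a real validation suite would bundle many such patterns.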
Related papers
- HebDB: a Weakly Supervised Dataset for Hebrew Speech Processing
HebDB is a weakly supervised dataset for spoken language processing in Hebrew.
It offers roughly 2500 hours of natural and spontaneous speech recordings.
arXiv Detail & Related papers (2024-07-10T11:51:26Z)
- Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects
Yorùbá is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z)
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank
We present the first multi-dialect Bavarian treebank (MaiBaam) manually annotated with part-of-speech and syntactic dependency information in Universal Dependencies (UD).
We highlight the morphosyntactic differences between the closely-related Bavarian and German and showcase the rich variability of speakers' orthographies.
Our corpus includes 15k tokens, covering dialects from all Bavarian-speaking areas spanning three countries.
arXiv Detail & Related papers (2024-03-15T13:33:10Z)
- Extending Multilingual Machine Translation through Imitation Learning
Imit-MNMT treats the task as an imitation learning process, which mimics the behavior of an expert.
We show that our approach significantly improves the translation performance between the new and the original languages.
We also demonstrate that our approach is capable of solving copy and off-target problems.
arXiv Detail & Related papers (2023-11-14T21:04:03Z)
- Introducing DictaLM -- A Large Generative Language Model for Modern Hebrew
We present DictaLM, a large-scale language model tailored for Modern Hebrew.
As a commitment to promoting research and development in the Hebrew language, we release both the foundation model and the instruct-tuned model under a Creative Commons license.
arXiv Detail & Related papers (2023-09-25T22:42:09Z)
- DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew
We present DictaBERT, a new state-of-the-art pre-trained BERT model for modern Hebrew.
We release three fine-tuned versions of the model, designed to perform three foundational tasks in the analysis of Hebrew texts.
arXiv Detail & Related papers (2023-08-31T12:43:18Z)
- ParaShoot: A Hebrew Question Answering Dataset
ParaShoot is the first question-answering dataset in modern Hebrew.
We provide the first baseline results using recently-released BERT-style models for Hebrew.
arXiv Detail & Related papers (2021-09-23T11:59:38Z)
- Assessing the quality of sources in Wikidata across languages: a hybrid approach
We run a series of microtask experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages.
We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata.
The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
arXiv Detail & Related papers (2021-09-20T10:06:46Z)
- Unsupervised Paraphrasing with Pretrained Language Models
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- Building a Hebrew Semantic Role Labeling Lexical Resource from Parallel Movie Subtitles
We present a semantic role labeling resource for Hebrew built semi-automatically through annotation projection from English.
This corpus is derived from the multilingual OpenSubtitles dataset and includes short informal sentences.
We provide a fully annotated version of the data including morphological analysis, dependency syntax and semantic role labeling in both FrameNet and PropBank styles.
We train a neural SRL model on this Hebrew resource exploiting the pre-trained multilingual BERT transformer model, and provide the first available baseline model for Hebrew SRL as a reference point (a minimal sketch of this setup appears below).
arXiv Detail & Related papers (2020-05-17T10:03:42Z)
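As a minimal sketch of the kind of mBERT-based tagging setup the SRL entry above describes (the label set, example sentence, and word-level decoding convention are illustrative assumptions, not the cited paper's exact design), PropBank-style argument labeling can be framed as token classification with the HuggingFace transformers library:

```python
# Illustrative sketch only: SRL as BIO token classification over multilingual
# BERT. Labels and data are placeholders; the cited paper's architecture,
# label inventory, and training details may differ.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-ARG0", "I-ARG0", "B-ARG1", "I-ARG1", "B-V"]  # toy tag set
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels)
)

# One pre-tokenized Hebrew sentence; real training would iterate over the
# projected OpenSubtitles annotations described above.
words = ["הילד", "אכל", "תפוח"]  # "the boy ate an apple"
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits  # shape: (1, num_subwords, num_labels)

# Map subword predictions back to words: label each word by its first
# subword, a common convention for BERT token classification.
pred = logits.argmax(-1)[0].tolist()
seen = set()
for idx, wid in enumerate(enc.word_ids()):
    if wid is not None and wid not in seen:
        seen.add(wid)
        print(words[wid], labels[pred[idx]])
```

The classification head here is randomly initialized; in practice it would be fine-tuned on the projected FrameNet/PropBank annotations before its predictions are meaningful.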