ParaShoot: A Hebrew Question Answering Dataset
- URL: http://arxiv.org/abs/2109.11314v1
- Date: Thu, 23 Sep 2021 11:59:38 GMT
- Title: ParaShoot: A Hebrew Question Answering Dataset
- Authors: Omri Keren and Omer Levy
- Abstract summary: ParaShoot is the first question-answering dataset in modern Hebrew.
We provide the first baseline results using recently-released BERT-style models for Hebrew.
- Score: 22.55706811131828
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: NLP research in Hebrew has largely focused on morphology and syntax, where
rich annotated datasets in the spirit of Universal Dependencies are available.
Semantic datasets, however, are in short supply, hindering crucial advances in
the development of NLP technology in Hebrew. In this work, we present
ParaShoot, the first question answering dataset in modern Hebrew. The dataset
follows the format and crowdsourcing methodology of SQuAD, and contains
approximately 3000 annotated examples, similar to other question-answering
datasets in low-resource languages. We provide the first baseline results using
recently-released BERT-style models for Hebrew, showing that there is
significant room for improvement on this task.
Related papers
- HeSum: a Novel Dataset for Abstractive Text Summarization in Hebrew [12.320161893898735]
HeSum is a benchmark specifically designed for abstractive text summarization in Modern Hebrew.
HeSum consists of 10,000 article-summary pairs sourced from Hebrew news websites written by professionals.
Linguistic analysis confirms HeSum's high abstractness and unique morphological challenges.
arXiv Detail & Related papers (2024-06-06T09:36:14Z) - Can a Multichoice Dataset be Repurposed for Extractive Question Answering? [52.28197971066953]
We repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA)
We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA).
Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z) - Aya Dataset: An Open-Access Collection for Multilingual Instruction
Tuning [49.79783940841352]
Existing datasets are almost all in the English language.
We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions.
We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z) - Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - ivrit.ai: A Comprehensive Dataset of Hebrew Speech for AI Research and
Development [0.0]
ivrit.ai offers a substantial compilation of Hebrew speech across various contexts.
The dataset stands out for its legal accessibility, permitting use at no cost.
Future efforts aim to expand ivrit.ai further, thereby advancing Hebrew's standing in AI research and technology.
arXiv Detail & Related papers (2023-07-17T04:19:30Z) - PAXQA: Generating Cross-lingual Question Answering Examples at Training
Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts.
We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z) - Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z) - A Second Wave of UD Hebrew Treebanking and Cross-Domain Parsing [8.373151777137792]
This paper presents a new, freely available UD treebank of Hebrew from a range of topics selected from Hebrew Wikipedia.
In addition to introducing the corpus and evaluating the quality of its annotations, we deploy automatic validation tools based on grew.
We obtain new state-of-the-art (SOTA) results on UD NLP tasks, using a combination of the latest language modelling and some incremental improvements to existing transformer based approaches.
arXiv Detail & Related papers (2022-10-14T14:52:07Z) - WANLI: Worker and AI Collaboration for Natural Language Inference
Dataset Creation [101.00109827301235]
We introduce a novel paradigm for dataset creation based on human and machine collaboration.
We use dataset cartography to automatically identify examples that demonstrate challenging reasoning patterns, and instruct GPT-3 to compose new examples with similar patterns.
The resulting dataset, WANLI, consists of 108,357 natural language inference (NLI) examples that present unique empirical strengths.
arXiv Detail & Related papers (2022-01-16T03:13:49Z) - AlephBERT:A Hebrew Large Pre-Trained Language Model to Start-off your
Hebrew NLP Application With [7.345047237652976]
Large Pre-trained Language Models (PLMs) have become ubiquitous in the development of language understanding technology.
While advances reported for English using PLMs are unprecedented, reported advances using PLMs in Hebrew are few and far between.
arXiv Detail & Related papers (2021-04-08T20:51:29Z) - Low resource language dataset creation, curation and classification:
Setswana and Sepedi -- Extended Abstract [2.3801001093799115]
We create datasets that are focused on news headlines for Setswana and Sepedi.
We propose baselines for classification, and investigate an approach on data augmentation better suited to low-resourced languages.
arXiv Detail & Related papers (2020-03-30T18:03:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.