A Part-of-Speech Tagger for Yiddish
- URL: http://arxiv.org/abs/2204.01175v2
- Date: Fri, 18 Aug 2023 16:56:31 GMT
- Title: A Part-of-Speech Tagger for Yiddish
- Authors: Seth Kulick, Neville Ryant, Beatrice Santorini, Joel Wallenberg, Assaf
Urieli
- Abstract summary: This is the first step in a larger project of automatically assigning part-of-speech tags and syntactic structure to Yiddish text.
We combine two resources for the current work - an 80K-word subset of the Penn Parsed Corpus of Historical Yiddish (PPCHY) and 650 million words of OCR'd Yiddish text from the Yiddish Book Center (YBC).
We present some evidence that even simple non-contextualized embeddings trained on YBC are able to capture the relationships among spelling variants without the need to first "standardize" the corpus.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We describe the construction and evaluation of a part-of-speech tagger for
Yiddish. This is the first step in a larger project of automatically assigning
part-of-speech tags and syntactic structure to Yiddish text for purposes of
linguistic research. We combine two resources for the current work - an
80K-word subset of the Penn Parsed Corpus of Historical Yiddish (PPCHY) and 650
million words of OCR'd Yiddish text from the Yiddish Book Center (YBC). Yiddish
orthography in the YBC corpus has many spelling inconsistencies, and we present
some evidence that even simple non-contextualized embeddings trained on YBC are
able to capture the relationships among spelling variants without the need to
first "standardize" the corpus. We also use YBC for continued pretraining of
contextualized embeddings, which are then integrated into a tagger model trained
and evaluated on the PPCHY. We evaluate the tagger performance on a 10-fold
cross-validation split, showing that the use of the YBC text for the
contextualized embeddings improves tagger performance. We conclude by
discussing some next steps, including the need for additional annotated
training and test data.
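The claim about spelling variants can be made concrete with a minimal sketch: train simple skip-gram embeddings on raw, non-standardized text and inspect a word form's nearest neighbors. Everything below is an assumption for illustration, not the paper's code: the corpus path, the gensim hyperparameters, and the example query word are placeholders.
```python
# Minimal sketch (assumed setup, not the paper's code): train simple
# non-contextualized skip-gram embeddings on raw OCR'd text, with no
# spelling standardization, then check whether variant spellings of a
# word surface among its nearest neighbors.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Placeholder path: one whitespace-tokenized sentence per line.
sentences = LineSentence("ybc_corpus.txt")

model = Word2Vec(
    sentences,
    vector_size=100,  # small, "simple" embeddings
    window=5,
    min_count=5,      # drop very rare forms (much OCR noise falls here)
    sg=1,             # skip-gram
    workers=4,
    epochs=5,
)

# Hypothetical query word: if variant spellings occur in similar
# contexts, they should appear among each other's nearest neighbors.
for neighbor, score in model.wv.most_similar("וואס", topn=10):
    print(f"{neighbor}\t{score:.3f}")
```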
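Likewise, the evaluation protocol (10-fold cross-validation of per-token tagging accuracy) can be sketched with a trivial most-frequent-tag baseline standing in for the paper's embedding-based tagger; the toy data and default tag below are invented for illustration.
```python
# Minimal sketch of 10-fold cross-validated tagging accuracy (assumed
# protocol details, not the paper's code). A most-frequent-tag baseline
# stands in for the actual tagger; sentences are lists of (word, tag).
from collections import Counter, defaultdict
from sklearn.model_selection import KFold

# Toy stand-in for the annotated PPCHY subset.
corpus = [
    [("in", "P"), ("der", "D"), ("shtub", "N")],
    [("der", "D"), ("man", "N"), ("geyt", "V")],
    [("er", "PRO"), ("geyt", "V"), ("in", "P"), ("shtub", "N")],
] * 10  # repeated so every fold is non-empty

accuracies = []
kf = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(corpus):
    # "Train": record the most frequent tag for each word form.
    counts = defaultdict(Counter)
    for i in train_idx:
        for word, tag in corpus[i]:
            counts[word][tag] += 1
    model = {w: c.most_common(1)[0][0] for w, c in counts.items()}

    # Evaluate per-token accuracy; unseen words get a default tag.
    correct = total = 0
    for i in test_idx:
        for word, gold in corpus[i]:
            correct += int(model.get(word, "N") == gold)
            total += 1
    accuracies.append(correct / total)

print(f"mean accuracy over 10 folds: {sum(accuracies) / len(accuracies):.3f}")
```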
Related papers
- Multi-Modal Multi-Granularity Tokenizer for Chu Bamboo Slip Scripts [65.10991154918737]
This study focuses on the Chu bamboo slip (CBS) script used during the Spring and Autumn and Warring States period (771-256 BCE) in Ancient China.
Our tokenizer first adopts character detection to locate character boundaries, and then conducts character recognition at both the character and sub-character levels.
To support the academic community, we have also assembled the first large-scale dataset of CBSs with over 100K annotated character image scans.
arXiv Detail & Related papers (2024-09-02T07:42:55Z)
- Do Pretrained Contextual Language Models Distinguish between Hebrew Homograph Analyses? [12.631897904322676]
We study the extent to which Hebrew homographs can be disambiguated and analyzed using pre-trained language models.
We show that contemporary Hebrew contextualized embeddings outperform non-contextualized embeddings.
We also show that these embeddings are equally effective for homographs of both balanced and skewed distributions.
arXiv Detail & Related papers (2024-05-11T21:50:56Z)
- Homonym Sense Disambiguation in the Georgian Language [49.1574468325115]
This research proposes a novel approach to the Word Sense Disambiguation (WSD) task in the Georgian language.
It is based on supervised fine-tuning of a pre-trained Large Language Model (LLM) on a dataset formed by filtering the Georgian Common Crawls corpus.
arXiv Detail & Related papers (2024-04-24T21:48:43Z)
- Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts [50.00305136008848]
We propose a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera.
For both English--Japanese and English--Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets.
This study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits.
arXiv Detail & Related papers (2023-11-07T03:50:25Z)
- HistRED: A Historical Document-Level Relation Extraction Dataset [32.96963890713529]
HistRED is constructed from Yeonhaengnok, a collection of records originally written in Hanja, the classical Chinese writing system.
HistRED provides bilingual annotations such that RE can be performed on Korean and Hanja texts.
We propose a bilingual RE model that leverages both Korean and Hanja contexts to predict relations between entities.
arXiv Detail & Related papers (2023-07-10T00:24:27Z)
- UzbekTagger: The rule-based POS tagger for Uzbek language [0.0]
This research paper presents a part-of-speech annotated dataset and tagger tool for the low-resource Uzbek language.
The dataset includes 12 tags, which were used to develop a rule-based POS-tagger tool.
The presented dataset is the first of its kind to be made publicly available for Uzbek, and the POS-tagger tool can also serve as a base for other closely related Turkic languages.
arXiv Detail & Related papers (2023-01-30T07:40:45Z)
- RuCoCo: a new Russian corpus with coreference annotation [69.3939291118954]
We present a new corpus with coreference annotation, the Russian Coreference Corpus (RuCoCo).
RuCoCo contains news texts in Russian, part of which were annotated from scratch, and for the rest the machine-generated annotations were refined by human annotators.
The size of our corpus is one million words and around 150,000 mentions.
arXiv Detail & Related papers (2022-06-10T07:50:09Z)
- Penn-Helsinki Parsed Corpus of Early Modern English: First Parsing Results and Analysis [2.8749014299466444]
We present the first parsing results on the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), a 1.9 million word treebank.
We describe key features of PPCEME that make it challenging for parsing, including a larger and more varied set of function tags than in the Penn Treebank.
arXiv Detail & Related papers (2021-12-15T23:56:21Z)
- BiSECT: Learning to Split and Rephrase Sentences with Bitexts [25.385804867037937]
We introduce a novel dataset and a new model for this 'split and rephrase' task.
BiSECT training data consists of 1 million long English sentences paired with shorter, meaning-equivalent English sentences.
We categorize examples in our corpus, and use these categories in a novel model that allows us to target specific regions of the input sentence to be split and edited.
arXiv Detail & Related papers (2021-09-10T17:30:14Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.