Prague Dependency Treebank -- Consolidated 1.0
- URL: http://arxiv.org/abs/2006.03679v1
- Date: Fri, 5 Jun 2020 20:52:55 GMT
- Title: Prague Dependency Treebank -- Consolidated 1.0
- Authors: Jan Hajič, Eduard Bejček, Jaroslava Hlaváčová, Marie
Mikulová, Milan Straka, Jan Štěpánek, Barbora
Štěpánková
- Abstract summary: Prague Dependency Treebank-Consolidated 1.0 (PDT-C 1.0) contains four different datasets of Czech, uniformly annotated using the standard PDT scheme.
Altogether, the treebank contains around 180,000 sentences with their morphological, surface and deep syntactic annotation.
- Score: 1.7147127043116672
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a richly annotated and genre-diversified language resource, the
Prague Dependency Treebank-Consolidated 1.0 (PDT-C 1.0), the purpose of which
is - as has always been the case for the family of the Prague Dependency
Treebanks - to serve both as training data for various types of NLP tasks and
as a resource for linguistically oriented research. PDT-C 1.0 contains four different
datasets of Czech, uniformly annotated using the standard PDT scheme (albeit
not everything is annotated manually, as we describe in detail here). The texts
come from different sources: daily newspaper articles, a Czech translation of the
Wall Street Journal, transcribed dialogs and a small amount of user-generated,
short, often non-standard language segments typed into a web translator.
Altogether, the treebank contains around 180,000 sentences with their
morphological, surface and deep syntactic annotation. The diversity of the
texts and annotations should serve NLP applications well, and it makes the corpus
an invaluable resource for linguistic research, including comparative studies
regarding texts of different genres. The corpus is publicly and freely
available.
Related papers
- DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages [49.38663048447942]
We propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on varieties.
This allows for a comprehensive evaluation of NLP system performance on different language varieties.
We provide substantial evidence of performance disparities between standard and non-standard language varieties.
arXiv Detail & Related papers (2024-03-16T20:18:36Z)
- MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank [56.810282574817414]
We present the first multi-dialect Bavarian treebank (MaiBaam), manually annotated with part-of-speech and syntactic dependency information in Universal Dependencies (UD).
We highlight the morphosyntactic differences between the closely-related Bavarian and German and showcase the rich variability of speakers' orthographies.
Our corpus includes 15k tokens, covering dialects from all Bavarian-speaking areas spanning three countries.
arXiv Detail & Related papers (2024-03-15T13:33:10Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- Examining Cross-lingual Contextual Embeddings with Orthogonal Structural Probes [0.2538209532048867]
A novel Orthogonal Structural Probe (Limisiewicz and Mareček, 2021) allows us to answer this question for specific linguistic features.
We evaluate syntactic (UD) and lexical (WordNet) structural information encoded in mBERT's contextual representations for nine diverse languages.
We successfully apply our findings to zero-shot and few-shot cross-lingual parsing.
arXiv Detail & Related papers (2021-09-10T15:03:11Z)
- ParCourE: A Parallel Corpus Explorer for a Massively Multilingual Corpus [2.7036498789349244]
Researching typological properties of languages is fundamental for progress in multilingual NLP.
We provide ParCourE, an online tool for browsing a word-aligned parallel corpus covering 1334 languages.
arXiv Detail & Related papers (2021-07-14T12:16:21Z)
- A Multi-Perspective Architecture for Semantic Code Search [58.73778219645548]
We propose a novel multi-perspective cross-lingual neural framework for code--text matching.
Our experiments on the CoNaLa dataset show that our proposed model yields better performance than previous approaches.
arXiv Detail & Related papers (2020-05-06T04:46:11Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
- Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection [33.86322085911299]
Universal Dependencies is an open community effort to create cross-linguistically consistent treebank annotation for many languages.
We describe version 2 of the guidelines (UD v2), discuss the major changes from UD v1 to UD v2, and give an overview of the currently available treebanks for 90 languages.
arXiv Detail & Related papers (2020-04-22T15:38:18Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)
- Parsing Early Modern English for Linguistic Search [3.927039542429003]
We investigate whether advances in NLP make it possible to vastly increase the size of data usable for research in historical syntax.
This brings together many of the usual tools in NLP - word embeddings, tagging, and parsing - in the service of linguistic queries over automatically annotated corpora.
We train a part-of-speech (POS) tagger and parser on a corpus of historical English, using ELMo embeddings trained over a billion words of similar text.
arXiv Detail & Related papers (2020-02-24T21:04:51Z)