UD-English-CHILDES: A Collected Resource of Gold and Silver Universal Dependencies Trees for Child Language Interactions
- URL: http://arxiv.org/abs/2504.20304v1
- Date: Mon, 28 Apr 2025 23:20:36 GMT
- Title: UD-English-CHILDES: A Collected Resource of Gold and Silver Universal Dependencies Trees for Child Language Interactions
- Authors: Xiulin Yang, Zhuoxuan Ju, Lanni Bu, Zoey Liu, Nathan Schneider
- Abstract summary: This paper introduces UD-English-CHILDES, the first officially released Universal Dependencies (UD) treebank with consistent and unified annotation guidelines. Our corpus harmonizes annotations from 11 children and their caregivers, totaling over 48k sentences.
- Score: 9.218991698992815
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: CHILDES is a widely used resource of transcribed child and child-directed speech. This paper introduces UD-English-CHILDES, the first officially released Universal Dependencies (UD) treebank derived from previously dependency-annotated CHILDES data with consistent and unified annotation guidelines. Our corpus harmonizes annotations from 11 children and their caregivers, totaling over 48k sentences. We validate existing gold-standard annotations under the UD v2 framework and provide an additional 1M silver-standard sentences, offering a consistent resource for computational and linguistic research.
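UD treebanks such as UD-English-CHILDES are distributed as CoNLL-U files, in which every token line carries the ten standard UD columns (ID, FORM, LEMMA, UPOS, HEAD, DEPREL, ...). The following is a minimal sketch of iterating over such a file with the third-party `conllu` Python library; the file name is a hypothetical placeholder for whichever gold or silver split of the treebank is being read.

```python
# Minimal sketch: streaming a UD-style CoNLL-U file with the `conllu` library
# (pip install conllu). The file name below is a placeholder, not an official
# UD-English-CHILDES release path.
from conllu import parse_incr

with open("en_childes-ud-dev.conllu", encoding="utf-8") as f:
    for sentence in parse_incr(f):
        # Each token is a mapping over the CoNLL-U columns.
        for token in sentence:
            print(token["id"], token["form"], token["upos"],
                  token["head"], token["deprel"])
        break  # print only the first sentence of the file
```

Because `parse_incr` streams one sentence at a time rather than loading the whole file, the same loop scales to the roughly 1M silver-standard sentences mentioned in the abstract.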
Related papers
- MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank [56.810282574817414]
We present the first multi-dialect Bavarian treebank (MaiBaam), manually annotated with part-of-speech and syntactic dependency information in Universal Dependencies (UD).
We highlight the morphosyntactic differences between the closely related Bavarian and German and showcase the rich variability of speakers' orthographies.
Our corpus includes 15k tokens, covering dialects from all Bavarian-speaking areas spanning three countries.
arXiv Detail & Related papers (2024-03-15T13:33:10Z)
- Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark [39.01204607174688]
We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages.
UNER v1 contains 18 datasets annotated with named entities in a cross-lingually consistent schema across 12 diverse languages.
arXiv Detail & Related papers (2023-11-15T17:09:54Z)
- Constructing Code-mixed Universal Dependency Forest for Unbiased Cross-lingual Relation Extraction [92.84968716013783]
Cross-lingual relation extraction (XRE) aggressively leverages the language-consistent structural features from the universal dependency (UD) resource.
We investigate an unbiased UD-based XRE transfer by constructing a type of code-mixed UD forest.
With such forest features, the gaps of UD-based XRE between the training and predicting phases can be effectively closed.
arXiv Detail & Related papers (2023-05-20T18:24:06Z)
- NusaCrowd: Open Source Initiative for Indonesian NLP Resources [104.5381571820792]
NusaCrowd is a collaborative initiative to collect and unify existing resources for Indonesian languages.
Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
arXiv Detail & Related papers (2022-12-19T17:28:22Z)
- Developing Universal Dependency Treebanks for Magahi and Braj [0.7349727826230861]
In this paper, we discuss the development of treebanks for two low-resourced Indian languages - Magahi and Braj.
The Magahi treebank contains 945 sentences and the Braj treebank around 500, marked with lemmas, part-of-speech tags, morphological features and universal dependencies.
arXiv Detail & Related papers (2022-04-26T23:43:41Z)
- Cross-linguistically Consistent Semantic and Syntactic Annotation of Child-directed Speech [27.657676278734534]
This paper proposes a methodology for constructing such corpora of child-directed speech paired with sentential logical forms.
The approach enforces a cross-linguistically consistent representation, building on recent advances in dependency representation and semantic parsing.
arXiv Detail & Related papers (2021-09-22T18:17:06Z)
- Multilingual Irony Detection with Dependency Syntax and Neural Models [61.32653485523036]
This work focuses on the contribution of syntactic knowledge, exploiting linguistic resources in which syntax is annotated according to the Universal Dependencies scheme.
The results suggest that fine-grained dependency-based syntactic information is informative for the detection of irony.
arXiv Detail & Related papers (2020-11-11T11:22:05Z)
- Treebanking User-Generated Content: a UD Based Overview of Guidelines, Corpora and Unified Recommendations [58.50167394354305]
This article discusses the main linguistic phenomena that cause difficulties in the analysis of user-generated texts found on the web and in social media.
It proposes a set of tentative UD-based annotation guidelines to promote consistent treatment of the particular phenomena found in these types of texts.
arXiv Detail & Related papers (2020-11-03T23:34:42Z)
- Prague Dependency Treebank -- Consolidated 1.0 [1.7147127043116672]
Prague Dependency Treebank - Consolidated 1.0 (PDT-C 1.0) contains four different datasets of Czech, uniformly annotated using the standard PDT scheme.
Altogether, the treebank contains around 180,000 sentences with their morphological, surface and deep syntactic annotation.
arXiv Detail & Related papers (2020-06-05T20:52:55Z)
- Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection [33.86322085911299]
Universal Dependencies is an open community effort to create cross-linguistically consistent treebank annotation for many languages.
We describe version 2 of the guidelines (UD v2), discuss the major changes from UD v1 to UD v2, and give an overview of the currently available treebanks for 90 languages.
arXiv Detail & Related papers (2020-04-22T15:38:18Z)
- Word Sense Disambiguation for 158 Languages using Word Embeddings Only [80.79437083582643]
Disambiguation of word senses in context is easy for humans, but a major challenge for automatic approaches.
We present a method that takes as input a standard pre-trained word embedding model and induces a fully-fledged word sense inventory.
We use this method to induce a collection of sense inventories for 158 languages on the basis of the original pre-trained fastText word embeddings.
arXiv Detail & Related papers (2020-03-14T14:50:04Z)