Extending dependencies to the taggedPBC: Word order in transitive clauses
- URL: http://arxiv.org/abs/2506.06785v1
- Date: Sat, 07 Jun 2025 12:52:45 GMT
- Title: Extending dependencies to the taggedPBC: Word order in transitive clauses
- Authors: Hiram Ring
- Abstract summary: This paper reports on a CoNLLU-formatted version of the dataset which transfers dependency information along with POS tags to all languages in the taggedPBC. The dependency-annotated corpora are also made available for research and collaboration via GitHub.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The taggedPBC (Ring 2025a) contains more than 1,800 sentences of pos-tagged parallel text data from over 1,500 languages, representing 133 language families and 111 isolates. While this dwarfs previously available resources, and the POS tags achieve decent accuracy, allowing for predictive crosslinguistic insights (Ring 2025b), the dataset was not initially annotated for dependencies. This paper reports on a CoNLLU-formatted version of the dataset which transfers dependency information along with POS tags to all languages in the taggedPBC. Although there are various concerns regarding the quality of the tags and the dependencies, word order information derived from this dataset regarding the position of arguments and predicates in transitive clauses correlates with expert determinations of word order in three typological databases (WALS, Grambank, Autotyp). This highlights the usefulness of corpus-based typological approaches (as per Baylor et al. 2023; Bjerva 2024) for extending comparisons of discrete linguistic categories, and suggests that important insights can be gained even from noisy data, given sufficient annotation. The dependency-annotated corpora are also made available for research and collaboration via GitHub.
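To make the word-order extraction concrete: given CoNLL-U files like those released with the dataset, the position of core arguments relative to the predicate in transitive clauses can be read directly off the dependency annotation. The sketch below is a minimal illustration using the `conllu` Python library; the file name is hypothetical, and the simplified logic (main verb = syntactic root, subject = nsubj, object = obj) is an assumption that only approximates whatever procedure the paper actually uses.

```python
# Minimal sketch: tally transitive-clause word order from a CoNLL-U file.
# Assumptions: "eng-x-bible.conllu" is a hypothetical file name, and this
# simplified classification is not the paper's exact method.
from collections import Counter
from conllu import parse_incr  # pip install conllu

def clause_order(sentence):
    """Return e.g. 'SVO' for a simple transitive clause, else None."""
    # Main predicate: the syntactic root, restricted to verbs here.
    root = next((t for t in sentence
                 if t["head"] == 0 and t["upos"] == "VERB"), None)
    if root is None:
        return None
    subj = next((t for t in sentence
                 if t["head"] == root["id"] and t["deprel"] == "nsubj"), None)
    obj = next((t for t in sentence
                if t["head"] == root["id"] and t["deprel"] == "obj"), None)
    if subj is None or obj is None:
        return None  # not a (simple) transitive clause
    # Order S, V, O by their linear position in the sentence.
    order = sorted([("S", subj["id"]), ("V", root["id"]), ("O", obj["id"])],
                   key=lambda pair: pair[1])
    return "".join(label for label, _ in order)

counts = Counter()
with open("eng-x-bible.conllu", encoding="utf-8") as f:
    for sentence in parse_incr(f):
        label = clause_order(sentence)
        if label:
            counts[label] += 1

print(counts.most_common())
```

Running something like this over each language's file and taking the most frequent label yields a corpus-derived word-order estimate that can then be compared against the expert classifications in WALS, Grambank, and Autotyp, which is the kind of comparison the abstract describes.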
Related papers
- The Cognate Data Bottleneck in Language Phylogenetics [49.1574468325115]
Phylogenetic data analysis approaches that require larger datasets cannot be applied to cognate data. It remains an open question how, and whether, these computational approaches can be applied in historical linguistics.
arXiv Detail & Related papers (2025-07-01T16:14:20Z)
- The taggedPBC: Annotating a massive parallel corpus for crosslinguistic investigations [0.0]
The taggedPBC contains more than 1,800 sentences of pos-tagged parallel text data from over 1,500 languages. The accuracy of tags in this dataset is shown to correlate well with existing SOTA taggers for high-resource languages. A novel measure derived from this dataset, the N1 ratio, correlates with expert determinations of word order in three typological databases.
arXiv Detail & Related papers (2025-05-18T22:13:32Z)
- Cross-lingual Contextualized Phrase Retrieval [63.80154430930898]
We propose a new task formulation of dense retrieval, cross-lingual contextualized phrase retrieval.
We train our Cross-lingual Contextualized Phrase Retriever (CCPR) using contrastive learning.
On the phrase retrieval task, CCPR surpasses baselines by a significant margin, achieving a top-1 accuracy that is at least 13 points higher.
arXiv Detail & Related papers (2024-03-25T14:46:51Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Semantic Parsing for Conversational Question Answering over Knowledge Graphs [63.939700311269156]
We develop a dataset where user questions are annotated with SPARQL parses and system answers correspond to execution results thereof.
We present two different semantic parsing approaches and highlight the challenges of the task.
Our dataset and models are released at https://github.com/Edinburgh/SPICE.
arXiv Detail & Related papers (2023-01-28T14:45:11Z)
- Query Expansion Using Contextual Clue Sampling with Language Models [69.51976926838232]
We propose a combination of an effective filtering strategy and fusion of the retrieved documents based on the generation probability of each context.
Our lexical matching based approach achieves a similar top-5/top-20 retrieval accuracy and higher top-100 accuracy compared with the well-established dense retrieval model DPR.
For end-to-end QA, the reader model also benefits from our method and achieves the highest Exact-Match score against several competitive baselines.
arXiv Detail & Related papers (2022-10-13T15:18:04Z)
- xDBTagger: Explainable Natural Language Interface to Databases Using Keyword Mappings and Schema Graph [0.17188280334580192]
Translating natural language queries (NLQ) into structured query language (SQL) in interfaces to relational databases is a challenging task.
We propose xDBTagger, an explainable hybrid translation pipeline that explains the decisions made along the way to the user both textually and visually.
xDBTagger is effective in terms of accuracy and translates queries up to 10,000 times more efficiently than other state-of-the-art pipeline-based systems.
arXiv Detail & Related papers (2022-10-07T18:17:09Z)
- What Makes Sentences Semantically Related: A Textual Relatedness Dataset and Empirical Study [31.062129406113588]
We introduce a dataset for Semantic Textual Relatedness, STR-2022, that has 5,500 English sentence pairs manually annotated.
We show that human intuition regarding relatedness of sentence pairs is highly reliable, with a repeat annotation correlation of 0.84.
We also show the utility of STR-2022 for evaluating automatic methods of sentence representation and for various downstream NLP tasks.
arXiv Detail & Related papers (2021-10-10T16:23:54Z)
- GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and Event Extraction [107.8262586956778]
We introduce graph convolutional networks (GCNs) with universal dependency parses to learn language-agnostic sentence representations.
GCNs struggle to model words with long-range dependencies or words that are not directly connected in the dependency tree.
We propose to utilize the self-attention mechanism to learn the dependencies between words with different syntactic distances.
arXiv Detail & Related papers (2020-10-06T20:30:35Z)
- The Annotation Guideline of LST20 Corpus [0.3161954199291541]
The dataset follows the CoNLL-2003-style format for ease of use.
At a large scale, it consists of 3,164,864 words, 288,020 named entities, 248,962 clauses, and 74,180 sentences.
All 3,745 documents are also annotated with 15 news genres.
arXiv Detail & Related papers (2020-08-12T01:16:45Z)
- Prague Dependency Treebank -- Consolidated 1.0 [1.7147127043116672]
Prague Dependency Treebank - Consolidated 1.0 (PDT-C 1.0) contains four different datasets of Czech, uniformly annotated using the standard PDT scheme.
Altogether, the treebank contains around 180,000 sentences with their morphological, surface and deep syntactic annotation.
arXiv Detail & Related papers (2020-06-05T20:52:55Z)