Proposing TAGbank as a Corpus of Tree-Adjoining Grammar Derivations
- URL: http://arxiv.org/abs/2504.05226v2
- Date: Sun, 13 Apr 2025 00:01:05 GMT
- Title: Proposing TAGbank as a Corpus of Tree-Adjoining Grammar Derivations
- Authors: Jungyeul Park,
- Abstract summary: We introduce TAGbank, a corpus of TAG derivations automatically extracted from existing syntactic treebanks.<n>This paper outlines a methodology for mapping phrase-structure annotations to TAG derivations, leveraging the generative power of TAG to support parsing, grammar induction, and semantic analysis.<n>We also discuss the challenges involved in the extraction process, including ensuring consistency across treebank schemes and dealing with language-specific syntactic idiosyncrasies.
- Score: 1.0619039878979954
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The development of lexicalized grammars, particularly Tree-Adjoining Grammar (TAG), has significantly advanced our understanding of syntax and semantics in natural language processing (NLP). While existing syntactic resources like the Penn Treebank and Universal Dependencies offer extensive annotations for phrase-structure and dependency parsing, there is a lack of large-scale corpora grounded in lexicalized grammar formalisms. To address this gap, we introduce TAGbank, a corpus of TAG derivations automatically extracted from existing syntactic treebanks. This paper outlines a methodology for mapping phrase-structure annotations to TAG derivations, leveraging the generative power of TAG to support parsing, grammar induction, and semantic analysis. Our approach builds on the work of CCGbank, extending it to incorporate the unique structural properties of TAG, including its transparent derivation trees and its ability to capture long-distance dependencies. We also discuss the challenges involved in the extraction process, including ensuring consistency across treebank schemes and dealing with language-specific syntactic idiosyncrasies. Finally, we propose the future extension of TAGbank to include multilingual corpora, focusing on the Penn Korean and Penn Chinese Treebanks, to explore the cross-linguistic application of TAG's formalism. By providing a robust, derivation-based resource, TAGbank aims to support a wide range of computational tasks and contribute to the theoretical understanding of TAG's generative capacity.
Related papers
- Integrating Supertag Features into Neural Discontinuous Constituent Parsing [0.0]
Traditional views of constituency demand that constituents consist of adjacent words, common in languages like German.
Transition-based parsing produces trees given raw text input using supervised learning on large annotated corpora.
arXiv Detail & Related papers (2024-10-11T12:28:26Z) - Character-Level Chinese Dependency Parsing via Modeling Latent Intra-Word Structure [11.184330703168893]
This paper proposes modeling latent internal structures within words in Chinese.
A constrained Eisner algorithm is implemented to ensure the compatibility of character-level trees.
A detailed analysis reveals that a coarse-to-fine parsing strategy empowers the model to predict more linguistically plausible intra-word structures.
arXiv Detail & Related papers (2024-06-06T06:23:02Z) - Assessment of Pre-Trained Models Across Languages and Grammars [7.466159270333272]
We aim to recover constituent and dependency structures by casting parsing as sequence labeling.
Our results show that pre-trained word vectors do not favor constituency representations of syntax over dependencies.
occurrence of a language in the pretraining data is more important than the amount of task data when recovering syntax from the word vectors.
arXiv Detail & Related papers (2023-09-20T09:23:36Z) - mCL-NER: Cross-Lingual Named Entity Recognition via Multi-view
Contrastive Learning [54.523172171533645]
Cross-lingual named entity recognition (CrossNER) faces challenges stemming from uneven performance due to the scarcity of multilingual corpora.
We propose Multi-view Contrastive Learning for Cross-lingual Named Entity Recognition (mCL-NER)
Our experiments on the XTREME benchmark, spanning 40 languages, demonstrate the superiority of mCL-NER over prior data-driven and model-based approaches.
arXiv Detail & Related papers (2023-08-17T16:02:29Z) - Prompting Language Models for Linguistic Structure [73.11488464916668]
We present a structured prompting approach for linguistic structured prediction tasks.
We evaluate this approach on part-of-speech tagging, named entity recognition, and sentence chunking.
We find that while PLMs contain significant prior knowledge of task labels due to task leakage into the pretraining corpus, structured prompting can also retrieve linguistic structure with arbitrary labels.
arXiv Detail & Related papers (2022-11-15T01:13:39Z) - Incorporating Constituent Syntax for Coreference Resolution [50.71868417008133]
We propose a graph-based method to incorporate constituent syntactic structures.
We also explore to utilise higher-order neighbourhood information to encode rich structures in constituent trees.
Experiments on the English and Chinese portions of OntoNotes 5.0 benchmark show that our proposed model either beats a strong baseline or achieves new state-of-the-art performance.
arXiv Detail & Related papers (2022-02-22T07:40:42Z) - Combining Improvements for Exploiting Dependency Trees in Neural
Semantic Parsing [1.0437764544103274]
In this paper, we examine three methods to incorporate such dependency information in a Transformer based semantic parsing system.
We first replace standard self-attention heads in the encoder with parent-scaled self-attention (PASCAL) heads.
Later, we insert the constituent attention (CA) to the encoder, which adds an extra constraint to attention heads that can better capture the inherent dependency structure of input sentences.
arXiv Detail & Related papers (2021-12-25T03:41:42Z) - Lexically-constrained Text Generation through Commonsense Knowledge
Extraction and Injection [62.071938098215085]
We focus on the Commongen benchmark, wherein the aim is to generate a plausible sentence for a given set of input concepts.
We propose strategies for enhancing the semantic correctness of the generated text.
arXiv Detail & Related papers (2020-12-19T23:23:40Z) - MEGA RST Discourse Treebanks with Structure and Nuclearity from Scalable
Distant Sentiment Supervision [30.615883375573432]
We present a novel methodology to automatically generate discourse treebanks using distant supervision from sentiment-annotated datasets.
Our approach generates trees incorporating structure and nuclearity for documents of arbitrary length by relying on an efficient beam-search strategy.
Experiments indicate that a discourse trained on our MEGA-DT treebank delivers promising inter-domain performance gains.
arXiv Detail & Related papers (2020-11-05T18:22:38Z) - GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and
Event Extraction [107.8262586956778]
We introduce graph convolutional networks (GCNs) with universal dependency parses to learn language-agnostic sentence representations.
GCNs struggle to model words with long-range dependencies or are not directly connected in the dependency tree.
We propose to utilize the self-attention mechanism to learn the dependencies between words with different syntactic distances.
arXiv Detail & Related papers (2020-10-06T20:30:35Z) - Constructing a Family Tree of Ten Indo-European Languages with
Delexicalized Cross-linguistic Transfer Patterns [57.86480614673034]
We formalize the delexicalized transfer as interpretable tree-to-string and tree-to-tree patterns.
This allows us to quantitatively probe cross-linguistic transfer and extend inquiries of Second Language Acquisition.
arXiv Detail & Related papers (2020-07-17T15:56:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.