Related papers: An Attempt to Develop a Neural Parser based on Simplified Head-Driven Phrase Structure Grammar on Vietnamese

An Attempt to Develop a Neural Parser based on Simplified Head-Driven Phrase Structure Grammar on Vietnamese

URL: http://arxiv.org/abs/2411.17270v1
Date: Tue, 26 Nov 2024 09:46:32 GMT
Title: An Attempt to Develop a Neural Parser based on Simplified Head-Driven Phrase Structure Grammar on Vietnamese
Authors: Duc-Vu Nguyen, Thang Chau Phan, Quoc-Nam Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen,
Abstract summary: Existing Vietnamese corpora did not adhere to simplified Head-Driven Phrase Structure Grammar rules. We modified the first Penn Treebank by replacing it with the PhoBERT or XLM-Roa models, which can encode Vietnamese texts. Our experiments showed that the simplified HPSG Neural corpora achieved a new state-of-the-art F-score of 82% for constituency parsing.
Score: 1.4990724156326336
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: In this paper, we aimed to develop a neural parser for Vietnamese based on simplified Head-Driven Phrase Structure Grammar (HPSG). The existing corpora, VietTreebank and VnDT, had around 15% of constituency and dependency tree pairs that did not adhere to simplified HPSG rules. To attempt to address the issue of the corpora not adhering to simplified HPSG rules, we randomly permuted samples from the training and development sets to make them compliant with simplified HPSG. We then modified the first simplified HPSG Neural Parser for the Penn Treebank by replacing it with the PhoBERT or XLM-RoBERTa models, which can encode Vietnamese texts. We conducted experiments on our modified VietTreebank and VnDT corpora. Our extensive experiments showed that the simplified HPSG Neural Parser achieved a new state-of-the-art F-score of 82% for constituency parsing when using the same predicted part-of-speech (POS) tags as the self-attentive constituency parser. Additionally, it outperformed previous studies in dependency parsing with a higher Unlabeled Attachment Score (UAS). However, our parser obtained lower Labeled Attachment Score (LAS) scores likely due to our focus on arc permutation without changing the original labels, as we did not consult with a linguistic expert. Lastly, the research findings of this paper suggest that simplified HPSG should be given more attention to linguistic expert when developing treebanks for Vietnamese natural language processing.

Related papers

MorphTok: Morphologically Grounded Tokenization for Indian Languages [23.58043476541051]
Tokenization is a crucial step in NLP, especially with the rise of large language models (LLMs) We propose morphology-aware segmentation as a pre-tokenization step prior to applying subword tokenization. We also introduce Constrained BPE, an extension to the traditional BPE algorithm that incorporates script-specific constraints.
arXiv Detail & Related papers (2025-04-14T15:44:45Z)
DNA-GPT: Divergent N-Gram Analysis for Training-Free Detection of GPT-Generated Text [82.5469544192645]
We propose a novel training-free detection strategy called Divergent N-Gram Analysis (DNA-GPT) By analyzing the differences between the original and new remaining parts through N-gram analysis, we unveil significant discrepancies between the distribution of machine-generated text and human-written text. Results show that our zero-shot approach exhibits state-of-the-art performance in distinguishing between human and GPT-generated text.
arXiv Detail & Related papers (2023-05-27T03:58:29Z)
Cascading and Direct Approaches to Unsupervised Constituency Parsing on Spoken Sentences [67.37544997614646]
We present the first study on unsupervised spoken constituency parsing. The goal is to determine the spoken sentences' hierarchical syntactic structure in the form of constituency parse trees. We show that accurate segmentation alone may be sufficient to parse spoken sentences accurately.
arXiv Detail & Related papers (2023-03-15T17:57:22Z)
Classifiers are Better Experts for Controllable Text Generation [63.17266060165098]
We show that the proposed method significantly outperforms recent PPLM, GeDi, and DExperts on PPL and sentiment accuracy based on the external classifier of generated texts. The same time, it is also easier to implement and tune, and has significantly fewer restrictions and requirements.
arXiv Detail & Related papers (2022-05-15T12:58:35Z)
Penn-Helsinki Parsed Corpus of Early Modern English: First Parsing Results and Analysis [2.8749014299466444]
We present the first parsing results on the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), a 1.9 million word treebank. We describe key features of PPCEME that make it challenging for parsing, including a larger and more varied set of function tags than in the Penn Treebank.
arXiv Detail & Related papers (2021-12-15T23:56:21Z)
Head-driven Phrase Structure Parsing in O($n^3$) Time Complexity [48.683350567504604]
Constituent and dependency parsing, the two classic forms of syntactic parsing, have been found to benefit from joint training and decoding under a uniform formalism. We propose an improved head scorer that helps achieve a novel performance-preserved in $O$($n3$) time complexity.
arXiv Detail & Related papers (2021-05-20T15:33:51Z)
Neural Text Generation with Part-of-Speech Guided Softmax [82.63394952538292]
We propose using linguistic annotation, i.e., part-of-speech (POS), to guide the text generation. We show that our proposed methods can generate more diverse text while maintaining comparable quality.
arXiv Detail & Related papers (2021-05-08T08:53:16Z)
Augmenting Part-of-speech Tagging with Syntactic Information for Vietnamese and Chinese [0.32228025627337864]
We implement the idea to improve word segmentation and part of speech tagging of the Vietnamese language by employing a simplified constituency. Our neural model for joint word segmentation and part-of-speech tagging has the architecture of the syllable-based constituency. This model can be augmented with predicted word boundary and part-of-speech tags by other tools.
arXiv Detail & Related papers (2021-02-24T08:57:02Z)
Heads-up! Unsupervised Constituency Parsing via Self-Attention Heads [27.578115452635625]
We propose a novel fully unsupervised parsing approach that extracts constituency trees from PLM attention heads. We rank transformer attention heads based on their inherent properties, and create an ensemble of high-ranking heads to produce the final tree. Our experiments can also be used as a tool to analyze the grammars PLMs learn implicitly.
arXiv Detail & Related papers (2020-10-19T13:51:40Z)
Towards Instance-Level Parser Selection for Cross-Lingual Transfer of Dependency Parsers [59.345145623931636]
We argue for a novel cross-lingual transfer paradigm: instance-level selection (ILPS) We present a proof-of-concept study focused on instance-level selection in the framework of delexicalized transfer.
arXiv Detail & Related papers (2020-04-16T13:18:55Z)
Is POS Tagging Necessary or Even Helpful for Neural Dependency Parsing? [22.93722845643562]
We show that POS tagging can still significantly improve parsing performance when using the Stack joint framework. Considering that it is much cheaper to annotate POS tags than parse trees, we also investigate the utilization of large-scale heterogeneous POS tag data.
arXiv Detail & Related papers (2020-03-06T13:47:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.