Related papers: Joint Chinese Word Segmentation and Span-based Constituency Parsing

Joint Chinese Word Segmentation and Span-based Constituency Parsing

URL: http://arxiv.org/abs/2211.01638v1
Date: Thu, 3 Nov 2022 08:19:00 GMT
Title: Joint Chinese Word Segmentation and Span-based Constituency Parsing
Authors: Zhicheng Wang, Tianyu Shi, Cong Liu
Abstract summary: This work proposes a method for joint Chinese word segmentation and Span-based Constituency Parsing by adding extra labels to individual Chinese characters on the parse trees. Through experiments, the proposed algorithm outperforms the recent models for joint segmentation and constituency parsing on CTB 5.1.
Score: 11.080040070201608
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In constituency parsing, span-based decoding is an important direction. However, for Chinese sentences, because of their linguistic characteristics, it is necessary to utilize other models to perform word segmentation first, which introduces a series of uncertainties and generally leads to errors in the computation of the constituency tree afterward. This work proposes a method for joint Chinese word segmentation and Span-based Constituency Parsing by adding extra labels to individual Chinese characters on the parse trees. Through experiments, the proposed algorithm outperforms the recent models for joint segmentation and constituency parsing on CTB 5.1.

Related papers

Parsing Through Boundaries in Chinese Word Segmentation [4.74872130711676]
Unlike English, Chinese lacks explicit word boundaries, making segmentation both necessary and inherently ambiguous. This study highlights the intricate relationship between word segmentation and syntactic parsing, providing a clearer understanding of how different segmentation strategies shape dependency structures in Chinese.
arXiv Detail & Related papers (2025-03-29T14:24:02Z)
A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to resolve the text into multi concepts for multilingual semantic matching to liberate the model from the reliance on NER models. We conduct comprehensive experiments on English datasets QQP and MRPC, and Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z)
Discourse Representation Structure Parsing for Chinese [8.846860617823005]
We explore the feasibility of Chinese semantic parsing in the absence of labeled data for Chinese meaning representations. We propose a test suite designed explicitly for Chinese semantic parsing, which provides fine-grained evaluation for parsing performance. Our experimental results show that the difficulty of Chinese semantic parsing is mainly caused by adverbs.
arXiv Detail & Related papers (2023-06-16T09:47:45Z)
On Parsing as Tagging [66.31276017088477]
We show how to reduce tetratagging, a state-of-the-art constituency tagger, to shift--reduce parsing. We empirically evaluate our taxonomy of tagging pipelines with different choices of linearizers, learners, and decoders.
arXiv Detail & Related papers (2022-11-14T13:37:07Z)
A Character-level Span-based Model for Mandarin Prosodic Structure Prediction [36.90699361223442]
We propose a span-based Mandarin prosodic structure prediction model to obtain an optimal prosodic structure tree. Rich linguistic features are provided by Chinese character-level BERT and sent to encoder with self-attention architecture. The proposed method can predict prosodic labels of different levels at the same time and accomplish the process directly from Chinese characters.
arXiv Detail & Related papers (2022-03-31T09:47:08Z)
Joint Chinese Word Segmentation and Part-of-speech Tagging via Two-stage Span Labeling [0.2624902795082451]
We propose a neural model named SpanSegTag for joint Chinese word segmentation and part-of-speech tagging. Our experiments show that our BERT-based model SpanSegTag achieved competitive performances on the CTB5, CTB6, and UD datasets.
arXiv Detail & Related papers (2021-12-17T12:59:02Z)
A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space. We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance. We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ. We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language Model Pretraining [48.880840711568425]
We study the influences of three main factors on the Chinese tokenization for pretrained language models. We propose three kinds of tokenizers: SHUOWEN (meaning Talk Word), the pronunciation-based tokenizers; 2) JIEZI (meaning Solve Character), the glyph-based tokenizers. We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers.
arXiv Detail & Related papers (2021-06-01T11:20:02Z)
Augmenting Part-of-speech Tagging with Syntactic Information for Vietnamese and Chinese [0.32228025627337864]
We implement the idea to improve word segmentation and part of speech tagging of the Vietnamese language by employing a simplified constituency. Our neural model for joint word segmentation and part-of-speech tagging has the architecture of the syllable-based constituency. This model can be augmented with predicted word boundary and part-of-speech tags by other tools.
arXiv Detail & Related papers (2021-02-24T08:57:02Z)
End-to-End Chinese Parsing Exploiting Lexicons [15.786281545363448]
We propose an end-to-end Chinese parsing model based on character inputs which jointly learns to output word segmentation, part-of-speech tags and dependency structures. Our parsing model relies on word-char graph attention networks, which can enrich the character inputs with external word knowledge.
arXiv Detail & Related papers (2020-12-08T12:24:36Z)
2kenize: Tying Subword Sequences for Chinese Script Conversion [54.33749520569979]
We propose a model that can disambiguate between mappings and convert between the two scripts. Our proposed method outperforms previous Chinese Character conversion approaches by 6 points in accuracy.
arXiv Detail & Related papers (2020-05-07T10:53:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.