An In-depth Study on Internal Structure of Chinese Words
- URL: http://arxiv.org/abs/2106.00334v1
- Date: Tue, 1 Jun 2021 09:09:51 GMT
- Title: An In-depth Study on Internal Structure of Chinese Words
- Authors: Chen Gong, Saihao Huang, Houquan Zhou, Zhenghua Li, Min Zhang, Zhefeng
Wang, Baoxing Huai, Nicholas Jing Yuan
- Abstract summary: This work proposes to model the deep internal structures of Chinese words as dependency trees with 11 labels for distinguishing syntactic relationships.
We manually annotate a word-internal structure treebank (WIST) consisting of over 30K multi-char words from the Chinese Penn Treebank.
We present a detailed analysis of WIST to reveal insights into Chinese word formation.
- Score: 34.864343591706984
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unlike English letters, Chinese characters have rich and specific meanings.
Usually, the meaning of a word can be derived from its constituent characters
in some way. Several previous works on syntactic parsing propose to annotate
shallow word-internal structures to better utilize character-level
information. This work proposes to model the deep internal structures of
Chinese words as dependency trees with 11 labels for distinguishing syntactic
relationships. First, based on newly compiled annotation guidelines, we
manually annotate a word-internal structure treebank (WIST) consisting of over
30K multi-char words from the Chinese Penn Treebank. To guarantee quality, each
word is independently annotated by two annotators and inconsistencies are
handled by a third senior annotator. Second, we present a detailed analysis
of WIST to reveal insights into Chinese word formation.
Third, we propose word-internal structure parsing as a new task, and conduct
benchmark experiments using a competitive dependency parser. Finally, we
present two simple ways to encode word-internal structures, leading to
promising gains on the sentence-level syntactic parsing task.
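
As a concrete illustration of the kind of data the abstract describes, the sketch below shows one way a multi-character word could be stored as a labeled character-level dependency tree. This is a minimal sketch, not the WIST format: the class, the example analysis of 足球场 ("football field"), and the label strings are illustrative assumptions, and the actual 11-label inventory is defined in the paper's annotation guidelines.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordInternalTree:
    """A multi-character word analysed as a character-level dependency tree.

    heads[i] is the 1-based position of the head of chars[i]; 0 marks the
    root character of the word.  The label strings used below are
    illustrative only, not the 11-label inventory defined for WIST.
    """
    chars: List[str]
    heads: List[int]
    labels: List[str]

    def arcs(self):
        """Yield human-readable "head -label-> modifier" arcs."""
        for i, (h, lab) in enumerate(zip(self.heads, self.labels), start=1):
            head = "ROOT" if h == 0 else self.chars[h - 1]
            yield f"{head} -{lab}-> {self.chars[i - 1]}"


# Hypothetical analysis of 足球场 "football field":
# 足 (foot) modifies 球 (ball), and 足球 (football) modifies 场 (field).
word = WordInternalTree(
    chars=["足", "球", "场"],
    heads=[2, 3, 0],
    labels=["mod", "mod", "root"],
)
print(list(word.arcs()))
# ['球 -mod-> 足', '场 -mod-> 球', 'ROOT -root-> 场']
```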
Related papers
- Integrating Supertag Features into Neural Discontinuous Constituent Parsing [0.0]
Traditional views of constituency demand that constituents consist of adjacent words, which is at odds with the discontinuous constituents common in languages like German.
Transition-based parsing produces trees given raw text input using supervised learning on large annotated corpora.
arXiv Detail & Related papers (2024-10-11T12:28:26Z)
- Character-Level Chinese Dependency Parsing via Modeling Latent Intra-Word Structure [11.184330703168893]
This paper proposes modeling latent internal structures within words in Chinese.
A constrained Eisner algorithm is implemented to ensure the compatibility of character-level trees.
A detailed analysis reveals that a coarse-to-fine parsing strategy empowers the model to predict more linguistically plausible intra-word structures.
arXiv Detail & Related papers (2024-06-06T06:23:02Z)
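
The entry above mentions a constrained Eisner algorithm; the constraints themselves are not spelled out in the summary, so the following is a minimal sketch of the standard, unconstrained Eisner dynamic program for projective dependency parsing over an arc-score matrix. The function name and the toy scores are assumptions for illustration only.

```python
from typing import List


def eisner(scores: List[List[float]]) -> List[int]:
    """Best projective dependency tree via Eisner's O(n^3) dynamic program.

    scores[h][m] is the score of an arc from head h to modifier m, where
    token 0 is an artificial root.  Returns the head index of tokens 1..n.
    """
    n = len(scores) - 1
    NEG = float("-inf")
    # C = complete spans, I = incomplete spans, indexed [s][t][d];
    # d = 1: head is the left endpoint s, d = 0: head is the right endpoint t.
    C = [[[NEG, NEG] for _ in range(n + 1)] for _ in range(n + 1)]
    I = [[[NEG, NEG] for _ in range(n + 1)] for _ in range(n + 1)]
    bC = [[[0, 0] for _ in range(n + 1)] for _ in range(n + 1)]
    bI = [[[0, 0] for _ in range(n + 1)] for _ in range(n + 1)]
    for s in range(n + 1):
        C[s][s][0] = C[s][s][1] = 0.0

    for length in range(1, n + 1):
        for s in range(n + 1 - length):
            t = s + length
            # Incomplete spans: pick a split r, then add the arc s->t or t->s.
            best, best_r = NEG, s
            for r in range(s, t):
                cand = C[s][r][1] + C[r + 1][t][0]
                if cand > best:
                    best, best_r = cand, r
            I[s][t][1] = best + scores[s][t]   # arc s -> t
            I[s][t][0] = best + scores[t][s]   # arc t -> s
            bI[s][t][1] = bI[s][t][0] = best_r
            # Complete span headed by s: an arc item plus a complete right part.
            for r in range(s + 1, t + 1):
                cand = I[s][r][1] + C[r][t][1]
                if cand > C[s][t][1]:
                    C[s][t][1], bC[s][t][1] = cand, r
            # Complete span headed by t: a complete left part plus an arc item.
            for r in range(s, t):
                cand = C[s][r][0] + I[r][t][0]
                if cand > C[s][t][0]:
                    C[s][t][0], bC[s][t][0] = cand, r

    heads = [0] * (n + 1)

    def backtrack(s: int, t: int, d: int, complete: bool) -> None:
        if s == t:
            return
        if complete:
            r = bC[s][t][d]
            if d == 1:
                backtrack(s, r, 1, False)
                backtrack(r, t, 1, True)
            else:
                backtrack(s, r, 0, True)
                backtrack(r, t, 0, False)
        else:
            if d == 1:
                heads[t] = s
            else:
                heads[s] = t
            r = bI[s][t][d]
            backtrack(s, r, 1, True)
            backtrack(r + 1, t, 0, True)

    backtrack(0, n, 1, True)
    return heads[1:]


# Toy 3-token example: the highest-scoring projective tree attaches tokens 1
# and 2 to token 3, and token 3 to the artificial root.
toy = [
    [0.0, 1.0, 1.0, 5.0],
    [0.0, 0.0, 2.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 6.0, 7.0, 0.0],
]
print(eisner(toy))  # -> [3, 3, 0]
```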
- Structured Dialogue Discourse Parsing [79.37200787463917]
Dialogue discourse parsing aims to uncover the internal structure of a multi-participant conversation.
We propose a principled method that improves upon previous work from two perspectives: encoding and decoding.
Experiments show that our method achieves new state-of-the-art, surpassing the previous model by 2.3 on STAC and 1.5 on Molweni.
arXiv Detail & Related papers (2023-06-26T22:51:01Z)
- Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training [50.100992353488174]
We introduce CDBERT, a new learning paradigm that enhances the semantic understanding ability of Chinese PLMs with dictionary knowledge and the structure of Chinese characters.
We name the two core modules of CDBERT as Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries.
Our paradigm demonstrates consistent improvements on previous Chinese PLMs across all tasks.
arXiv Detail & Related papers (2023-05-30T05:48:36Z)
- Exploiting Word Semantics to Enrich Character Representations of Chinese Pre-trained Models [12.0190584907439]
We propose a new method to exploit word structure and integrate lexical semantics into character representations of pre-trained models.
We show that our approach achieves superior performance over the basic pre-trained models BERT, BERT-wwm and ERNIE on different Chinese NLP tasks.
arXiv Detail & Related papers (2022-07-13T02:28:08Z)
- Incorporating Constituent Syntax for Coreference Resolution [50.71868417008133]
We propose a graph-based method to incorporate constituent syntactic structures.
We also explore utilising higher-order neighbourhood information to encode rich structures in constituent trees.
Experiments on the English and Chinese portions of OntoNotes 5.0 benchmark show that our proposed model either beats a strong baseline or achieves new state-of-the-art performance.
arXiv Detail & Related papers (2022-02-22T07:40:42Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language Model Pretraining [48.880840711568425]
We study the influences of three main factors on the Chinese tokenization for pretrained language models.
We propose three kinds of tokenizers, including 1) SHUOWEN (meaning Talk Word), the pronunciation-based tokenizers, and 2) JIEZI (meaning Solve Character), the glyph-based tokenizers.
We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers.
arXiv Detail & Related papers (2021-06-01T11:20:02Z)
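
To make the contrast with conventional single-character tokenization concrete, here is a minimal sketch of what a pronunciation-based tokenizer in the spirit of SHUOWEN might look like, assuming each character is first mapped to a pinyin syllable before any further tokenization. The tiny character-to-pinyin table and the function name are hypothetical; the actual SHUOWEN and JIEZI tokenizers are defined in the linked paper.

```python
from typing import List

# Illustrative character-to-pinyin table; a real system would use a full
# transliteration resource (e.g. the pypinyin package) instead.
PINYIN = {"足": "zu", "球": "qiu", "场": "chang", "图": "tu", "书": "shu", "馆": "guan"}


def pronunciation_tokenize(text: str) -> List[str]:
    """Map each character to its pinyin syllable, keeping unknown chars as-is."""
    return [PINYIN.get(ch, ch) for ch in text]


print(pronunciation_tokenize("足球场"))  # -> ['zu', 'qiu', 'chang']
```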
- End-to-End Chinese Parsing Exploiting Lexicons [15.786281545363448]
We propose an end-to-end Chinese parsing model based on character inputs that jointly learns to output word segmentation, part-of-speech tags and dependency structures.
Our parsing model relies on word-char graph attention networks, which can enrich the character inputs with external word knowledge.
arXiv Detail & Related papers (2020-12-08T12:24:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.