MECT: Multi-Metadata Embedding based Cross-Transformer for Chinese Named Entity Recognition
- URL: http://arxiv.org/abs/2107.05418v1
- Date: Mon, 12 Jul 2021 13:39:06 GMT
- Title: MECT: Multi-Metadata Embedding based Cross-Transformer for Chinese Named Entity Recognition
- Authors: Shuang Wu, Xiaoning Song and Zhenhua Feng
- Abstract summary: This paper presents a novel Multi-metadata Embedding based Cross-Transformer (MECT) to improve the performance of Chinese NER.
Specifically, we use multi-metadata embedding in a two-stream Transformer to integrate Chinese character features with the radical-level embedding.
With the structural characteristics of Chinese characters, MECT can better capture the semantic information of Chinese characters for NER.
- Score: 21.190288516462704
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recently, word enhancement has become very popular for Chinese Named Entity
Recognition (NER), reducing segmentation errors and increasing the semantic and
boundary information of Chinese words. However, these methods tend to ignore
the information of the Chinese character structure after integrating the
lexical information. Chinese characters have evolved from pictographs since
ancient times, and their structure often reflects more information about the
characters. This paper presents a novel Multi-metadata Embedding based
Cross-Transformer (MECT) to improve the performance of Chinese NER by fusing
the structural information of Chinese characters. Specifically, we use
multi-metadata embedding in a two-stream Transformer to integrate Chinese
character features with the radical-level embedding. With the structural
characteristics of Chinese characters, MECT can better capture the semantic
information of Chinese characters for NER. The experimental results obtained on
several well-known benchmarking datasets demonstrate the merits and superiority
of the proposed MECT method. The source code of the proposed method is
publicly available at https://github.com/CoderMusou/MECT4CNER.
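The two-stream design admits a compact sketch. Below is a minimal, illustrative PyTorch rendering of one cross-attention step, in which the character stream queries the radical stream and vice versa; module names, dimensions, and the fusion step are our assumptions rather than the authors' implementation (see the repository above for the official code):

```python
# Minimal sketch of a two-stream cross-attention block in the spirit of MECT.
# All names and dimensions are illustrative assumptions; the official code is
# at https://github.com/CoderMusou/MECT4CNER.
import torch
import torch.nn as nn

class CrossTransformerBlock(nn.Module):
    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        # One attention module per stream: queries come from one stream,
        # keys/values from the other, so each stream attends to the other.
        self.char_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.rad_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.char_norm = nn.LayerNorm(d_model)
        self.rad_norm = nn.LayerNorm(d_model)

    def forward(self, char_emb: torch.Tensor, rad_emb: torch.Tensor):
        # char_emb, rad_emb: (batch, seq_len, d_model) character-level and
        # radical-level embeddings of the same sentence.
        c, _ = self.char_attn(query=char_emb, key=rad_emb, value=rad_emb)
        r, _ = self.rad_attn(query=rad_emb, key=char_emb, value=char_emb)
        return self.char_norm(char_emb + c), self.rad_norm(rad_emb + r)

# Usage: fuse the two streams before a CRF or softmax tagging head.
block = CrossTransformerBlock()
chars = torch.randn(2, 10, 128)     # character features (e.g. with lexicon info)
radicals = torch.randn(2, 10, 128)  # radical-level structural features
c_out, r_out = block(chars, radicals)
fused = torch.cat([c_out, r_out], dim=-1)  # (2, 10, 256), input to the NER head
```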
Related papers
- Efficient and Scalable Chinese Vector Font Generation via Component Composition [13.499566877003408]
We introduce the first efficient and scalable Chinese vector font generation approach via component composition.
We propose a framework based on spatial transformer networks (STN) and multiple losses tailored to font characteristics.
Our experiments demonstrate that our method significantly surpasses state-of-the-art vector font generation methods.
arXiv Detail & Related papers (2024-04-10T06:39:18Z)
- Chinese Text Recognition with A Pre-Trained CLIP-Like Model Through Image-IDS Aligning [61.34060587461462]
We propose a two-stage framework for Chinese Text Recognition (CTR).
We pre-train a CLIP-like model by aligning printed character images and Ideographic Description Sequences (IDS).
This pre-training stage simulates humans recognizing Chinese characters and obtains the canonical representation of each character.
The learned representations are employed to supervise the CTR model, such that traditional single-character recognition can be improved to text-line recognition.
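For intuition, the alignment stage can be sketched as a CLIP-style symmetric contrastive (InfoNCE) loss between image and IDS embeddings; the encoders producing the inputs below are placeholders, not the paper's architecture:

```python
# Sketch of CLIP-style contrastive alignment between character images and
# IDS sequences. The embeddings are assumed to come from placeholder encoders,
# not the paper's actual models.
import torch
import torch.nn.functional as F

def clip_alignment_loss(img_emb: torch.Tensor, ids_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    # img_emb, ids_emb: (batch, dim); row i of each encodes the same character.
    img = F.normalize(img_emb, dim=-1)
    ids = F.normalize(ids_emb, dim=-1)
    logits = img @ ids.t() / temperature    # (batch, batch) similarity matrix
    targets = torch.arange(len(logits))     # matching pairs lie on the diagonal
    # Symmetric InfoNCE: image->IDS and IDS->image cross-entropy.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```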
arXiv Detail & Related papers (2023-09-03T05:33:16Z)
- mCL-NER: Cross-Lingual Named Entity Recognition via Multi-view Contrastive Learning [54.523172171533645]
Cross-lingual named entity recognition (CrossNER) faces challenges stemming from uneven performance due to the scarcity of multilingual corpora.
We propose Multi-view Contrastive Learning for Cross-lingual Named Entity Recognition (mCL-NER).
Our experiments on the XTREME benchmark, spanning 40 languages, demonstrate the superiority of mCL-NER over prior data-driven and model-based approaches.
arXiv Detail & Related papers (2023-08-17T16:02:29Z)
- Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training [50.100992353488174]
We introduce CDBERT, a new learning paradigm that enhances the semantic understanding ability of Chinese PLMs with dictionary knowledge and the structure of Chinese characters.
We name the two core modules of CDBERT as Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries.
Our paradigm demonstrates consistent improvements on previous Chinese PLMs across all tasks.
arXiv Detail & Related papers (2023-05-30T05:48:36Z)
- MFE-NER: Multi-feature Fusion Embedding for Chinese Named Entity Recognition [9.226556208419256]
We propose a lightweight method, MFE-NER, which fuses glyph and phonetic features.
In the glyph domain, we disassemble Chinese characters into Five-Stroke components to represent structural features.
In the phonetic domain, we propose an improved phonetic system that makes it practical to describe phonetic similarity among Chinese characters.
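As an illustration of this kind of fusion, here is a minimal sketch that simply concatenates character, Five-Stroke, and phonetic embeddings per token; the module and vocabulary names are our assumptions, not MFE-NER's actual design:

```python
# Illustrative fusion of semantic, glyph (Five-Stroke) and phonetic embeddings.
# Lookup tables and dimensions are assumptions, not MFE-NER's actual design.
import torch
import torch.nn as nn

class MultiFeatureEmbedding(nn.Module):
    def __init__(self, vocab: int, strokes: int, phones: int, dim: int = 64):
        super().__init__()
        self.char = nn.Embedding(vocab, dim)     # standard character embedding
        self.stroke = nn.Embedding(strokes, dim) # Five-Stroke structure codes
        self.phone = nn.Embedding(phones, dim)   # phonetic (pronunciation) codes

    def forward(self, char_ids, stroke_ids, phone_ids):
        # Concatenate the three views into one fused representation per token.
        return torch.cat([self.char(char_ids),
                          self.stroke(stroke_ids),
                          self.phone(phone_ids)], dim=-1)
```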
arXiv Detail & Related papers (2021-09-16T11:16:43Z)
- ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information [32.70080326854314]
We propose ChineseBERT, which incorporates the glyph and pinyin information of Chinese characters into language model pretraining.
The proposed ChineseBERT model yields a significant performance boost over baseline models with fewer training steps.
arXiv Detail & Related papers (2021-06-30T13:06:00Z)
- SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language Model Pretraining [48.880840711568425]
We study the influences of three main factors on the Chinese tokenization for pretrained language models.
We propose linguistically informed tokenizers, including: 1) SHUOWEN (meaning Talk Word), pronunciation-based tokenizers, and 2) JIEZI (meaning Solve Character), glyph-based tokenizers.
We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers.
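A toy pronunciation-based tokenizer in the spirit of SHUOWEN can be sketched with the third-party pypinyin package; this is an illustrative assumption, not the paper's tokenizer:

```python
# Toy pronunciation-based tokenization: map each character to its pinyin, so
# identical-sounding characters share tokens. Requires the pypinyin package;
# this is our illustration, not the SHUOWEN tokenizer itself.
from pypinyin import lazy_pinyin

def pinyin_tokenize(text: str) -> list[str]:
    # lazy_pinyin returns tone-free pinyin, one item per character.
    return lazy_pinyin(text)

print(pinyin_tokenize("中文分词"))  # ['zhong', 'wen', 'fen', 'ci']
```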
arXiv Detail & Related papers (2021-06-01T11:20:02Z)
- LET: Linguistic Knowledge Enhanced Graph Transformer for Chinese Short Text Matching [29.318730227080675]
We introduce HowNet as an external knowledge base and propose a Linguistic knowledge Enhanced graph Transformer (LET) to deal with word ambiguity.
Experimental results on two Chinese datasets show that our models outperform various typical text matching approaches.
arXiv Detail & Related papers (2021-02-25T04:01:51Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
Because gold labels are unavailable for translated text, we propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for the translated text in the target language.
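This loss has a compact form. A minimal sketch, assuming a teacher whose detached predictions on the translated text serve as fixed soft pseudo-labels; function and variable names are our own, not FILTER's code:

```python
# Sketch of a KL-divergence self-teaching loss on soft pseudo-labels.
# Names are illustrative assumptions; see the FILTER paper for exact details.
import torch.nn.functional as F

def self_teaching_loss(student_logits, teacher_logits, temperature: float = 1.0):
    # teacher_logits: predictions on translated target-language text, detached
    # so they act as fixed soft pseudo-labels rather than trainable targets.
    soft_labels = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(soft_labels || student) pushes the student toward the teacher.
    return F.kl_div(log_probs, soft_labels, reduction="batchmean")
```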
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
- 2kenize: Tying Subword Sequences for Chinese Script Conversion [54.33749520569979]
We propose a model that can disambiguate between one-to-many character mappings and convert between the two Chinese scripts (Simplified and Traditional).
Our proposed method outperforms previous Chinese character conversion approaches by 6 points in accuracy.
arXiv Detail & Related papers (2020-05-07T10:53:05Z)