SpellGCN: Incorporating Phonological and Visual Similarities into
Language Models for Chinese Spelling Check
- URL: http://arxiv.org/abs/2004.14166v2
- Date: Wed, 13 May 2020 07:23:11 GMT
- Title: SpellGCN: Incorporating Phonological and Visual Similarities into
Language Models for Chinese Spelling Check
- Authors: Xingyi Cheng, Weidi Xu, Kunlong Chen, Shaohua Jiang, Feng Wang,
Taifeng Wang, Wei Chu, Yuan Qi
- Abstract summary: Chinese Spelling Check (CSC) is a task to detect and correct spelling errors in Chinese natural language.
Existing methods have made attempts to incorporate the similarity knowledge between Chinese characters.
This paper proposes to incorporate phonological and visual similarity into language models for CSC via a specialized graph convolutional network (SpellGCN).
- Score: 28.446849414110297
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chinese Spelling Check (CSC) is a task to detect and correct spelling errors
in Chinese natural language. Existing methods have made attempts to incorporate
the similarity knowledge between Chinese characters. However, they take the
similarity knowledge as either an external input resource or just heuristic
rules. This paper proposes to incorporate phonological and visual similarity
knowledge into language models for CSC via a specialized graph convolutional
network (SpellGCN). The model builds a graph over the characters, and SpellGCN
is learned to map this graph into a set of inter-dependent character
classifiers. These classifiers are applied to the representations extracted by
another network, such as BERT, enabling the whole network to be end-to-end
trainable. Experiments are conducted on three human-annotated datasets (the
dataset and all code for this paper are available at
https://github.com/ACL2020SpellGCN/SpellGCN). Our method achieves superior
performance against previous models by a large margin.
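As a rough sketch of the mechanism described above, the snippet below runs a small graph convolution over a character-similarity graph and uses the resulting node vectors as inter-dependent character classifiers on top of BERT-style hidden states. The graph construction, layer sizes, and names are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class SpellGCNSketch(nn.Module):
    """Minimal sketch: a GCN over a character-similarity graph yields one
    classifier vector per character; the classifiers are inter-dependent
    because information propagates along phonological/visual edges."""

    def __init__(self, adj_norm, char_embeddings, hidden_size=768, num_layers=2):
        super().__init__()
        # adj_norm: (V, V) normalized adjacency built from similarity pairs
        # (the preprocessing that builds it is assumed, not shown here).
        self.register_buffer("adj", adj_norm)
        self.register_buffer("node_init", char_embeddings)  # (V, H) initial node features
        self.weights = nn.ParameterList(
            [nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.02)
             for _ in range(num_layers)]
        )

    def forward(self, bert_hidden):
        h = self.node_init                       # (V, H)
        for w in self.weights:
            h = torch.relu(self.adj @ h @ w)     # propagate similarity information
        # Each row of h acts as the classifier for one character.
        return bert_hidden @ h.t()               # (B, T, H) x (H, V) -> (B, T, V)


# Toy usage with a made-up 6-character vocabulary.
V, H = 6, 768
model = SpellGCNSketch(torch.eye(V), torch.randn(V, H), hidden_size=H)
logits = model(torch.randn(2, 5, H))             # (batch, seq_len, vocab) scores
```

Because the classifier vectors are produced by the graph, gradients flow through both the encoder and the similarity graph, which is what makes the whole network end-to-end trainable.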
Related papers
- Egalitarian Language Representation in Language Models: It All Begins with Tokenizers [0.0]
We show that not all tokenizers offer fair representation for complex script languages such as Tamil, Sinhala, and Hindi.
We introduce an improvement to the Byte Pair algorithm by incorporating graphemes, which we term Grapheme Pair.
Our experiments show that grapheme-based character extraction outperforms byte-level tokenizers for complex scripts.
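For intuition only, here is a minimal sketch of grapheme-aware pair counting, using the third-party `regex` package's \X (extended grapheme cluster) pattern for segmentation; the corpus and counting step are assumptions, not the paper's algorithm.

```python
# Count adjacent grapheme pairs (rather than byte pairs) as a starting point
# for BPE-style merges. The third-party `regex` package's \X pattern keeps
# combining marks attached to their base characters.
from collections import Counter

import regex


def grapheme_pairs(word):
    graphemes = regex.findall(r"\X", word)
    return list(zip(graphemes, graphemes[1:]))


corpus = ["தமிழ்", "සිංහල", "हिन्दी"]  # Tamil, Sinhala, Hindi samples
pair_counts = Counter(p for word in corpus for p in grapheme_pairs(word))
print(pair_counts.most_common(3))
```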
arXiv Detail & Related papers (2024-09-17T19:05:37Z)
- Chinese Text Recognition with A Pre-Trained CLIP-Like Model Through Image-IDS Aligning [61.34060587461462]
We propose a two-stage framework for Chinese Text Recognition (CTR).
We pre-train a CLIP-like model by aligning printed character images with Ideographic Description Sequences (IDS).
This pre-training stage simulates humans recognizing Chinese characters and obtains the canonical representation of each character.
The learned representations are employed to supervise the CTR model, such that traditional single-character recognition can be improved to text-line recognition.
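A schematic of the alignment objective such pre-training typically relies on, written as a symmetric CLIP-style contrastive loss; the image and IDS encoders are assumed to exist elsewhere, and this is not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def image_ids_alignment_loss(image_emb, ids_emb, temperature=0.07):
    """CLIP-style contrastive loss: each printed-character image is pulled
    toward its own IDS embedding and pushed away from the other pairs."""
    image_emb = F.normalize(image_emb, dim=-1)        # (N, D)
    ids_emb = F.normalize(ids_emb, dim=-1)            # (N, D)
    logits = image_emb @ ids_emb.t() / temperature    # (N, N); diagonal = matching pairs
    targets = torch.arange(image_emb.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```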
arXiv Detail & Related papers (2023-09-03T05:33:16Z)
- Chinese Financial Text Emotion Mining: GCGTS -- A Character Relationship-based Approach for Simultaneous Aspect-Opinion Pair Extraction [7.484918031250864]
Aspect-Opinion Pair Extraction (AOPE) from Chinese financial texts is a specialized task in fine-grained text sentiment analysis.
Previous studies have mainly focused on developing grid annotation schemes within grid-based models to facilitate this extraction process.
We propose a novel method called Graph-based Character-level Grid Tagging Scheme (GCGTS).
The GCGTS method explicitly incorporates syntactic structure using Graph Convolutional Networks (GCN) and unifies the encoding of characters within the same semantic unit (Chinese word level).
arXiv Detail & Related papers (2023-08-04T02:20:56Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language Model Pretraining [48.880840711568425]
We study the influence of three main factors on Chinese tokenization for pretrained language models.
We propose linguistically informed tokenizers, including 1) SHUOWEN (meaning Talk Word), pronunciation-based tokenizers, and 2) JIEZI (meaning Solve Character), glyph-based tokenizers.
We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers.
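For a concrete feel of the pronunciation-based view, characters can be mapped to pinyin syllables (here with the third-party pypinyin package) before any subword learning; this is an illustrative assumption, not the paper's tokenizer.

```python
# Sketch of a pronunciation-based view of Chinese text: map each character to
# its pinyin syllable before any subword learning. Requires the third-party
# pypinyin package; this is not the SHUOWEN tokenizer itself.
from pypinyin import lazy_pinyin

sentence = "拼写检查"
syllables = lazy_pinyin(sentence)      # ['pin', 'xie', 'jian', 'cha']
print(list(zip(sentence, syllables)))
```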
arXiv Detail & Related papers (2021-06-01T11:20:02Z)
- Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking [20.74049189959078]
We propose a Chinese spell checker called ReaLiSe that directly leverages the multimodal information of Chinese characters.
The ReaLiSe model tackles the CSC task by (1) capturing the semantic, phonetic, and graphic information of the input characters, and (2) mixing the information in these modalities to predict the correct output.
Experiments on the SIGHAN benchmarks show that the proposed model outperforms strong baselines by a large margin.
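A toy illustration of step (2), mixing the three modalities with a learned gate; the encoders that would produce the semantic, phonetic, and graphic features are stubbed out, and this is not the ReaLiSe architecture itself.

```python
import torch
import torch.nn as nn


class GatedModalityMixer(nn.Module):
    """Toy fusion: combine semantic, phonetic, and graphic character features
    with a learned gate before predicting the corrected character."""

    def __init__(self, hidden=768, vocab=6000):
        super().__init__()
        self.gate = nn.Linear(3 * hidden, 3)
        self.classifier = nn.Linear(hidden, vocab)

    def forward(self, semantic, phonetic, graphic):                     # each (B, T, H)
        stacked = torch.stack([semantic, phonetic, graphic], dim=-2)    # (B, T, 3, H)
        weights = torch.softmax(
            self.gate(torch.cat([semantic, phonetic, graphic], dim=-1)), dim=-1
        )                                                               # (B, T, 3)
        fused = (weights.unsqueeze(-1) * stacked).sum(dim=-2)           # (B, T, H)
        return self.classifier(fused)                                   # (B, T, vocab)
```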
arXiv Detail & Related papers (2021-05-26T02:38:11Z)
- Cross-lingual Text Classification with Heterogeneous Graph Neural Network [2.6936806968297913]
Cross-lingual text classification aims at training a classifier on the source language and transferring the knowledge to target languages.
Recent multilingual pretrained language models (mPLM) achieve impressive results in cross-lingual classification tasks.
We propose a simple yet effective method to incorporate heterogeneous information within and across languages for cross-lingual text classification.
arXiv Detail & Related papers (2021-05-24T12:45:42Z)
- BERT is to NLP what AlexNet is to CV: Can Pre-Trained Language Models Identify Analogies? [35.381345454627]
We analyze the capabilities of transformer-based language models on an unsupervised task of identifying analogies.
Off-the-shelf language models can identify analogies to a certain extent, but struggle with abstract and complex relations.
Our results raise important questions for future work about how, and to what extent, pre-trained language models capture knowledge about abstract semantic relations.
arXiv Detail & Related papers (2021-05-11T11:38:49Z)
- Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese Pre-trained Language Models [62.41139712595334]
We propose a novel pre-training paradigm for Chinese -- Lattice-BERT.
We construct a lattice graph from the characters and words in a sentence and feed all these text units into transformers.
We show that our model can bring an average increase of 1.5% under the 12-layer setting.
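As a rough sketch of the lattice idea, the snippet below enumerates character and word units with their character-level spans, the kind of multi-granularity input a lattice model attends over; the toy lexicon is made up, and this is not Lattice-BERT's preprocessing.

```python
# Build a simple character/word lattice for one sentence. Each unit keeps its
# character-level span so a transformer can attend over all units at once.
# The toy lexicon below is illustrative only.
def build_lattice(sentence, lexicon):
    units = [(ch, i, i + 1) for i, ch in enumerate(sentence)]   # single characters
    for i in range(len(sentence)):
        for j in range(i + 2, len(sentence) + 1):
            word = sentence[i:j]
            if word in lexicon:
                units.append((word, i, j))                       # multi-character words
    return units


lexicon = {"北京", "天安门", "北京天安门"}
print(build_lattice("北京天安门", lexicon))
# characters ('北', 0, 1) ... ('门', 4, 5) plus ('北京', 0, 2),
# ('北京天安门', 0, 5), and ('天安门', 2, 5)
```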
arXiv Detail & Related papers (2021-04-15T02:36:49Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
- GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and Event Extraction [107.8262586956778]
We introduce graph convolutional networks (GCNs) with universal dependency parses to learn language-agnostic sentence representations.
GCNs struggle to model words with long-range dependencies or words that are not directly connected in the dependency tree.
We propose to utilize the self-attention mechanism to learn the dependencies between words with different syntactic distances.
arXiv Detail & Related papers (2020-10-06T20:30:35Z)