2kenize: Tying Subword Sequences for Chinese Script Conversion
- URL: http://arxiv.org/abs/2005.03375v1
- Date: Thu, 7 May 2020 10:53:05 GMT
- Title: 2kenize: Tying Subword Sequences for Chinese Script Conversion
- Authors: Pranav A, Isabelle Augenstein
- Abstract summary: We propose a model that can disambiguate between mappings and convert between the two scripts.
Our proposed method outperforms previous Chinese Character conversion approaches by 6 points in accuracy.
- Score: 54.33749520569979
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Simplified Chinese to Traditional Chinese character conversion is a common
preprocessing step in Chinese NLP. Despite this, current approaches have poor
performance because they do not take into account that a simplified Chinese
character can correspond to multiple traditional characters. Here, we propose a
model that can disambiguate between mappings and convert between the two
scripts. The model is based on subword segmentation, two language models, as
well as a method for mapping between subword sequences. We further construct
benchmark datasets for topic classification and script conversion. Our proposed
method outperforms previous Chinese Character conversion approaches by 6 points
in accuracy. These results are further confirmed in a downstream application,
where 2kenize is used to convert a pretraining dataset for topic classification.
An error analysis reveals that our method's particular strengths are in dealing
with code-mixing and named entities.
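As a rough illustration of the disambiguation problem described above, the sketch below maps each simplified character to its candidate traditional characters and keeps the candidate sequence that a language model scores highest. The mapping table and bigram scores are invented stand-ins; 2kenize itself works over subword sequences with two language models rather than character bigrams.

```python
# Toy ambiguity-aware Simplified-to-Traditional conversion. Not the authors'
# implementation: the mapping table and bigram scores are illustrative only.
from itertools import product

# A simplified character can map to several traditional candidates.
S2T = {"发": ["發", "髮"], "理": ["理"]}

# Invented bigram scores standing in for a traditional-Chinese language model.
BIGRAM = {("理", "髮"): 0.9, ("理", "發"): 0.1}

def convert(simplified: str) -> str:
    """Enumerate candidate conversions and keep the highest-scoring one."""
    candidates = [S2T.get(ch, [ch]) for ch in simplified]
    def score(seq):
        return sum(BIGRAM.get(pair, 0.0) for pair in zip(seq, seq[1:]))
    return "".join(max(product(*candidates), key=score))

print(convert("理发"))  # "理髮" (haircut), not the incorrect "理發"
```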
Related papers
- Tokenization as Finite-State Transduction [24.19959327497118]
We introduce a finite-state framework which can efficiently encode all possible tokenizations of a regular language.
We show that Byte-Pair Encoding (BPE) and MaxMatch (WordPiece) fit within this framework.
An application of this is to guided generation, where the outputs of a language model are constrained to match some pattern.
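As a toy illustration of the object this framework encodes, the snippet below enumerates every way a string splits into pieces from a fixed vocabulary; the paper represents this set compactly as a finite-state machine instead of listing it, and the vocabulary here is invented.

```python
# Enumerate all tokenizations of a string under a toy subword vocabulary.
from functools import lru_cache

VOCAB = {"un", "hap", "py", "happy", "unhappy"}

def tokenizations(text: str):
    @lru_cache(maxsize=None)
    def helper(i: int):
        # All segmentations of text[i:] into vocabulary pieces.
        if i == len(text):
            return [[]]
        results = []
        for j in range(i + 1, len(text) + 1):
            piece = text[i:j]
            if piece in VOCAB:
                results.extend([piece] + rest for rest in helper(j))
        return results
    return helper(0)

for toks in tokenizations("unhappy"):
    print(toks)   # ['un', 'hap', 'py'], ['un', 'happy'], ['unhappy']
```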
arXiv Detail & Related papers (2024-10-21T07:10:07Z)
- CharSS: Character-Level Transformer Model for Sanskrit Word Segmentation [39.08623113730563]
Subword tokens in Indian languages inherently carry meaning, and isolating them can enhance NLP tasks.
We propose a new approach of utilizing a character-level Transformer model for Sanskrit Word Segmentation (CharSS).
We perform experiments on three benchmark datasets to compare the performance of our method against existing methods.
arXiv Detail & Related papers (2024-07-08T18:50:13Z)
- BenchCLAMP: A Benchmark for Evaluating Language Models on Syntactic and Semantic Parsing [55.058258437125524]
We introduce BenchCLAMP, a Benchmark to evaluate Constrained LAnguage Model Parsing.
We benchmark eight language models, including two GPT-3 variants available only through an API.
Our experiments show that encoder-decoder pretrained language models can achieve similar performance or surpass state-of-the-art methods for syntactic and semantic parsing when the model output is constrained to be valid.
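The sketch below shows the generic mechanism behind such constrained decoding: logits for tokens that a grammar does not allow after the current prefix are masked out before sampling. The vocabulary, prefix acceptor, and logits are invented and do not reflect BenchCLAMP's actual interface.

```python
# Mask logits so only grammar-valid continuations can be generated.
import math

VOCAB = ["(", ")", "capital_of", "Denmark", "<eos>"]

def allowed_next(prefix):
    """Toy prefix acceptor: well-bracketed expressions, <eos> only when balanced."""
    depth = prefix.count("(") - prefix.count(")")
    if depth == 0 and prefix:
        return {"<eos>"}
    if depth == 0:
        return {"("}
    return {"(", ")", "capital_of", "Denmark"}

def mask_logits(logits, prefix):
    """Set disallowed tokens to -inf so they receive zero probability."""
    mask = allowed_next(prefix)
    return [x if tok in mask else -math.inf for tok, x in zip(VOCAB, logits)]

raw = [0.1, 0.2, 1.5, 0.7, 2.0]
print(mask_logits(raw, ["("]))   # <eos> is ruled out, however high its raw score
```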
arXiv Detail & Related papers (2022-06-21T18:34:11Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topic models trained with merged tokens produce topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
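A minimal sketch of the collocation-merging step, under invented data and thresholds, is given below: adjacent word pairs that are frequent and have high pointwise mutual information are joined into single tokens before topic modeling. The paper's metric for comparing the resulting topic models is not reproduced here.

```python
# Merge frequent, high-PMI adjacent word pairs into single tokens.
import math
from collections import Counter

docs = [
    "new york subway delays".split(),
    "new york housing prices".split(),
    "housing prices rise again".split(),
]

unigrams = Counter(w for d in docs for w in d)
bigrams = Counter(pair for d in docs for pair in zip(d, d[1:]))
total = sum(unigrams.values())

def pmi(pair):
    w1, w2 = pair
    return math.log((bigrams[pair] / total) / ((unigrams[w1] / total) * (unigrams[w2] / total)))

collocations = {p for p in bigrams if bigrams[p] >= 2 and pmi(p) > 0}

def merge(doc):
    out, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) in collocations:
            out.append(doc[i] + "_" + doc[i + 1])   # treat the pair as one token
            i += 2
        else:
            out.append(doc[i])
            i += 1
    return out

print([merge(d) for d in docs])   # "new_york" and "housing_prices" are merged
```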
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- MECT: Multi-Metadata Embedding based Cross-Transformer for Chinese Named Entity Recognition [21.190288516462704]
This paper presents a novel Multi-metadata Embedding based Cross-Transformer (MECT) to improve the performance of Chinese NER.
Specifically, we use multi-metadata embedding in a two-stream Transformer to integrate Chinese character features with the radical-level embedding.
With the structural characteristics of Chinese characters, MECT can better capture the semantic information of Chinese characters for NER.
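The sketch below conveys only the basic intuition of enriching character representations with radical-level features, here by simple concatenation; MECT itself fuses the two views with a two-stream cross-Transformer, and the radical table and vectors are invented.

```python
# Characters sharing a radical share part of their representation.
RADICAL = {"江": "氵", "河": "氵", "你": "亻"}          # toy character-to-radical table
CHAR_VEC = {"江": [0.2, 0.1], "河": [0.3, 0.0], "你": [0.9, 0.8]}
RADICAL_VEC = {"氵": [1.0, 0.0], "亻": [0.0, 1.0]}

def embed(char):
    """Concatenate a character vector with its radical vector."""
    c = CHAR_VEC.get(char, [0.0, 0.0])
    r = RADICAL_VEC.get(RADICAL.get(char, ""), [0.0, 0.0])
    return c + r

print(embed("江"))  # [0.2, 0.1, 1.0, 0.0]: the last two features encode the water radical
```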
arXiv Detail & Related papers (2021-07-12T13:39:06Z)
- Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word Alignment [49.45399359826453]
Cross-lingual language models are typically pretrained with language modeling on multilingual text or parallel sentences.
We introduce denoising word alignment as a new cross-lingual pre-training task.
Experimental results show that our method improves cross-lingual transferability on various datasets.
arXiv Detail & Related papers (2021-06-11T13:36:01Z)
- SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language Model Pretraining [48.880840711568425]
We study the influences of three main factors on the Chinese tokenization for pretrained language models.
We propose linguistically informed tokenizers, including: 1) SHUOWEN (meaning Talk Word), the pronunciation-based tokenizers, and 2) JIEZI (meaning Solve Character), the glyph-based tokenizers.
We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers.
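As a toy contrast between the two views, the snippet below transcribes a string into a pronunciation-based and a glyph-based representation; the lookup tables are invented miniatures, and the actual SHUOWEN and JIEZI tokenizers learn subword vocabularies over such transcriptions.

```python
# Pronunciation-based vs. glyph-based views of the same Chinese string.
PINYIN = {"中": "zhong", "国": "guo", "人": "ren"}                   # pronunciation view
COMPONENTS = {"中": ["口", "丨"], "国": ["囗", "玉"], "人": ["人"]}   # glyph (component) view

def pronunciation_view(text):
    return [PINYIN.get(ch, ch) for ch in text]

def glyph_view(text):
    return [c for ch in text for c in COMPONENTS.get(ch, [ch])]

print(pronunciation_view("中国人"))  # ['zhong', 'guo', 'ren']
print(glyph_view("中国人"))          # ['口', '丨', '囗', '玉', '人']
```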
arXiv Detail & Related papers (2021-06-01T11:20:02Z)
- LET: Linguistic Knowledge Enhanced Graph Transformer for Chinese Short Text Matching [29.318730227080675]
We introduce HowNet as an external knowledge base and propose a Linguistic knowledge Enhanced graph Transformer (LET) to deal with word ambiguity.
Experimental results on two Chinese datasets show that our models outperform various typical text matching approaches.
arXiv Detail & Related papers (2021-02-25T04:01:51Z)
- Lexicon-constrained Copying Network for Chinese Abstractive Summarization [0.0]
The copy mechanism allows sequence-to-sequence models to choose words from the input and put them directly into the output.
Most existing models for Chinese abstractive summarization can only perform character copy.
We propose a lexicon-constrained copying network that models multi-granularity in both memory and decoder.
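The sketch below shows a generic copy mechanism of this kind: the output distribution mixes a vocabulary (generation) distribution with an attention-weighted copy distribution over source units at both word and character granularity. All probabilities are placeholders, not outputs of the proposed network.

```python
# final(w) = p_gen * P_vocab(w) + (1 - p_gen) * attention mass on copies of w
def mix_distributions(p_gen, vocab_dist, copy_attn, source_units):
    final = {w: p_gen * p for w, p in vocab_dist.items()}
    for unit, attn in zip(source_units, copy_attn):
        final[unit] = final.get(unit, 0.0) + (1.0 - p_gen) * attn
    return final

source_units = ["上海", "上", "海", "证券"]            # word- and character-level copy candidates
copy_attn    = [0.6, 0.1, 0.1, 0.2]                    # attention over the source
vocab_dist   = {"公司": 0.7, "上海": 0.2, "股票": 0.1}  # generator's vocabulary distribution

print(mix_distributions(0.5, vocab_dist, copy_attn, source_units))
```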
arXiv Detail & Related papers (2020-10-16T06:59:34Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We additionally propose a KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
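As a rough illustration of a KL-divergence self-teaching loss of the kind described above, the sketch below measures how far the prediction for the translated target-language text drifts from auto-generated soft pseudo-labels; the label set and distributions are invented numbers.

```python
# KL(pseudo-labels || target-language prediction) as a self-teaching loss.
import math

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same label set."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)

soft_pseudo_labels = {"entail": 0.7, "neutral": 0.2, "contradict": 0.1}
target_lang_pred   = {"entail": 0.5, "neutral": 0.3, "contradict": 0.2}

print(f"self-teaching KL loss: {kl_divergence(soft_pseudo_labels, target_lang_pred):.4f}")
```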
This list is automatically generated from the titles and abstracts of the papers on this site.