Revisiting and Advancing Chinese Natural Language Understanding with
Accelerated Heterogeneous Knowledge Pre-training
- URL: http://arxiv.org/abs/2210.05287v2
- Date: Wed, 12 Oct 2022 02:11:10 GMT
- Title: Revisiting and Advancing Chinese Natural Language Understanding with
Accelerated Heterogeneous Knowledge Pre-training
- Authors: Taolin Zhang, Junwei Dong, Jianing Wang, Chengyu Wang, Ang Wang,
Yinghui Liu, Jun Huang, Yong Li, Xiaofeng He
- Abstract summary: Unlike for English, there is a lack of high-performing open-source Chinese KEPLMs in the natural language processing (NLP) community to support various language understanding applications.
Here, we revisit and advance the development of Chinese natural language understanding with a series of novel Chinese KEPLMs released in various parameter sizes.
Specifically, both relational and linguistic knowledge are effectively injected into CKBERT based on two novel pre-training tasks.
- Score: 25.510288465345592
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, knowledge-enhanced pre-trained language models (KEPLMs) have improved
context-aware representations by learning from structured relations in
knowledge graphs and/or linguistic knowledge from syntactic or dependency
analysis. Unlike for English, there is a lack of high-performing open-source
Chinese KEPLMs in the natural language processing (NLP) community to support
various language understanding applications. In this paper, we revisit and
advance the development of Chinese natural language understanding with a series
of novel Chinese KEPLMs released in various parameter sizes, namely CKBERT
(Chinese knowledge-enhanced BERT). Specifically, both relational and linguistic
knowledge are effectively injected into CKBERT based on two novel pre-training
tasks, i.e., linguistic-aware masked language modeling and contrastive
multi-hop relation modeling. Based on the above two pre-training paradigms and
our in-house implemented TorchAccelerator, we have pre-trained base (110M),
large (345M) and huge (1.3B) versions of CKBERT efficiently on GPU clusters.
Experiments demonstrate that CKBERT outperforms strong baselines on various
Chinese benchmark NLP tasks across different model sizes.
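
The abstract names the two pre-training tasks but does not spell out how they are implemented. The following is a minimal, hypothetical sketch of how such objectives are commonly realized, assuming a standard masked-LM head plus an InfoNCE-style contrastive term over knowledge-graph relation paths; the function names, span/negative sampling, and any loss weighting are illustrative assumptions, not taken from the paper or its released code.

```python
# Hypothetical sketch (NOT the authors' code): combining a linguistic-aware
# masked-LM objective with a contrastive loss over multi-hop relation paths.
import torch
import torch.nn.functional as F


def linguistic_aware_mask(input_ids, linguistic_spans, mask_token_id, mask_prob=0.15):
    """Mask whole linguistically salient spans (e.g., entity mentions or phrases
    from a dependency parse) instead of independent random tokens.
    input_ids: LongTensor [batch, seq_len]; linguistic_spans: list of (start, end)."""
    labels = torch.full_like(input_ids, -100)        # -100 is ignored by cross_entropy
    masked = input_ids.clone()
    for start, end in linguistic_spans:              # spans come from an external parser
        if torch.rand(1).item() < mask_prob:
            labels[:, start:end] = input_ids[:, start:end]
            masked[:, start:end] = mask_token_id
    return masked, labels


def contrastive_multi_hop_loss(anchor, positive, negatives, temperature=0.05):
    """InfoNCE-style loss: pull an entity's contextual representation toward an
    embedding of its multi-hop relation path and away from sampled negative paths.
    anchor, positive: [batch, hidden]; negatives: [batch, num_neg, hidden]."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_logits = (anchor * positive).sum(-1, keepdim=True) / temperature      # [batch, 1]
    neg_logits = torch.einsum("bh,bkh->bk", anchor, negatives) / temperature  # [batch, num_neg]
    logits = torch.cat([pos_logits, neg_logits], dim=-1)
    targets = torch.zeros(anchor.size(0), dtype=torch.long)                   # index 0 = positive
    return F.cross_entropy(logits, targets)
```

In a setup like this, the masked-LM cross-entropy and the contrastive term would simply be summed during pre-training; how the two are weighted is a tunable choice, not something specified in the abstract.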
Related papers
- Cross-Lingual NER for Financial Transaction Data in Low-Resource
Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z) - Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z) - Commonsense Knowledge Transfer for Pre-trained Language Models [83.01121484432801]
We introduce commonsense knowledge transfer, a framework to transfer the commonsense knowledge stored in a neural commonsense knowledge model to a general-purpose pre-trained language model.
It first exploits general texts to form queries for extracting commonsense knowledge from the neural commonsense knowledge model.
It then refines the language model with two self-supervised objectives: commonsense mask infilling and commonsense relation prediction.
arXiv Detail & Related papers (2023-06-04T15:44:51Z) - Multi-level Distillation of Semantic Knowledge for Pre-training
Multilingual Language Model [15.839724725094916]
Multi-level Multilingual Knowledge Distillation (MMKD) is a novel method for improving multilingual language models.
We employ a teacher-student framework to adopt rich semantic representation knowledge in English BERT.
We conduct experiments on cross-lingual evaluation benchmarks including XNLI, PAWS-X, and XQuAD.
arXiv Detail & Related papers (2022-11-02T15:23:13Z) - High-resource Language-specific Training for Multilingual Neural Machine
Translation [109.31892935605192]
We propose the multilingual translation model with the high-resource language-specific training (HLT-MT) to alleviate the negative interference.
Specifically, we first train the multilingual model only with the high-resource pairs and select the language-specific modules at the top of the decoder.
HLT-MT is further trained on all available corpora to transfer knowledge from high-resource languages to low-resource languages.
arXiv Detail & Related papers (2022-07-11T14:33:13Z) - Overcoming Language Disparity in Online Content Classification with
Multimodal Learning [22.73281502531998]
Large language models are now the standard to develop state-of-the-art solutions for text detection and classification tasks.
The development of advanced computational techniques and resources is disproportionately focused on the English language.
We explore the promise of incorporating the information contained in images via multimodal machine learning.
arXiv Detail & Related papers (2022-05-19T17:56:02Z) - Linguistic Knowledge in Data Augmentation for Natural Language
Processing: An Example on Chinese Question Matching [0.0]
Two DA programs produce augmented texts using five simple text editing operations.
One is enhanced with an n-gram language model so that it incorporates extra linguistic knowledge.
Models trained on both types of augmented training sets were found to be outperformed by those trained directly on the associated un-augmented training sets.
arXiv Detail & Related papers (2021-11-29T17:07:49Z) - Knowledge Based Multilingual Language Model [44.70205282863062]
We present a novel framework to pretrain knowledge-based multilingual language models (KMLMs).
We generate a large number of code-switched synthetic sentences and reasoning-based multilingual training data using the Wikidata knowledge graphs.
Based on the intra- and inter-sentence structures of the generated data, we design pretraining tasks to facilitate knowledge learning.
arXiv Detail & Related papers (2021-11-22T02:56:04Z) - Prix-LM: Pretraining for Multilingual Knowledge Base Construction [59.02868906044296]
We propose a unified framework, Prix-LM, for multilingual knowledge construction and completion.
We leverage two types of knowledge, monolingual triples and cross-lingual links, extracted from existing multilingual KBs.
Experiments on standard entity-related tasks, such as link prediction in multiple languages, cross-lingual entity linking and bilingual lexicon induction, demonstrate its effectiveness.
arXiv Detail & Related papers (2021-10-16T02:08:46Z) - Improving Massively Multilingual Neural Machine Translation and
Zero-Shot Translation [81.7786241489002]
Massively multilingual models for neural machine translation (NMT) are theoretically attractive, but often underperform bilingual models and deliver poor zero-shot translations.
We argue that multilingual NMT requires stronger modeling capacity to support language pairs with varying typological characteristics.
We propose random online backtranslation to enforce the translation of unseen training language pairs.
arXiv Detail & Related papers (2020-04-24T17:21:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.