CLOWER: A Pre-trained Language Model with Contrastive Learning over Word
and Character Representations
- URL: http://arxiv.org/abs/2208.10844v1
- Date: Tue, 23 Aug 2022 09:52:34 GMT
- Title: CLOWER: A Pre-trained Language Model with Contrastive Learning over Word
and Character Representations
- Authors: Borun Chen, Hongyin Tang, Jingang Wang, Qifan Wang, Hai-Tao Zheng, Wei
Wu and Liqian Yu
- Abstract summary: Pre-trained Language Models (PLMs) have achieved remarkable performance gains across numerous downstream tasks in natural language understanding.
Most current models use Chinese characters as inputs and are not able to encode semantic information contained in Chinese words.
We propose a simple yet effective PLM CLOWER, which adopts the Contrastive Learning Over Word and charactER representations.
- Score: 18.780841483220986
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained Language Models (PLMs) have achieved remarkable performance gains
across numerous downstream tasks in natural language understanding. Various
Chinese PLMs have been successively proposed for learning better Chinese
language representation. However, most current models use Chinese characters as
inputs and are not able to encode semantic information contained in Chinese
words. While recent pre-trained models incorporate both words and characters
simultaneously, they usually suffer from deficient semantic interactions and
fail to capture the semantic relation between words and characters. To address
the above issues, we propose a simple yet effective PLM CLOWER, which adopts
the Contrastive Learning Over Word and charactER representations. In
particular, CLOWER implicitly encodes the coarse-grained information (i.e.,
words) into the fine-grained representations (i.e., characters) through
contrastive learning on multi-grained information. CLOWER is of great value in
realistic scenarios since it can be easily incorporated into any existing
fine-grained PLMs without modifying the production pipelines. Extensive
experiments conducted on a range of downstream tasks demonstrate the superior
performance of CLOWER over several state-of-the-art baselines.
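To make the core idea concrete, here is a minimal, hypothetical sketch of contrastive alignment between word-level and character-level sentence representations with an InfoNCE-style loss in PyTorch. The pooling strategy, dimensions, and temperature are illustrative assumptions, not CLOWER's actual implementation.

```python
# Hypothetical sketch (not the authors' code): InfoNCE-style contrastive
# alignment between character-level and word-level sentence representations.
import torch
import torch.nn.functional as F

def multi_grained_contrastive_loss(char_repr, word_repr, temperature=0.05):
    """char_repr, word_repr: (batch, dim) pooled outputs of a character-based
    encoder and a word-segmented encoder for the same batch of sentences.
    Matching rows are positives; every other row in the batch is a negative."""
    char_repr = F.normalize(char_repr, dim=-1)
    word_repr = F.normalize(word_repr, dim=-1)
    logits = char_repr @ word_repr.t() / temperature      # (batch, batch) cosine similarities
    targets = torch.arange(char_repr.size(0), device=char_repr.device)
    # Symmetric loss: each character view should retrieve its word view and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage: random tensors stand in for pooled encoder outputs.
chars = torch.randn(8, 768)   # e.g. [CLS] pooling of the character-level view
words = torch.randn(8, 768)   # e.g. mean pooling of the word-level view of the same sentences
print(multi_grained_contrastive_loss(chars, words).item())
```

Because a loss of this kind only touches pooled encoder outputs during pre-training, the character-based encoder can be deployed unchanged at inference time, which is consistent with the abstract's claim that CLOWER can be incorporated into existing fine-grained PLMs without modifying production pipelines.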
Related papers
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- Character, Word, or Both? Revisiting the Segmentation Granularity for
Chinese Pre-trained Language Models [42.75756994523378]
We propose a mixed-granularity Chinese BERT (MigBERT) by considering both characters and words.
We conduct extensive experiments on various Chinese NLP tasks to evaluate existing PLMs as well as the proposed MigBERT.
MigBERT achieves new SOTA performance on all these tasks.
arXiv Detail & Related papers (2023-03-20T06:20:03Z)
- Exploiting Word Semantics to Enrich Character Representations of Chinese
Pre-trained Models [12.0190584907439]
We propose a new method to exploit word structure and integrate lexical semantics into character representations of pre-trained models.
We show that our approach achieves superior performance over the basic pre-trained models BERT, BERT-wwm and ERNIE on different Chinese NLP tasks.
arXiv Detail & Related papers (2022-07-13T02:28:08Z)
- LICHEE: Improving Language Model Pre-training with Multi-grained
Tokenization [19.89228774074371]
We propose a simple yet effective pre-training method named LICHEE to efficiently incorporate multi-grained information of input text.
Our method can be applied to various pre-trained language models and improve their representation capability.
arXiv Detail & Related papers (2021-08-02T12:08:19Z)
- Understanding Chinese Video and Language via Contrastive Multimodal
Pre-Training [79.88705563918413]
We propose a novel video-language understanding framework named VICTOR, which stands for VIdeo-language understanding via Contrastive mulTimOdal pRe-training.
VICTOR is trained on a large-scale Chinese video-language dataset, including over 10 million complete videos with corresponding high-quality textual descriptions.
arXiv Detail & Related papers (2021-04-19T15:58:45Z)
- Constrained Language Models Yield Few-Shot Semantic Parsers [73.50960967598654]
We explore the use of large pretrained language models as few-shot semantic parsers.
The goal in semantic parsing is to generate a structured meaning representation given a natural language input.
We use language models to paraphrase inputs into a controlled sublanguage resembling English that can be automatically mapped to a target meaning representation.
arXiv Detail & Related papers (2021-04-18T08:13:06Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- MVP-BERT: Redesigning Vocabularies for Chinese BERT and Multi-Vocab
Pretraining [5.503321733964237]
We first propose a novel method, seg_tok, to form the vocabulary of Chinese BERT with the help of Chinese word segmentation (CWS) and subword tokenization.
Experiments show that seg_tok not only improves the performance of Chinese PLMs on sentence-level tasks, but also improves efficiency.
arXiv Detail & Related papers (2020-11-17T10:15:36Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language
Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We further propose a KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
- AMBERT: A Pre-trained Language Model with Multi-Grained Tokenization [13.082435183692393]
We propose a novel pre-trained language model, referred to as AMBERT (A Multi-grained BERT).
For English, AMBERT takes both the sequence of words (fine-grained tokens) and the sequence of phrases (coarse-grained tokens) as input after tokenization.
Experiments have been conducted on benchmark datasets for Chinese and English, including CLUE, GLUE, SQuAD and RACE.
arXiv Detail & Related papers (2020-08-27T00:23:48Z)
- Byte Pair Encoding is Suboptimal for Language Model Pretraining [49.30780227162387]
We analyze differences between unigram LM tokenization and byte-pair encoding (BPE).
We find that the unigram LM tokenization method matches or outperforms BPE across downstream tasks and two languages.
We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
arXiv Detail & Related papers (2020-04-07T21:21:06Z)
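As a toy illustration of the BPE-versus-unigram comparison in the last entry above, the sketch below trains both tokenizer types on the same tiny corpus and segments an unseen word with each. It uses the Hugging Face tokenizers library with a made-up corpus and vocabulary size; the paper's experiments, by contrast, train on full pretraining corpora.

```python
# Toy illustration (not the paper's setup): train a BPE and a unigram LM
# tokenizer on the same tiny corpus and compare segmentations of an unseen word.
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Made-up corpus, repeated so the trainers have enough counts to work with.
corpus = [
    "unbelievable unbelievably believable believer believed",
    "the researchers pretrained the language model on raw text",
    "tokenization splits words into smaller subword units",
] * 20

def train(model, trainer):
    tok = Tokenizer(model)
    tok.pre_tokenizer = pre_tokenizers.Whitespace()
    tok.train_from_iterator(corpus, trainer)
    return tok

bpe = train(models.BPE(unk_token="[UNK]"),
            trainers.BpeTrainer(vocab_size=120, special_tokens=["[UNK]"]))
uni = train(models.Unigram(),
            trainers.UnigramTrainer(vocab_size=120, special_tokens=["[UNK]"],
                                    unk_token="[UNK]"))

# Segment a word that never appears in the corpus; on a toy corpus the two
# outputs may or may not differ, but the mechanism of each algorithm is visible.
word = "believability"
print("BPE    :", bpe.encode(word).tokens)   # applies the learned merge sequence
print("Unigram:", uni.encode(word).tokens)   # picks the most probable segmentation
```

The design difference is that the unigram LM model scores whole candidate segmentations probabilistically rather than applying a fixed merge sequence; the paper reports that this matches or outperforms BPE downstream and argues that its subwords align more closely with morphology.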