LET: Linguistic Knowledge Enhanced Graph Transformer for Chinese Short
Text Matching
- URL: http://arxiv.org/abs/2102.12671v1
- Date: Thu, 25 Feb 2021 04:01:51 GMT
- Title: LET: Linguistic Knowledge Enhanced Graph Transformer for Chinese Short
Text Matching
- Authors: Boer Lyu, Lu Chen, Su Zhu, Kai Yu
- Abstract summary: We introduce HowNet as an external knowledge base and propose a Linguistic knowledge Enhanced graph Transformer (LET) to deal with word ambiguity.
Experimental results on two Chinese datasets show that our models outperform various typical text matching approaches.
- Score: 29.318730227080675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chinese short text matching is a fundamental task in natural language
processing. Existing approaches usually take Chinese characters or words as
input tokens. They have two limitations: 1) Some Chinese words are polysemous,
and semantic information is not fully utilized. 2) Some models suffer potential
issues caused by word segmentation. Here we introduce HowNet as an external
knowledge base and propose a Linguistic knowledge Enhanced graph Transformer
(LET) to deal with word ambiguity. Additionally, we adopt the word lattice
graph as input to maintain multi-granularity information. Our model is also
complementary to pre-trained language models. Experimental results on two
Chinese datasets show that our models outperform various typical text matching
approaches. Ablation study also indicates that both semantic information and
multi-granularity information are important for text matching modeling.
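As a concrete illustration of the lattice input described in the abstract, below is a minimal sketch (in Python, using a hypothetical toy lexicon and helper names) of how a word lattice graph can be built from a Chinese sentence: every character and every lexicon-matched word becomes a node, and nodes are connected when one starts where another ends, so the graph keeps all candidate segmentations at once. The actual LET model additionally attaches HowNet sememe information to each word node and runs a graph transformer over the lattice, which is not shown here.

```python
from dataclasses import dataclass

# Toy lexicon standing in for a real word vocabulary (hypothetical example).
LEXICON = {"南京", "南京市", "市长", "长江", "大桥", "长江大桥"}

@dataclass
class LatticeNode:
    text: str   # surface form of the character or word
    start: int  # index of the first character covered
    end: int    # index one past the last character covered

def build_word_lattice(sentence, lexicon=LEXICON, max_word_len=4):
    """Collect every character plus every lexicon-matched word as lattice nodes."""
    nodes = [LatticeNode(ch, i, i + 1) for i, ch in enumerate(sentence)]
    for i in range(len(sentence)):
        for j in range(i + 2, min(i + max_word_len, len(sentence)) + 1):
            if sentence[i:j] in lexicon:
                nodes.append(LatticeNode(sentence[i:j], i, j))
    # Connect a node to every node that starts exactly where it ends,
    # so each path through the graph corresponds to one possible segmentation.
    edges = [(a, b) for a in nodes for b in nodes if a.end == b.start]
    return nodes, edges

nodes, edges = build_word_lattice("南京市长江大桥")
print([n.text for n in nodes])  # characters plus matched multi-character words
```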
Related papers
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is their ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems in domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- Character, Word, or Both? Revisiting the Segmentation Granularity for
Chinese Pre-trained Language Models [42.75756994523378]
We propose a mixed-granularity Chinese BERT (MigBERT) by considering both characters and words.
We conduct extensive experiments on various Chinese NLP tasks to evaluate existing PLMs as well as the proposed MigBERT.
MigBERT achieves new SOTA performance on all these tasks.
arXiv Detail & Related papers (2023-03-20T06:20:03Z)
- CLOWER: A Pre-trained Language Model with Contrastive Learning over Word
and Character Representations [18.780841483220986]
Pre-trained Language Models (PLMs) have achieved remarkable performance gains across numerous downstream tasks in natural language understanding.
Most current models use Chinese characters as inputs and are not able to encode semantic information contained in Chinese words.
We propose a simple yet effective PLM, CLOWER, which adopts Contrastive Learning Over Word and charactER representations.
arXiv Detail & Related papers (2022-08-23T09:52:34Z)
- Exploiting Word Semantics to Enrich Character Representations of Chinese
Pre-trained Models [12.0190584907439]
We propose a new method to exploit word structure and integrate lexical semantics into character representations of pre-trained models.
We show that our approach achieves superior performance over the basic pre-trained models BERT, BERT-wwm and ERNIE on different Chinese NLP tasks.
arXiv Detail & Related papers (2022-07-13T02:28:08Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet
Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin
Information [32.70080326854314]
We propose ChineseBERT, which incorporates the glyph and pinyin information of Chinese characters into language model pretraining.
The proposed ChineseBERT model yields significant performance boost over baseline models with fewer training steps.
arXiv Detail & Related papers (2021-06-30T13:06:00Z)
- SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language
Model Pretraining [48.880840711568425]
We study the influence of three main factors on Chinese tokenization for pretrained language models.
We propose linguistically informed tokenizers: 1) SHUOWEN (meaning Talk Word), the pronunciation-based tokenizers, and 2) JIEZI (meaning Solve Character), the glyph-based tokenizers.
We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers.
arXiv Detail & Related papers (2021-06-01T11:20:02Z)
- Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese
Pre-trained Language Models [62.41139712595334]
We propose a novel pre-training paradigm for Chinese -- Lattice-BERT.
We construct a lattice graph from the characters and words in a sentence and feed all these text units into transformers.
We show that our model can bring an average increase of 1.5% under the 12-layer setting.
arXiv Detail & Related papers (2021-04-15T02:36:49Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language
Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We further propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language (see the sketch after this list).
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
- 2kenize: Tying Subword Sequences for Chinese Script Conversion [54.33749520569979]
We propose a model that can disambiguate between mappings and convert between the two Chinese scripts (Simplified and Traditional).
Our proposed method outperforms previous Chinese character conversion approaches by 6 points in accuracy.
arXiv Detail & Related papers (2020-05-07T10:53:05Z)
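The KL-divergence self-teaching loss mentioned in the FILTER entry above can be pictured as a generic distillation-style term. The sketch below (PyTorch, with hypothetical argument names) is only an assumed general shape, not FILTER's actual implementation: the model's predictions on target-language input are pulled toward soft pseudo-labels derived from the translated text.

```python
import torch
import torch.nn.functional as F

def kl_self_teaching_loss(student_logits, pseudo_label_logits, temperature=1.0):
    """KL term pulling target-language predictions toward soft pseudo-labels.

    student_logits: model predictions on the target-language input.
    pseudo_label_logits: logits used to form soft pseudo-labels for the
    translated text (detached so gradients flow only to the student side).
    """
    teacher_probs = F.softmax(pseudo_label_logits.detach() / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student), averaged over the batch.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# Usage example with random logits for a 3-way classification batch of 4.
loss = kl_self_teaching_loss(torch.randn(4, 3), torch.randn(4, 3))
```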