Linguistic Knowledge in Data Augmentation for Natural Language
Processing: An Example on Chinese Question Matching
- URL: http://arxiv.org/abs/2111.14709v1
- Date: Mon, 29 Nov 2021 17:07:49 GMT
- Title: Linguistic Knowledge in Data Augmentation for Natural Language
Processing: An Example on Chinese Question Matching
- Authors: Zhengxiang Wang
- Abstract summary: The two DA programs produce augmented texts via five simple text editing operations.
One is enhanced with an n-gram language model so that it is infused with extra linguistic knowledge.
Models trained on both types of augmented train sets were outperformed by those trained directly on the corresponding un-augmented train sets.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data augmentation (DA) is a common solution to data scarcity and imbalance
problems and an area receiving increasing attention from the Natural
Language Processing (NLP) community. While various DA techniques have been used
in NLP research, little is known about the role of linguistic knowledge in DA
for NLP; in particular, whether more linguistic knowledge leads to a better DA
approach. To investigate that, we designed two adapted DA programs and applied
them to LCQMC (a Large-scale Chinese Question Matching Corpus) for a binary
Chinese question matching classification task. The two DA programs produce
augmented texts by five simple text editing operations, largely irrespective of
language generation rules, but one is enhanced with an n-gram language model to
infuse it with extra linguistic knowledge. We then trained four neural
network models and a pre-trained model on the LCQMC train sets of varying size
as well as the corresponding augmented train sets produced by the two DA
programs. The test set performances of the five classification models show that
adding probabilistic linguistic knowledge as constraints does not make the base
DA program better, since there are no discernible performance differences
between the models trained on the two types of augmented train sets. Instead,
since the added linguistic knowledge decreases the diversity of the augmented
texts, the trained models' generalizability is hampered. Moreover, models
trained on both types of augmented train sets were found to be
outperformed by those directly trained on the associated un-augmented train
sets, due to the inability of the underlying text editing operations to make
paraphrastic augmented texts. We concluded that the validity and diversity of
the augmented texts are two important factors for a DA approach or technique to
be effective and proposed a possible paradigm shift for text augmentation.
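The abstract names five simple text editing operations and an n-gram language model constraint but does not spell out their implementation here. The sketch below is a minimal illustration, assuming EDA-style edits (synonym replacement, random swap, random insertion, random deletion, and a mixed edit) and a toy add-one-smoothed bigram model that ranks candidate edits; the function names, the candidate-ranking scheme, and the smoothing are assumptions, not the paper's code.

```python
import math
import random
from collections import defaultdict

# Illustrative stand-ins for the paper's five text editing operations
# (the abstract does not enumerate them; these are EDA-style edits).
def synonym_replace(tokens, synonyms, n=1):
    out = tokens[:]
    candidates = [i for i, t in enumerate(out) if t in synonyms]
    for i in random.sample(candidates, min(n, len(candidates))):
        out[i] = random.choice(synonyms[out[i]])
    return out

def random_swap(tokens, n=1):
    out = tokens[:]
    for _ in range(n):
        if len(out) > 1:
            i, j = random.sample(range(len(out)), 2)
            out[i], out[j] = out[j], out[i]
    return out

def random_insert(tokens, vocab, n=1):
    out = tokens[:]
    for _ in range(n):
        out.insert(random.randrange(len(out) + 1), random.choice(vocab))
    return out

def random_delete(tokens, p=0.1):
    kept = [t for t in tokens if random.random() > p]
    return kept or tokens[:1]

def random_mix(tokens, synonyms, vocab):
    op = random.choice([
        lambda t: synonym_replace(t, synonyms),
        random_swap,
        lambda t: random_insert(t, vocab),
        random_delete,
    ])
    return op(tokens)

# Toy bigram language model with add-one smoothing, standing in for the
# n-gram model used by the linguistically enhanced DA program.
class BigramLM:
    def __init__(self, corpus):
        self.bigrams = defaultdict(int)
        self.unigrams = defaultdict(int)
        self.vocab = set()
        for sent in corpus:
            for a, b in zip(["<s>"] + sent, sent + ["</s>"]):
                self.bigrams[(a, b)] += 1
                self.unigrams[a] += 1
                self.vocab.update((a, b))

    def log_score(self, tokens):
        v = len(self.vocab) + 1
        return sum(
            math.log((self.bigrams[(a, b)] + 1) / (self.unigrams[a] + v))
            for a, b in zip(["<s>"] + tokens, tokens + ["</s>"])
        )

def augment(tokens, synonyms, vocab, lm=None, k=5):
    """Base program: return one random edit. Enhanced program: generate k
    candidate edits and keep the one the n-gram model scores highest."""
    candidates = [random_mix(tokens, synonyms, vocab) for _ in range(k)]
    if lm is None:
        return random.choice(candidates)
    return max(candidates, key=lm.log_score)
```

Under this reading, the only difference between the two programs is the final selection step: the base program returns a random candidate, while the enhanced program keeps the candidate the bigram model scores highest, which is consistent with the abstract's observation that the constraint reduces the diversity of the augmented texts.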
Related papers
- VECO 2.0: Cross-lingual Language Model Pre-training with
Multi-granularity Contrastive Learning [56.47303426167584]
We propose a cross-lingual pre-trained model VECO2.0 based on contrastive learning with multi-granularity alignments.
Specifically, the sequence-to-sequence alignment is induced to maximize the similarity of the parallel pairs and minimize the non-parallel pairs.
Token-to-token alignment is integrated to bridge the gap between synonymous tokens, mined from a thesaurus dictionary, and the other unpaired tokens in a bilingual instance.
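For illustration, the sequence-level part of that objective can be read as an in-batch contrastive loss over paired sentence embeddings: parallel pairs are pulled together while non-parallel combinations in the batch are pushed apart. The PyTorch sketch below assumes such embeddings already exist; the temperature, the symmetric formulation, and the use of in-batch negatives are assumptions rather than VECO 2.0's actual training recipe.

```python
import torch
import torch.nn.functional as F

def sequence_contrastive_loss(src_emb: torch.Tensor,
                              tgt_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style loss over a batch of (source, target) sentence embeddings.
    Row i of src_emb and row i of tgt_emb form a parallel pair; every other
    row combination acts as a non-parallel (negative) pair."""
    src = F.normalize(src_emb, dim=-1)              # (B, d)
    tgt = F.normalize(tgt_emb, dim=-1)              # (B, d)
    logits = src @ tgt.t() / temperature            # (B, B) cosine similarities
    labels = torch.arange(src.size(0), device=src.device)
    # Symmetric version: align source-to-target and target-to-source.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```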
arXiv Detail & Related papers (2023-04-17T12:23:41Z) - A Cohesive Distillation Architecture for Neural Language Models [0.0]
A recent trend in Natural Language Processing is the exponential growth in Language Model (LM) size.
This study investigates methods for Knowledge Distillation (KD) to provide efficient alternatives to large-scale models.
arXiv Detail & Related papers (2023-01-12T08:01:53Z) - Revisiting and Advancing Chinese Natural Language Understanding with
Accelerated Heterogeneous Knowledge Pre-training [25.510288465345592]
Unlike for English, the natural language processing (NLP) community lacks high-performing open-source Chinese KEPLMs to support various language understanding applications.
Here, we revisit and advance the development of Chinese natural language understanding with a series of novel Chinese KEPLMs released in various parameter sizes.
Specifically, both relational and linguistic knowledge is effectively injected into CKBERT based on two novel pre-training tasks.
arXiv Detail & Related papers (2022-10-11T09:34:21Z) - Scheduled Multi-task Learning for Neural Chat Translation [66.81525961469494]
We propose a scheduled multi-task learning framework for Neural Chat Translation (NCT).
Specifically, we devise a three-stage training framework to incorporate the large-scale in-domain chat translation data into training.
Extensive experiments in four language directions verify the effectiveness and superiority of the proposed approach.
arXiv Detail & Related papers (2022-05-08T02:57:28Z) - Learning to Generalize to More: Continuous Semantic Augmentation for
Neural Machine Translation [50.54059385277964]
We present a novel data augmentation paradigm termed Continuous Semantic Augmentation (CsaNMT).
CsaNMT augments each training instance with an adjacency region that could cover adequate variants of literal expression under the same meaning.
arXiv Detail & Related papers (2022-04-14T08:16:28Z) - Towards Generalized Models for Task-oriented Dialogue Modeling on Spoken
Conversations [22.894541507068933]
This paper presents our approach to build generalized models for the Knowledge-grounded Task-oriented Dialogue Modeling on Spoken Conversations Challenge of DSTC-10.
We employ extensive data augmentation strategies on written data, including artificial error injection and round-trip text-speech transformation.
Our approach ranks third on the objective evaluation and second on the final official human evaluation.
arXiv Detail & Related papers (2022-03-08T12:26:57Z) - To Augment or Not to Augment? A Comparative Study on Text Augmentation
Techniques for Low-Resource NLP [0.0]
We investigate three categories of text augmentation methodologies which perform changes on the syntax.
We compare them on part-of-speech tagging, dependency parsing and semantic role labeling for a diverse set of language families.
Our results suggest that the augmentation techniques can further improve over strong baselines based on mBERT.
arXiv Detail & Related papers (2021-11-18T10:52:48Z) - Improving the Lexical Ability of Pretrained Language Models for
Unsupervised Neural Machine Translation [127.81351683335143]
Cross-lingual pretraining requires models to align the lexical- and high-level representations of the two languages.
Previous research has shown that this is because the representations are not sufficiently aligned.
In this paper, we enhance the bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings.
arXiv Detail & Related papers (2021-03-18T21:17:58Z) - Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z) - Data Augmentation for Spoken Language Understanding via Pretrained
Language Models [113.56329266325902]
Training of spoken language understanding (SLU) models often faces the problem of data scarcity.
We put forward a data augmentation method using pretrained language models to boost the variability and accuracy of generated utterances.
arXiv Detail & Related papers (2020-04-29T04:07:12Z)
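The summary above only says that a pretrained language model is used to generate utterance variants; how the actual method conditions on intents and slots is not described here. As a loose illustration of the general mechanism, assuming the Hugging Face transformers package and a generic GPT-2 model (both choices made for this sketch, not taken from the paper), one can sample surface variants of seed utterances like this:

```python
from transformers import pipeline  # assumes the Hugging Face transformers package

# Illustrative only: a generic causal LM is prompted with seed utterances
# to produce surface variants; the paper's conditioning scheme may differ.
generator = pipeline("text-generation", model="gpt2")

def augment_utterances(seed_utterances, n_variants=3):
    augmented = []
    for utt in seed_utterances:
        prompt = f'Paraphrase of the request "{utt}":'
        outputs = generator(
            prompt,
            max_new_tokens=20,
            num_return_sequences=n_variants,
            do_sample=True,
            top_p=0.95,
        )
        for out in outputs:
            text = out["generated_text"][len(prompt):].strip()
            if text:
                augmented.append(text)
    return augmented

print(augment_utterances(["book a table for two at noon"]))
```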
This list is automatically generated from the titles and abstracts of the papers in this site.