MixText: Linguistically-Informed Interpolation of Hidden Space for
Semi-Supervised Text Classification
- URL: http://arxiv.org/abs/2004.12239v1
- Date: Sat, 25 Apr 2020 21:37:36 GMT
- Title: MixText: Linguistically-Informed Interpolation of Hidden Space for
Semi-Supervised Text Classification
- Authors: Jiaao Chen, Zichao Yang, Diyi Yang
- Abstract summary: MixText is a semi-supervised learning method for text classification.
TMix creates a large amount of augmented training samples by interpolating text in hidden space.
We leverage recent advances in data augmentation to guess low-entropy labels for unlabeled data.
- Score: 68.15015032551214
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents MixText, a semi-supervised learning method for text
classification, which uses our newly designed data augmentation method called
TMix. TMix creates a large amount of augmented training samples by
interpolating text in hidden space. Moreover, we leverage recent advances in
data augmentation to guess low-entropy labels for unlabeled data, hence making
them as easy to use as labeled data. By mixing labeled, unlabeled and augmented
data, MixText significantly outperformed current pre-trained and fine-tuned
models and other state-of-the-art semi-supervised learning methods on several
text classification benchmarks. The improvement is especially prominent when
supervision is extremely limited. We have publicly released our code at
https://github.com/GT-SALT/MixText.
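
At its core, TMix is mixup applied to intermediate hidden representations rather than to raw text. A minimal sketch of the idea in PyTorch is below; the shapes, the Beta parameter alpha, and the choice of mixing layer are illustrative assumptions, not the exact MixText implementation (see the released code for that).

```python
import torch
import numpy as np

def tmix_hidden(hidden_a, hidden_b, y_a, y_b, alpha=0.75):
    """Mixup on hidden states: interpolate two sequences' representations
    at the same encoder layer, and interpolate their (one-hot) labels with
    the same coefficient. A sketch of the TMix idea, not the authors' code."""
    lam = np.random.beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)          # keep the mix closer to sample a
    mixed_hidden = lam * hidden_a + (1.0 - lam) * hidden_b
    mixed_label = lam * y_a + (1.0 - lam) * y_b
    return mixed_hidden, mixed_label

# Usage (shapes only): hidden states from some middle layer of an encoder
# for two same-sized batches, plus one-hot labels over 4 classes.
h_a = torch.randn(8, 128, 768)   # (batch, seq_len, hidden)
h_b = torch.randn(8, 128, 768)
y_a = torch.eye(4)[torch.randint(0, 4, (8,))]
y_b = torch.eye(4)[torch.randint(0, 4, (8,))]
h_mix, y_mix = tmix_hidden(h_a, h_b, y_a, y_b)
# The mixed hidden states are then fed through the remaining encoder layers
# and the classifier, trained against the mixed labels.
```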
Related papers
- Like a Good Nearest Neighbor: Practical Content Moderation and Text
Classification [66.02091763340094]
Like a Good Nearest Neighbor (LaGoNN) is a modification to SetFit that introduces no learnable parameters but alters input text with information from its nearest neighbor.
LaGoNN is effective at flagging undesirable content and at text classification, and improves the performance of SetFit.
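
A rough sketch of the nearest-neighbor idea: retrieve the closest labeled training example and append its text and label to the input, so the classifier sees extra context without any new parameters. The TF-IDF encoder and the [SEP] concatenation format below are stand-ins; LaGoNN itself builds on SetFit's sentence embeddings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

train_texts = ["you are awful", "have a great day", "I hate this", "lovely work"]
train_labels = ["toxic", "benign", "toxic", "benign"]

# Stand-in encoder; LaGoNN itself uses SetFit sentence embeddings.
vec = TfidfVectorizer().fit(train_texts)
nn = NearestNeighbors(n_neighbors=1).fit(vec.transform(train_texts))

def lagonn_style_input(text):
    """Append the nearest training neighbor's label and text to the input:
    no learnable parameters are added -- only the input string changes."""
    _, idx = nn.kneighbors(vec.transform([text]))
    i = idx[0][0]
    return f"{text} [SEP] {train_labels[i]}, {train_texts[i]}"

print(lagonn_style_input("this is awful"))
```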
arXiv Detail & Related papers (2023-02-17T15:43:29Z)
- SelfMix: Robust Learning Against Textual Label Noise with Self-Mixup Training [15.877178854064708]
SelfMix is a simple yet effective method to handle label noise in text classification tasks.
Our method utilizes the dropout mechanism on a single model to reduce the confirmation bias in self-training.
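
The dropout mechanism can be sketched as two stochastic forward passes through one model: different dropout masks yield different predictions where the model is uncertain, and averaging them gives softer pseudo-labels than a single confident pass. The toy model below is illustrative, not the SelfMix architecture.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(),
                      nn.Dropout(p=0.3), nn.Linear(256, 4))

def pseudo_label(x):
    """Two forward passes with dropout active; averaging the two softmax
    outputs damps the confirmation bias of reusing one confident guess."""
    model.train()                      # keep dropout on
    with torch.no_grad():
        p1 = torch.softmax(model(x), dim=-1)
        p2 = torch.softmax(model(x), dim=-1)
    return (p1 + p2) / 2

x_unlabeled = torch.randn(16, 768)
targets = pseudo_label(x_unlabeled)   # soft targets for self-training
```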
arXiv Detail & Related papers (2022-10-10T09:46:40Z)
- DoubleMix: Simple Interpolation-Based Data Augmentation for Text Classification [56.817386699291305]
This paper proposes a simple yet effective data augmentation approach termed DoubleMix.
DoubleMix first generates several perturbed samples for each training sample.
It then uses the perturbed data and original data to carry out a two-step interpolation in the hidden space of neural models.
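
A sketch of the two-step interpolation, assuming perturbed variants of each sample are already available (e.g. from simple token-level augmentation): step one mixes the perturbed variants, step two mixes the result back into the original's hidden states, which keeps the original label usable. The weights and shapes here are illustrative.

```python
import torch

def doublemix(h_orig, h_pert1, h_pert2, w=(0.6, 0.4), lam=0.8):
    """Two-step interpolation in hidden space (sketch of the DoubleMix idea).
    Step 1: combine perturbed samples; step 2: mix into the original."""
    h_pert = w[0] * h_pert1 + w[1] * h_pert2        # step 1
    return lam * h_orig + (1.0 - lam) * h_pert      # step 2, biased to original

h = torch.randn(8, 128, 768)                       # original hidden states
h1, h2 = torch.randn_like(h), torch.randn_like(h)  # perturbed variants
h_aug = doublemix(h, h1, h2)
# Training proceeds on h_aug with the ORIGINAL label, since the mix stays
# close to the original sample and its perturbations share that label.
```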
arXiv Detail & Related papers (2022-09-12T15:01:04Z)
- Swapping Semantic Contents for Mixing Images [44.0283695495163]
Mixing Data Augmentations do not typically yield new labeled samples, as indiscriminately mixing contents creates between-class samples.
We introduce the SciMix framework, which learns a generator that embeds a semantic style code into image backgrounds.
We demonstrate that SciMix yields novel mixed samples that inherit many characteristics from their non-semantic parents.
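
In code terms, the framework amounts to a generator that decodes a hybrid from one image's background and another's semantic content code, with the label inherited from the semantic parent. The toy module below only illustrates that data flow; it is not the SciMix architecture.

```python
import torch
import torch.nn as nn

class ToyMixer(nn.Module):
    """Toy stand-in: fuse a background image with a semantic style code."""
    def __init__(self, code_dim=64):
        super().__init__()
        self.fuse = nn.Conv2d(3 + code_dim, 3, kernel_size=3, padding=1)

    def forward(self, background, sem_code):
        b, _, h, w = background.shape
        # Broadcast the code over spatial positions, then fuse with the image.
        code_map = sem_code[:, :, None, None].expand(b, -1, h, w)
        return self.fuse(torch.cat([background, code_map], dim=1))

g = ToyMixer()
bg = torch.randn(4, 3, 32, 32)   # non-semantic parent (background)
code = torch.randn(4, 64)        # semantic parent's content code
hybrid = g(bg, code)             # labeled with the semantic parent's class
```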
arXiv Detail & Related papers (2022-05-20T13:07:27Z)
- GUDN A novel guide network for extreme multi-label text classification [12.975260278131078]
This paper constructs a novel guide network (GUDN) that helps fine-tune the pre-trained model to guide the subsequent classification.
We also use the raw label semantics to effectively explore the latent space between texts and labels, which can further improve predicted accuracy.
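
The label-semantics idea, embedding label text with the same encoder as the documents and scoring by similarity in the shared latent space, can be sketched as follows; the toy embedding table is a placeholder for the pre-trained model GUDN actually fine-tunes.

```python
import torch
import torch.nn.functional as F

# Placeholder for a shared pre-trained encoder: a random embedding table
# over a toy vocabulary, mean-pooled over tokens.
vocab = {w: i for i, w in enumerate(
    "sports match team finance market stock health doctor".split())}
emb = torch.nn.Embedding(len(vocab), 32)

def encode(text):
    ids = torch.tensor([vocab[w] for w in text.split() if w in vocab])
    return emb(ids).mean(dim=0)

label_names = ["sports team", "finance market", "health doctor"]
label_vecs = torch.stack([encode(t) for t in label_names])

def score(text):
    """Rank labels by similarity between text and label-text embeddings,
    both living in the same latent space (sketch of the guiding intuition)."""
    return F.cosine_similarity(encode(text)[None], label_vecs, dim=-1)

print(score("stock market match"))
```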
arXiv Detail & Related papers (2022-01-10T07:33:36Z)
- GuidedMix-Net: Learning to Improve Pseudo Masks Using Labeled Images as Reference [153.354332374204]
We propose a novel method for semi-supervised semantic segmentation named GuidedMix-Net.
We first introduce a feature alignment objective between labeled and unlabeled data to capture potentially similar image pairs.
MITrans is shown to be a powerful knowledge module for further progressively refining the features of unlabeled data.
Along with supervised learning on labeled data, the predictions for unlabeled data are learned jointly with the generated pseudo masks.
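
A stripped-down sketch of the joint objective: supervised cross-entropy on labeled images plus a pseudo-mask term that trains unlabeled predictions against their own hardened (argmax) outputs. The pairing of similar images and the MITrans refinement are omitted, and the loss weighting is an assumption.

```python
import torch
import torch.nn.functional as F

def semi_sup_loss(model, x_l, y_l, x_u, weight_u=0.5):
    """Supervised CE on labeled images + CE of unlabeled predictions
    against their own argmax pseudo masks (sketch, no feature alignment)."""
    logits_l = model(x_l)                      # (B, C, H, W)
    loss_l = F.cross_entropy(logits_l, y_l)
    logits_u = model(x_u)
    pseudo = logits_u.detach().argmax(dim=1)   # hardened pseudo mask
    loss_u = F.cross_entropy(logits_u, pseudo)
    return loss_l + weight_u * loss_u

model = torch.nn.Conv2d(3, 5, kernel_size=1)   # toy segmentation head
x_l = torch.randn(2, 3, 16, 16)
y_l = torch.randint(0, 5, (2, 16, 16))
x_u = torch.randn(2, 3, 16, 16)
loss = semi_sup_loss(model, x_l, y_l, x_u)
```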
arXiv Detail & Related papers (2021-06-29T02:48:45Z)
- SSMix: Saliency-Based Span Mixup for Text Classification [2.4493299476776778]
We propose SSMix, a novel mixup method where the operation is performed on input text rather than on hidden vectors.
SSMix synthesizes a sentence while preserving the locality of two original texts by span-based mixing.
We empirically validate that our method outperforms hidden-level mixup methods on a wide range of text classification benchmarks.
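
A sketch of span-based mixing at the token level: replace the least salient span of one text with the most salient span of the other, and weight the mixed label by the share of tokens each source contributes. Real saliency would come from gradient magnitudes with respect to the input embeddings; the hand-set scores below are placeholders.

```python
import numpy as np

def ssmix_spans(tokens_a, sal_a, tokens_b, sal_b, span_len=2):
    """Replace the least-salient span in A with the most-salient span in B;
    the label weight for B is the fraction of mixed tokens taken from B."""
    # Least salient window in A.
    scores_a = [sum(sal_a[i:i+span_len]) for i in range(len(tokens_a)-span_len+1)]
    i = int(np.argmin(scores_a))
    # Most salient window in B.
    scores_b = [sum(sal_b[j:j+span_len]) for j in range(len(tokens_b)-span_len+1)]
    j = int(np.argmax(scores_b))
    mixed = tokens_a[:i] + tokens_b[j:j+span_len] + tokens_a[i+span_len:]
    lam_b = span_len / len(mixed)          # share of label B
    return mixed, lam_b

toks_a = "the movie was painfully dull".split()
toks_b = "an absolutely brilliant performance".split()
mixed, lam_b = ssmix_spans(toks_a, [.1, .2, .1, .9, .8],
                           toks_b, [.1, .7, .9, .6])
print(mixed, lam_b)   # mixed label = (1-lam_b)*y_a + lam_b*y_b
```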
arXiv Detail & Related papers (2021-06-15T11:40:23Z)
- Scene Text Detection with Scribble Lines [59.698806258671105]
We propose to annotate texts by scribble lines instead of polygons for text detection.
It is a general labeling method for texts of various shapes and has a low labeling cost.
Experiments show that the proposed method bridges the performance gap between the weak labeling method and the original polygon-based labeling methods.
arXiv Detail & Related papers (2020-12-09T13:14:53Z)
- DivideMix: Learning with Noisy Labels as Semi-supervised Learning [111.03364864022261]
We propose DivideMix, a framework for learning with noisy labels.
Experiments on multiple benchmark datasets demonstrate substantial improvements over state-of-the-art methods.
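
The dividing step is commonly sketched as a two-component Gaussian mixture over per-sample training losses: clean labels cluster at low loss, so the low-mean component is kept as labeled data and the rest is treated as unlabeled for semi-supervised training. The synthetic losses below stand in for real per-sample losses.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Per-sample training losses (synthetic here): clean samples cluster low,
# noisy-label samples cluster high.
losses = np.concatenate([np.random.normal(0.3, 0.1, 800),
                         np.random.normal(2.0, 0.4, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2).fit(losses)
clean_comp = int(np.argmin(gmm.means_))            # low-loss component
p_clean = gmm.predict_proba(losses)[:, clean_comp]

labeled_set = p_clean > 0.5      # kept with their labels
unlabeled_set = ~labeled_set     # labels discarded, treated as unlabeled
print(labeled_set.sum(), unlabeled_set.sum())
```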
arXiv Detail & Related papers (2020-02-18T06:20:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.