Improving Robustness and Generality of NLP Models Using Disentangled
Representations
- URL: http://arxiv.org/abs/2009.09587v1
- Date: Mon, 21 Sep 2020 02:48:46 GMT
- Title: Improving Robustness and Generality of NLP Models Using Disentangled
Representations
- Authors: Jiawei Wu, Xiaoya Li, Xiang Ao, Yuxian Meng, Fei Wu and Jiwei Li
- Abstract summary: Supervised neural networks first map an input $x$ to a single representation $z$, and then map $z$ to the output label $y$.
We present methods to improve robustness and generality of NLP models from the standpoint of disentangled representation learning.
We show that models trained with the proposed criteria provide better robustness and domain adaptation ability in a wide range of supervised learning tasks.
- Score: 62.08794500431367
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Supervised neural networks, which first map an input $x$ to a single
representation $z$, and then map $z$ to the output label $y$, have achieved
remarkable success in a wide range of natural language processing (NLP) tasks.
Despite their success, neural models lack both robustness and generality: small
perturbations to inputs can result in completely different outputs, and the
performance of a model trained on one domain drops drastically when tested on
another domain.
In this paper, we present methods to improve robustness and generality of NLP
models from the standpoint of disentangled representation learning. Instead of
mapping $x$ to a single representation $z$, the proposed strategy maps $x$ to a
set of representations $\{z_1,z_2,...,z_K\}$ while forcing them to be
disentangled. These representations are then mapped to different logits $l$s,
the ensemble of which is used to make the final prediction $y$. We propose
different methods to incorporate this idea into currently widely-used models,
including adding an $L_2$ regularizer on the $z$s or adding Total Correlation (TC)
under the framework of variational information bottleneck (VIB). We show that
models trained with the proposed criteria provide better robustness and domain
adaptation ability in a wide range of supervised learning tasks.
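The abstract specifies the overall recipe (map $x$ to $K$ disentangled representations, give each its own logits, ensemble the logits, and regularize with an $L_2$ criterion or TC under VIB) but not the exact architecture or regularizer form. The following is only a minimal PyTorch sketch of that recipe: every class, function, and hyperparameter name is illustrative, and the pairwise-similarity penalty is a stand-in for the paper's $L_2$/TC criteria, not their definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledClassifier(nn.Module):
    """Illustrative sketch: encode x into K representations, map each to its own
    logits, and average the logits for the final prediction."""

    def __init__(self, input_dim: int, hidden_dim: int, num_classes: int, K: int = 4):
        super().__init__()
        self.encoders = nn.ModuleList([nn.Linear(input_dim, hidden_dim) for _ in range(K)])
        self.heads = nn.ModuleList([nn.Linear(hidden_dim, num_classes) for _ in range(K)])

    def forward(self, x: torch.Tensor):
        zs = [torch.tanh(enc(x)) for enc in self.encoders]                    # {z_1, ..., z_K}
        logits = torch.stack([head(z) for head, z in zip(self.heads, zs)])    # [K, B, C]
        return logits.mean(dim=0), zs                                         # ensembled logits

def disentangle_penalty(zs):
    """Simple surrogate for 'force the z_k apart': penalize pairwise cosine similarity.
    (A stand-in for the paper's L2/TC regularizers, which the abstract does not spell out.)"""
    loss = 0.0
    for i in range(len(zs)):
        for j in range(i + 1, len(zs)):
            zi = F.normalize(zs[i], dim=-1)
            zj = F.normalize(zs[j], dim=-1)
            loss = loss + (zi * zj).sum(dim=-1).pow(2).mean()
    return loss

# Hypothetical training step: cross-entropy on the ensembled logits plus the penalty.
model = DisentangledClassifier(input_dim=768, hidden_dim=256, num_classes=2, K=4)
x, y = torch.randn(8, 768), torch.randint(0, 2, (8,))
logits, zs = model(x)
loss = F.cross_entropy(logits, y) + 0.1 * disentangle_penalty(zs)
loss.backward()
```

Averaging the per-head logits is the simplest ensembling choice consistent with the abstract; the TC variant would instead penalize the divergence between the joint distribution of the $z_k$ and the product of their marginals under a VIB encoder.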
Related papers
- IT$^3$: Idempotent Test-Time Training [95.78053599609044]
This paper introduces Idempotent Test-Time Training (IT$^3$), a novel approach to addressing the challenge of distribution shift.
IT$^3$ is based on the universal property of idempotence.
We demonstrate the versatility of our approach across various tasks, including corrupted image classification.
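The summary only states that IT$^3$ builds on idempotence, $f(f(x)) = f(x)$. Below is a generic, heavily hedged sketch of how an idempotence gap can serve as a test-time training signal for a model whose output can be fed back as its input; it is not the paper's exact procedure, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def idempotence_gap(f, x):
    """Apply the model twice and measure the deviation from idempotence,
    i.e. how far f(f(x)) is from f(x). Assumes f's output can be re-fed as input."""
    y1 = f(x)
    y2 = f(y1)
    return F.mse_loss(y2, y1.detach())  # treat the first pass as the target

def test_time_adapt(f, x, params, lr=1e-3, steps=1):
    """Hypothetical test-time loop: a few gradient steps that shrink the gap on x."""
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(steps):
        loss = idempotence_gap(f, x)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

# Toy usage with a small autoencoder-like map standing in for the model.
f = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.Tanh(), torch.nn.Linear(16, 16))
x = torch.randn(4, 16)
print(test_time_adapt(f, x, f.parameters()))
```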
arXiv Detail & Related papers (2024-10-05T15:39:51Z) - Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models [22.425339110551743]
We introduce $\textit{weak-to-strong search}$, framing the alignment of a large language model as a test-time greedy search.
In controlled-sentiment generation and summarization, we use tuned and untuned $\texttt{gpt2}$s to improve the alignment of large models without additional training.
In a more difficult instruction-following benchmark, we show that reusing off-the-shelf small models can improve the length-controlled win rates of both white-box and black-box large models.
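The summary frames alignment as a test-time greedy search guided by a small tuned/untuned pair. One natural reading (and only a reading: the chunking scheme, candidate count, and function names below are assumptions) is to draw candidate continuations from the frozen large model and keep whichever one maximizes the small models' log-probability difference. A toy, dependency-free sketch:

```python
import math
import random

def weak_to_strong_greedy(sample_from_large, logp_tuned, logp_untuned,
                          prompt, chunks=5, candidates=4):
    """Hedged sketch: at each step, draw several candidate continuations from the
    frozen large model and keep the one with the largest
    log p_tuned - log p_untuned under the small models."""
    text = prompt
    for _ in range(chunks):
        cands = [sample_from_large(text) for _ in range(candidates)]
        text += max(cands, key=lambda c: logp_tuned(text, c) - logp_untuned(text, c))
    return text

# Toy stand-ins so the sketch runs without any model downloads.
random.seed(0)
vocab = ["good ", "bad ", "fine ", "great "]
sample_from_large = lambda prefix: random.choice(vocab)
logp_tuned = lambda prefix, c: math.log(0.7 if c.startswith("g") else 0.1)  # prefers "good"/"great"
logp_untuned = lambda prefix, c: math.log(0.25)                             # uniform over the toy vocab
print(weak_to_strong_greedy(sample_from_large, logp_tuned, logp_untuned, "The movie was "))
```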
arXiv Detail & Related papers (2024-05-29T16:55:32Z) - One-shot Active Learning Based on Lewis Weight Sampling for Multiple Deep Models [39.582100727546816]
Active learning (AL) for multiple target models aims to reduce labeled data querying while effectively training multiple models concurrently.
Existing AL algorithms often rely on iterative model training, which can be computationally expensive.
We propose a one-shot AL method to address this challenge, which performs all label queries without repeated model training.
arXiv Detail & Related papers (2024-05-23T02:48:16Z) - Supervised Contrastive Prototype Learning: Augmentation Free Robust
Neural Network [17.10753224600936]
Transformations in the input space of Deep Neural Networks (DNN) lead to unintended changes in the feature space.
We propose a training framework, $\textbf{Supervised Contrastive Prototype Learning}$ (SCPL).
We use an N-pair contrastive loss with prototypes of the same and opposite classes and replace the categorical classification head with a $\textbf{Prototype Classification Head}$ (PCH).
Our approach is $\textit{sample efficient}$, does not require $\textit{sample mining}$, and can be implemented on any existing DNN without modification to its architecture.
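The summary names two ingredients, a prototype classification head and an N-pair contrastive loss over same- and opposite-class prototypes, without giving their exact form. The sketch below shows one common way to realize a prototype head (cosine similarity to learned per-class prototypes) with a cross-entropy-over-similarities surrogate for the contrastive objective; treat both as assumptions rather than the paper's definitions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeClassificationHead(nn.Module):
    """Sketch of a prototype head: one learned prototype per class; logits are
    temperature-scaled cosine similarities between the feature and each prototype."""

    def __init__(self, feat_dim: int, num_classes: int, temperature: float = 0.1):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.temperature = temperature

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        feats = F.normalize(feats, dim=-1)
        protos = F.normalize(self.prototypes, dim=-1)
        return feats @ protos.t() / self.temperature  # [B, num_classes]

def prototype_contrastive_loss(feats, labels, head):
    """Surrogate for the paper's N-pair loss (assumption): pull each feature toward its
    class prototype and away from all other prototypes via softmax over similarities."""
    return F.cross_entropy(head(feats), labels)

# Hypothetical usage with any feature extractor's output.
head = PrototypeClassificationHead(feat_dim=128, num_classes=10)
feats, labels = torch.randn(16, 128), torch.randint(0, 10, (16,))
loss = prototype_contrastive_loss(feats, labels, head)
loss.backward()
```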
arXiv Detail & Related papers (2022-11-26T01:17:15Z) - Residual Learning of Neural Text Generation with $n$-gram Language Model [41.26228768053928]
We learn a neural LM that fits the residual between an $n$-gram LM and the real-data distribution.
Our approach consistently attains additional performance gains over popular standalone neural models.
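The summary says the neural LM is trained to fit the residual between an $n$-gram LM and the real-data distribution. One plausible realization (an assumption consistent with that description, not necessarily the paper's exact parameterization) is to add the fixed $n$-gram log-probabilities to the neural logits and train only the neural part:

```python
import torch
import torch.nn.functional as F

def residual_lm_logits(neural_logits: torch.Tensor,
                       ngram_probs: torch.Tensor,
                       eps: float = 1e-8) -> torch.Tensor:
    """Final distribution = softmax(neural_logits + log p_ngram), so the neural
    logits only need to model what the fixed n-gram model gets wrong."""
    return neural_logits + torch.log(ngram_probs + eps)

# Toy next-token step: the n-gram model is fixed, only the neural part receives gradients.
vocab_size = 1000
neural_logits = torch.randn(4, vocab_size, requires_grad=True)   # stand-in for a neural LM's output
ngram_probs = torch.full((4, vocab_size), 1.0 / vocab_size)      # stand-in n-gram distribution
targets = torch.randint(0, vocab_size, (4,))
loss = F.cross_entropy(residual_lm_logits(neural_logits, ngram_probs), targets)
loss.backward()
```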
arXiv Detail & Related papers (2022-10-26T02:42:53Z) - Towards Alternative Techniques for Improving Adversarial Robustness:
Analysis of Adversarial Training at a Spectrum of Perturbations [5.18694590238069]
Adversarial training (AT) and its variants have spearheaded progress in improving neural network robustness to adversarial perturbations.
We focus on models trained on a spectrum of $\epsilon$ values.
We identify alternative improvements to AT that otherwise wouldn't have been apparent at a single $\epsilon$.
arXiv Detail & Related papers (2022-06-13T22:01:21Z) - On the Power of Multitask Representation Learning in Linear MDP [61.58929164172968]
This paper presents analyses for the statistical benefit of multitask representation learning in linear Markov Decision Process (MDP)
We first discover a \emph{Least-Activated-Feature-Abundance} (LAFA) criterion, denoted as $\kappa$, with which we prove that a straightforward least-square algorithm learns a policy that is $\tilde{O}\big(H^2\sqrt{\frac{\kappa\,\mathcal{C}(\Phi)^2 \kappa d}{NT}+\frac{\kappa d}{n}}\big)$ sub-optimal.
arXiv Detail & Related papers (2021-06-15T11:21:06Z) - On the Theory of Transfer Learning: The Importance of Task Diversity [114.656572506859]
We consider $t+1$ tasks parameterized by functions of the form $f_j \circ h$ in a general function class $\mathcal{F} \circ \mathcal{H}$.
We show that for diverse training tasks the sample complexity needed to learn the shared representation across the first $t$ training tasks scales as $C(\mathcal{H}) + t C(\mathcal{F})$.
arXiv Detail & Related papers (2020-06-20T20:33:59Z) - Few-Shot Learning via Learning the Representation, Provably [115.7367053639605]
This paper studies few-shot learning via representation learning.
One uses $T$ source tasks with $n_1$ samples per task to learn a representation in order to reduce the sample complexity of a target task.
arXiv Detail & Related papers (2020-02-21T17:30:00Z) - LAVA NAT: A Non-Autoregressive Translation Model with Look-Around
Decoding and Vocabulary Attention [54.18121922040521]
Non-autoregressive translation (NAT) models generate multiple tokens in one forward pass.
These NAT models often suffer from the multimodality problem, generating duplicated tokens or missing tokens.
We propose two novel methods to address this issue, the Look-Around (LA) strategy and the Vocabulary Attention (VA) mechanism.
arXiv Detail & Related papers (2020-02-08T04:11:03Z)