On the Robustness of Text Vectorizers
- URL: http://arxiv.org/abs/2303.07203v2
- Date: Mon, 12 Jun 2023 12:55:48 GMT
- Title: On the Robustness of Text Vectorizers
- Authors: Rémi Catellier, Samuel Vaiter, Damien Garreau
- Abstract summary: In natural language processing, models typically contain a first embedding layer, transforming a sequence of tokens into vector representations.
While the robustness with respect to changes of continuous inputs is well-understood, the situation is less clear when considering discrete changes.
Our work formally proves that popular embedding schemes, such as concatenation, TF-IDF, and Paragraph Vector (a.k.a. doc2vec), exhibit robustness in the Hölder or Lipschitz sense with respect to the Hamming distance.
- Score: 9.904746542801838
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A fundamental issue in machine learning is the robustness of the model with
respect to changes in the input. In natural language processing, models
typically contain a first embedding layer, transforming a sequence of tokens
into vector representations. While the robustness with respect to changes of
continuous inputs is well-understood, the situation is less clear when
considering discrete changes, for instance replacing a word by another in an
input sentence. Our work formally proves that popular embedding schemes, such
as concatenation, TF-IDF, and Paragraph Vector (a.k.a. doc2vec), exhibit
robustness in the Hölder or Lipschitz sense with respect to the Hamming
distance. We provide quantitative bounds for these schemes and demonstrate how
the constants involved are affected by the length of the document. These
findings are exemplified through a series of numerical examples.
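The Hamming-distance robustness claimed for TF-IDF can be illustrated numerically. The sketch below is not the paper's code: the toy corpus, and the particular tf (relative frequency) and idf (log of inverse document frequency) normalizations, are assumptions. It embeds a document and a one-word substitution of it (Hamming distance 1) and compares the Euclidean distance between their TF-IDF vectors.

```python
import math
from collections import Counter

def make_tfidf(corpus):
    """Fit idf weights on a fixed corpus and return an embedding function.

    tf = relative frequency in the document, idf = log(n_docs / doc_freq);
    these are one common choice of normalizations, used here for illustration.
    """
    vocab = sorted({w for doc in corpus for w in doc})
    n = len(corpus)
    idf = {w: math.log(n / sum(1 for doc in corpus if w in doc)) for w in vocab}

    def embed(doc):
        counts = Counter(doc)
        return [counts[w] / len(doc) * idf[w] for w in vocab]

    return embed

def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a bird flew over the mat".split(),
]
embed = make_tfidf(corpus)

# Replace one word: the perturbed document is at Hamming distance 1.
perturbed = list(corpus[0])
perturbed[1] = "dog"

# Only the "cat" and "dog" coordinates change, each by idf(w) / len(doc),
# so the embedding moves by O(1 / len(doc)) -- the kind of length-dependent
# Lipschitz constant the paper quantifies.
print(l2(embed(corpus[0]), embed(perturbed)))
```

Note how the bound shrinks with document length: the same single-word substitution in a longer document moves the TF-IDF vector less.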
Related papers
- Autoregressive Speech Synthesis without Vector Quantization [135.4776759536272]
We present MELLE, a novel continuous-valued token-based language modeling approach for text-to-speech synthesis (TTS).
MELLE autoregressively generates continuous mel-spectrogram frames directly from text condition.
arXiv Detail & Related papers (2024-07-11T14:36:53Z)
- On Adversarial Examples for Text Classification by Perturbing Latent Representations [0.0]
We show that deep learning is vulnerable to adversarial examples in text classification.
This weakness indicates that deep learning is not very robust.
We create a framework that measures the robustness of a text classifier by using the gradients of the classifier.
arXiv Detail & Related papers (2024-05-06T18:45:18Z)
- ReAGent: A Model-agnostic Feature Attribution Method for Generative Language Models [4.015810081063028]
Feature attribution methods (FAs) are employed to derive the importance of all input features to the model predictions.
It is unknown if it is faithful to use these FAs for decoder-only models on text generation.
We present a model-agnostic FA for generative LMs called Recursive Attribution Generator (ReAGent).
arXiv Detail & Related papers (2024-02-01T17:25:51Z)
- LEA: Improving Sentence Similarity Robustness to Typos Using Lexical Attention Bias [3.48350302245205]
Textual noise, such as typos or abbreviations, penalizes vanilla Transformers for most downstream tasks.
We show that this is also the case for sentence similarity, a fundamental task in multiple domains.
We propose to tackle textual noise by equipping cross-encoders with a novel LExical-aware Attention module.
arXiv Detail & Related papers (2023-07-06T10:53:50Z)
- Compositional Generalization without Trees using Multiset Tagging and Latent Permutations [121.37328648951993]
We phrase semantic parsing as a two-step process: we first tag each input token with a multiset of output tokens.
Then we arrange the tokens into an output sequence using a new way of parameterizing and predicting permutations.
Our model outperforms pretrained seq2seq models and prior work on realistic semantic parsing tasks.
arXiv Detail & Related papers (2023-05-26T14:09:35Z)
- Sentence Embedding Leaks More Information than You Expect: Generative Embedding Inversion Attack to Recover the Whole Sentence [37.63047048491312]
We propose a generative embedding inversion attack (GEIA) that aims to reconstruct input sequences based only on their sentence embeddings.
Given the black-box access to a language model, we treat sentence embeddings as initial tokens' representations and train or fine-tune a powerful decoder model to decode the whole sequences directly.
arXiv Detail & Related papers (2023-05-04T17:31:41Z)
- Same or Different? Diff-Vectors for Authorship Analysis [78.83284164605473]
In "classic" authorship analysis a feature vector represents a document, the value of a feature represents (an increasing function of) the relative frequency of the feature in the document, and the class label represents the author of the document.
Our experiments tackle same-author verification, authorship verification, and closed-set authorship attribution; while DVs are naturally geared for solving the 1st, we also provide two novel methods for solving the 2nd and 3rd.
arXiv Detail & Related papers (2023-01-24T08:48:12Z)
- Bypass Network for Semantics Driven Image Paragraph Captioning [12.743882133781602]
Image paragraph captioning aims to describe a given image with a sequence of coherent sentences.
Most existing methods model the coherence through the topic transition that dynamically infers a topic vector from preceding sentences.
We propose a bypass network that separately models semantics and linguistic syntax of preceding sentences.
arXiv Detail & Related papers (2022-06-21T00:48:22Z)
- Quark: Controllable Text Generation with Reinforced Unlearning [68.07749519374089]
Large-scale language models often learn behaviors that are misaligned with user expectations.
We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property.
For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods.
arXiv Detail & Related papers (2022-05-26T21:11:51Z)
- Unnatural Language Inference [48.45003475966808]
We find that state-of-the-art NLI models, such as RoBERTa and BART, are invariant to, and sometimes even perform better on, examples with randomly reordered words.
Our findings call into question the idea that our natural language understanding models, and the tasks used for measuring their progress, genuinely require a human-like understanding of syntax.
arXiv Detail & Related papers (2020-12-30T20:40:48Z)
- Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
VAEs tend to ignore latent variables with a strong auto-regressive decoder.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.