On the Robustness of Text Vectorizers
- URL: http://arxiv.org/abs/2303.07203v2
- Date: Mon, 12 Jun 2023 12:55:48 GMT
- Title: On the Robustness of Text Vectorizers
- Authors: Rémi Catellier, Samuel Vaiter, Damien Garreau
- Abstract summary: In natural language processing, models typically contain a first embedding layer, transforming a sequence of tokens into vector representations.
While the robustness with respect to changes of continuous inputs is well-understood, the situation is less clear when considering discrete changes.
Our work formally proves that popular embedding schemes, such as concatenation, TF-IDF, and Paragraph Vector (a.k.a. doc2vec), exhibit robustness in the Hölder or Lipschitz sense with respect to the Hamming distance.
- Score: 9.904746542801838
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A fundamental issue in machine learning is the robustness of the model with
respect to changes in the input. In natural language processing, models
typically contain a first embedding layer, transforming a sequence of tokens
into vector representations. While the robustness with respect to changes of
continuous inputs is well-understood, the situation is less clear when
considering discrete changes, for instance replacing a word by another in an
input sentence. Our work formally proves that popular embedding schemes, such
as concatenation, TF-IDF, and Paragraph Vector (a.k.a. doc2vec), exhibit
robustness in the Hölder or Lipschitz sense with respect to the Hamming
distance. We provide quantitative bounds for these schemes and demonstrate how
the constants involved are affected by the length of the document. These
findings are exemplified through a series of numerical examples.
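The Hamming-distance robustness claimed for TF-IDF can be illustrated numerically. The sketch below is not the paper's code: the toy corpus, and the particular tf (relative frequency) and idf (log of inverse document frequency) normalizations, are assumptions. It embeds a document and a one-word substitution of it (Hamming distance 1) and compares the Euclidean distance between their TF-IDF vectors.

```python
import math
from collections import Counter

def make_tfidf(corpus):
    """Fit idf weights on a fixed corpus and return an embedding function.

    tf = relative frequency in the document, idf = log(n_docs / doc_freq);
    these are one common choice of normalizations, used here for illustration.
    """
    vocab = sorted({w for doc in corpus for w in doc})
    n = len(corpus)
    idf = {w: math.log(n / sum(1 for doc in corpus if w in doc)) for w in vocab}

    def embed(doc):
        counts = Counter(doc)
        return [counts[w] / len(doc) * idf[w] for w in vocab]

    return embed

def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a bird flew over the mat".split(),
]
embed = make_tfidf(corpus)

# Replace one word: the perturbed document is at Hamming distance 1.
perturbed = list(corpus[0])
perturbed[1] = "dog"

# Only the "cat" and "dog" coordinates change, each by idf(w) / len(doc),
# so the embedding moves by O(1 / len(doc)) -- the kind of length-dependent
# Lipschitz constant the paper quantifies.
print(l2(embed(corpus[0]), embed(perturbed)))
```

Note how the bound shrinks with document length: the same single-word substitution in a longer document moves the TF-IDF vector less.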
Related papers
- Autoregressive Speech Synthesis without Vector Quantization [135.4776759536272]
We present MELLE, a novel continuous-valued token-based language modeling approach for text-to-speech synthesis (TTS).
MELLE autoregressively generates continuous mel-spectrogram frames directly from text condition.
arXiv Detail & Related papers (2024-07-11T14:36:53Z)
- On Adversarial Examples for Text Classification by Perturbing Latent Representations [0.0]
We show that deep learning is vulnerable to adversarial examples in text classification.
This weakness indicates that deep learning is not very robust.
We create a framework that measures the robustness of a text classifier by using the gradients of the classifier.
arXiv Detail & Related papers (2024-05-06T18:45:18Z)
- ReAGent: A Model-agnostic Feature Attribution Method for Generative Language Models [4.015810081063028]
Feature attribution methods (FAs) are employed to derive the importance of all input features to the model predictions.
It is unknown if it is faithful to use these FAs for decoder-only models on text generation.
We present a model-agnostic FA for generative LMs called Recursive Attribution Generator (ReAGent).
arXiv Detail & Related papers (2024-02-01T17:25:51Z)
- LEA: Improving Sentence Similarity Robustness to Typos Using Lexical Attention Bias [3.48350302245205]
Textual noise, such as typos or abbreviations, penalizes vanilla Transformers for most downstream tasks.
We show that this is also the case for sentence similarity, a fundamental task in multiple domains.
We propose to tackle textual noise by equipping cross-encoders with a novel LExical-aware Attention module.
arXiv Detail & Related papers (2023-07-06T10:53:50Z)
- Compositional Generalization without Trees using Multiset Tagging and Latent Permutations [121.37328648951993]
We phrase semantic parsing as a two-step process: we first tag each input token with a multiset of output tokens.
Then we arrange the tokens into an output sequence using a new way of parameterizing and predicting permutations.
Our model outperforms pretrained seq2seq models and prior work on realistic semantic parsing tasks.
arXiv Detail & Related papers (2023-05-26T14:09:35Z)
- Sentence Embedding Leaks More Information than You Expect: Generative Embedding Inversion Attack to Recover the Whole Sentence [37.63047048491312]
We propose a generative embedding inversion attack (GEIA) that aims to reconstruct input sequences based only on their sentence embeddings.
Given the black-box access to a language model, we treat sentence embeddings as initial tokens' representations and train or fine-tune a powerful decoder model to decode the whole sequences directly.
arXiv Detail & Related papers (2023-05-04T17:31:41Z)
- Same or Different? Diff-Vectors for Authorship Analysis [78.83284164605473]
In "classic" authorship analysis a feature vector represents a document, the value of a feature represents (an increasing function of) the relative frequency of the feature in the document, and the class label represents the author of the document.
Our experiments tackle same-author verification, authorship verification, and closed-set authorship attribution; while DVs are naturally geared for solving the 1st, we also provide two novel methods for solving the 2nd and 3rd.
arXiv Detail & Related papers (2023-01-24T08:48:12Z)
- Bypass Network for Semantics Driven Image Paragraph Captioning [12.743882133781602]
Image paragraph captioning aims to describe a given image with a sequence of coherent sentences.
Most existing methods model the coherence through the topic transition that dynamically infers a topic vector from preceding sentences.
We propose a bypass network that separately models semantics and linguistic syntax of preceding sentences.
arXiv Detail & Related papers (2022-06-21T00:48:22Z)
- Quark: Controllable Text Generation with Reinforced Unlearning [68.07749519374089]
Large-scale language models often learn behaviors that are misaligned with user expectations.
We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property.
For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods.
arXiv Detail & Related papers (2022-05-26T21:11:51Z)
- Unnatural Language Inference [48.45003475966808]
We find that state-of-the-art NLI models, such as RoBERTa and BART, are invariant to, and sometimes even perform better on, examples with randomly reordered words.
Our findings call into question the idea that our natural language understanding models, and the tasks used for measuring their progress, genuinely require a human-like understanding of syntax.
arXiv Detail & Related papers (2020-12-30T20:40:48Z)
- Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
VAEs tend to ignore latent variables with a strong auto-regressive decoder.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.