LEA: Improving Sentence Similarity Robustness to Typos Using Lexical
Attention Bias
- URL: http://arxiv.org/abs/2307.02912v1
- Date: Thu, 6 Jul 2023 10:53:50 GMT
- Title: LEA: Improving Sentence Similarity Robustness to Typos Using Lexical
Attention Bias
- Authors: Mario Almagro, Emilio Almazán, Diego Ortego, David Jiménez
- Abstract summary: Textual noise, such as typos or abbreviations, penalizes vanilla Transformers for most downstream tasks.
We show that this is also the case for sentence similarity, a fundamental task in multiple domains.
We propose to tackle textual noise by equipping cross-encoders with a novel LExical-aware Attention module.
- Score: 3.48350302245205
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Textual noise, such as typos or abbreviations, is a well-known issue that
penalizes vanilla Transformers on most downstream tasks. We show that this is
also the case for sentence similarity, a fundamental task in multiple domains,
e.g., matching, retrieval, or paraphrasing. Sentence similarity can be approached
using cross-encoders, where the two sentences are concatenated in the input,
allowing the model to exploit the inter-relations between them. Previous works
addressing the noise issue mainly rely on data augmentation strategies, showing
improved robustness when dealing with corrupted samples that are similar to the
ones used for training. However, all these methods still suffer from the token
distribution shift induced by typos. In this work, we propose to tackle textual
noise by equipping cross-encoders with a novel LExical-aware Attention module
(LEA) that incorporates lexical similarities between words in both sentences.
By using raw text similarities, our approach avoids the tokenization shift
problem, thereby improving robustness. We demonstrate that the attention bias
introduced by LEA helps cross-encoders tackle complex scenarios with textual
noise, especially in domains with short-text descriptions and limited context.
Experiments using three popular Transformer encoders on five e-commerce
datasets for product matching show that LEA consistently boosts performance
in the presence of noise, while remaining competitive on the original
(clean) splits. We also evaluate our approach on two datasets for textual
entailment and paraphrasing, showing that LEA is robust to typos in domains
with longer sentences and more natural context. Additionally, we thoroughly
analyze several design choices in our approach, providing insights into the
impact of these decisions and fostering future research on cross-encoders
dealing with typos.
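The core mechanism admits a compact sketch: compute a cross-sentence lexical-similarity matrix from the raw text and add it as a bias to the attention logits. In the sketch below, the character n-gram Jaccard similarity, the `scale` weight, and the tensor shapes are illustrative assumptions; the paper's exact similarity function and how it is integrated into the cross-encoder may differ.

```python
import torch
import torch.nn.functional as F


def char_ngram_similarity(w1: str, w2: str, n: int = 3) -> float:
    """Jaccard overlap of character n-grams; an assumed stand-in for
    LEA's raw-text lexical similarity."""
    grams = lambda w: {w[i:i + n] for i in range(max(1, len(w) - n + 1))}
    a, b = grams(w1.lower()), grams(w2.lower())
    return len(a & b) / len(a | b) if a | b else 0.0


def lexical_bias_attention(q, k, v, words_q, words_k, scale=1.0):
    """Scaled dot-product attention plus an additive lexical bias.
    q: (len_q, d), k/v: (len_k, d); words_* are the raw tokens.
    `scale` is a hypothetical hyper-parameter weighting the bias."""
    logits = q @ k.T / q.size(-1) ** 0.5          # standard attention logits
    bias = torch.tensor([[char_ngram_similarity(a, b) for b in words_k]
                         for a in words_q])       # raw-text similarities
    return F.softmax(logits + scale * bias, dim=-1) @ v


# toy check: the typo "shirtt" still receives a strong lexical bias,
# even though its subword tokenization differs from "shirt"
words_a = ["red", "cotton", "shirtt"]
words_b = ["crimson", "cotton", "shirt"]
q, k, v = (torch.randn(3, 8) for _ in range(3))
print(lexical_bias_attention(q, k, v, words_a, words_b).shape)  # (3, 8)
```

Because the bias is computed on raw text rather than token ids, a typo that changes the tokenization still yields a high similarity to its clean counterpart, which is the robustness argument the abstract makes.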
Related papers
- DenoSent: A Denoising Objective for Self-Supervised Sentence
Representation Learning [59.4644086610381]
We propose a novel denoising objective that works from a different, intra-sentence perspective.
By introducing both discrete and continuous noise, we generate noisy sentences and then train our model to restore them to their original form.
Our empirical evaluations demonstrate that this approach delivers competitive results on both semantic textual similarity (STS) and a wide range of transfer tasks.
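As a sketch of the two noise types (the exact corruptions are assumptions, not necessarily DenoSent's): discrete noise can be token deletions and swaps, while continuous noise can be Gaussian perturbation of the embeddings.

```python
import random
import torch


def discrete_noise(tokens, p_drop=0.1, p_swap=0.1):
    """Corrupt a token sequence with random deletions and adjacent
    swaps (assumed forms of the paper's discrete noise)."""
    out = [t for t in tokens if random.random() > p_drop]
    for i in range(len(out) - 1):
        if random.random() < p_swap:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out


def continuous_noise(embeddings: torch.Tensor, sigma=0.05):
    """Add Gaussian noise to token embeddings (assumed form of the
    continuous noise)."""
    return embeddings + sigma * torch.randn_like(embeddings)


# the denoising objective then trains the model to reconstruct the
# clean sentence from its corrupted version
tokens = "the cat sat on the mat".split()
print(discrete_noise(tokens))
```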
arXiv Detail & Related papers (2024-01-24T17:48:45Z) - SenTest: Evaluating Robustness of Sentence Encoders [0.4194295877935868]
This work focuses on evaluating the robustness of the sentence encoders.
We employ several adversarial attacks to evaluate their robustness.
The experimental results strongly call the robustness of sentence encoders into question.
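A minimal robustness probe in this spirit, assuming a `sentence-transformers` encoder (not necessarily the models or attacks used in SenTest): inject character-level typos and measure how far the sentence embedding drifts.

```python
import random
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed encoder


def typo(sentence: str, n_edits: int = 2) -> str:
    """Inject random character deletions as a crude adversarial attack."""
    chars = list(sentence)
    for _ in range(n_edits):
        if len(chars) > 1:
            chars.pop(random.randrange(len(chars)))
    return "".join(chars)


model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works
clean = "the quick brown fox jumps over the lazy dog"
noisy = typo(clean)
e1, e2 = model.encode([clean, noisy])
cos = float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))
print(f"cosine(clean, noisy) = {cos:.3f}")  # a large drop indicates fragility
```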
arXiv Detail & Related papers (2023-11-29T15:21:35Z) - RegaVAE: A Retrieval-Augmented Gaussian Mixture Variational Auto-Encoder
for Language Modeling [79.56442336234221]
We introduce RegaVAE, a retrieval-augmented language model built upon the variational auto-encoder (VAE).
It encodes the text corpus into a latent space, capturing current and future information from both source and target text.
Experimental results on various datasets demonstrate significant improvements in text generation quality and hallucination removal.
arXiv Detail & Related papers (2023-10-16T16:42:01Z) - Improving the Robustness of Summarization Systems with Dual Augmentation [68.53139002203118]
A robust summarization system should be able to capture the gist of the document, regardless of the specific word choices or noise in the input.
We first explore the summarization models' robustness against perturbations including word-level synonym substitution and noise.
We propose SummAttacker, an efficient approach to generating adversarial samples based on language models.
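A plausible sketch of LM-based adversarial generation, assuming a masked-language-model substitution (the paper's exact procedure may differ): mask a word and take the model's top alternative as the perturbed input.

```python
from transformers import pipeline  # assumed MLM-based substitution

fill = pipeline("fill-mask", model="bert-base-uncased")


def mlm_substitute(sentence: str, target: str) -> str:
    """Replace `target` with the MLM's top alternative; a crude
    stand-in for LM-guided adversarial word substitution."""
    masked = sentence.replace(target, fill.tokenizer.mask_token, 1)
    for cand in fill(masked):
        # skip the original word so the substitution actually perturbs
        if cand["token_str"].strip().lower() != target.lower():
            return cand["sequence"]
    return sentence


print(mlm_substitute("the storm caused severe damage to the coast", "severe"))
```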
arXiv Detail & Related papers (2023-06-01T19:04:17Z) - On the Robustness of Text Vectorizers [9.904746542801838]
In natural language processing, models typically contain a first embedding layer, transforming a sequence of tokens into vector representations.
While the robustness with respect to changes of continuous inputs is well-understood, the situation is less clear when considering discrete changes.
Our work formally proves that popular embedding schemes, such as concatenation, TF-IDF, and Paragraph Vector (a.k.a. doc2vec), exhibit robustness in the Hölder or Lipschitz sense with respect to the Hamming distance.
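The claim is that the output distance is controlled by a power of the input Hamming distance, i.e. ||f(x) - f(y)|| <= C * d_H(x, y)^alpha. A quick empirical illustration for TF-IDF (the constants C and alpha below are illustrative, not the paper's):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

x = "the cat sat on the mat".split()
y = "the cat sat on the rug".split()          # Hamming distance 1
d_hamming = sum(a != b for a, b in zip(x, y))

vec = TfidfVectorizer().fit([" ".join(x), " ".join(y)])
fx, fy = vec.transform([" ".join(x), " ".join(y)]).toarray()
lhs = np.linalg.norm(fx - fy)                  # embedding-space distance

C, alpha = 2.0, 1.0                            # illustrative constants
print(lhs <= C * d_hamming ** alpha)           # Hölder-type bound check
```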
arXiv Detail & Related papers (2023-03-09T16:37:37Z) - Non-Linguistic Supervision for Contrastive Learning of Sentence
Embeddings [14.244787327283335]
We find that the performance of Transformer models as sentence encoders can be improved by training with multi-modal multi-task losses.
The reliance of our framework on unpaired non-linguistic data makes it language-agnostic, enabling it to be widely applicable beyond English NLP.
arXiv Detail & Related papers (2022-09-20T03:01:45Z) - Unsupervised Mismatch Localization in Cross-Modal Sequential Data [5.932046800902776]
We develop an unsupervised learning algorithm that can infer the relationship between content-mismatched cross-modal data.
We propose a hierarchical Bayesian deep learning model, named mismatch localization variational autoencoder (ML-VAE), that decomposes the generative process of the speech into hierarchically structured latent variables.
Our experimental results show that ML-VAE successfully locates the mismatch between text and speech, without the need for human annotations.
arXiv Detail & Related papers (2022-05-05T14:23:27Z) - Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common subsequence as neighboring words and use masked language modeling (MLM) to predict the distributions at their positions.
Experiments on Semantic Textual Similarity show the resulting neighboring distribution divergence (NDD) to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
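A sketch of the mask-and-predict step, assuming BERT as the MLM and KL divergence as the distance (the paper's exact divergence may differ): mask the same shared word in both texts and compare the predicted distributions at that position.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")


def masked_log_probs(sentence: str, word: str) -> torch.Tensor:
    """MLM log-distribution at the position of `word`, masked out."""
    inputs = tok(sentence.replace(word, tok.mask_token, 1), return_tensors="pt")
    pos = (inputs.input_ids == tok.mask_token_id).nonzero()[0, 1]
    return torch.log_softmax(mlm(**inputs).logits[0, pos], dim=-1)


# same shared word, two highly overlapped texts
p = masked_log_probs("he fixed the broken window", "window")
q = masked_log_probs("he replaced the broken window", "window")
kl = torch.sum(p.exp() * (p - q))   # divergence as a semantic-distance signal
print(float(kl))
```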
arXiv Detail & Related papers (2021-10-04T03:59:15Z) - Narrative Incoherence Detection [76.43894977558811]
We propose the task of narrative incoherence detection as a new arena for inter-sentential semantic understanding.
Given a multi-sentence narrative, the task is to decide whether there exist any semantic discrepancies in the narrative flow.
arXiv Detail & Related papers (2020-12-21T07:18:08Z) - Rethinking Positional Encoding in Language Pre-training [111.2320727291926]
We show that in absolute positional encoding, the addition operation applied to positional embeddings and word embeddings brings mixed correlations.
We propose a new positional encoding method called Transformer with Untied Positional Encoding (TUPE).
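The untied scheme keeps word and position correlations separate: instead of adding positional embeddings to word embeddings before the query/key projections, it computes the two logit terms with separate projections and sums them. A single-head sketch (shapes and initialization are illustrative):

```python
import torch

d, seq = 64, 10
x = torch.randn(seq, d)                      # word embeddings, no positions added
p = torch.randn(seq, d)                      # absolute positional embeddings

Wq, Wk = torch.randn(d, d), torch.randn(d, d)    # word projections
Uq, Uk = torch.randn(d, d), torch.randn(d, d)    # separate position projections

word_logits = (x @ Wq) @ (x @ Wk).T          # word-to-word correlations
pos_logits = (p @ Uq) @ (p @ Uk).T           # position-to-position correlations
attn = torch.softmax((word_logits + pos_logits) / (2 * d) ** 0.5, dim=-1)
print(attn.shape)  # (seq, seq): the two terms never mix before the softmax
```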
arXiv Detail & Related papers (2020-06-28T13:11:02Z)