Quantifying Positional Biases in Text Embedding Models
- URL: http://arxiv.org/abs/2412.15241v3
- Date: Wed, 01 Jan 2025 18:06:08 GMT
- Title: Quantifying Positional Biases in Text Embedding Models
- Authors: Samarth Goel, Reagan J. Lee, Kannan Ramchandran
- Abstract summary: We investigate the impact of content position and input size on text embeddings.
Our experiments reveal that embedding models, irrespective of their positional encoding mechanisms, disproportionately prioritize the beginning of an input.
- Score: 9.735115681462707
- Abstract: Embedding models are crucial for tasks in Information Retrieval (IR) and semantic similarity measurement, yet their handling of longer texts and the associated positional biases remains underexplored. In this study, we investigate the impact of content position and input size on text embeddings. Our experiments reveal that embedding models, irrespective of their positional encoding mechanisms, disproportionately prioritize the beginning of an input. Ablation studies demonstrate that inserting irrelevant text at, or removing text from, the start of a document reduces cosine similarity between the altered and original embeddings by up to 12.3% more than the same ablations at the end. Regression analysis further confirms this bias, with sentence importance declining as position moves further from the start, even under content-agnostic ablations. We hypothesize that this effect arises from pre-processing strategies and the chosen positional encoding techniques. These findings quantify the sensitivity of retrieval systems and suggest a new lens on embedding model robustness.
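To make the two analyses concrete, here is a minimal sketch of the ablation and regression setups, assuming a generic sentence-transformers model; the model name (`all-MiniLM-L6-v2`), the filler text, and the synthetic document are illustrative stand-ins, not the paper's actual experimental configuration.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in model, not the paper's

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic document of 40 short sentences (illustrative only).
sentences = [f"Sentence number {i} carries some of the document content." for i in range(40)]
document = " ".join(sentences)
original = model.encode(document)

# Insertion ablation: irrelevant filler at the start vs. at the end.
filler = "Lorem ipsum dolor sit amet. " * 5
sim_start = cosine(original, model.encode(filler + document))
sim_end = cosine(original, model.encode(document + filler))
print(f"similarity after insert at start: {sim_start:.4f}, at end: {sim_end:.4f}")
# The paper reports start-of-input ablations reduce similarity by up to
# 12.3% more than the same ablations at the end.

# Regression sketch: remove one sentence at a time and regress the
# similarity drop on the removed sentence's position.
drops = []
for i in range(len(sentences)):
    ablated = " ".join(sentences[:i] + sentences[i + 1:])
    drops.append(1.0 - cosine(original, model.encode(ablated)))
slope = np.polyfit(range(len(sentences)), drops, 1)[0]
print(f"slope of similarity drop vs. position: {slope:.6f}")
# A negative slope means early sentences matter more to the embedding,
# matching the positional bias reported above.
```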
Related papers
- Downstream-Pretext Domain Knowledge Traceback for Active Learning [138.02530777915362]
We propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance.
DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator.
Experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-20T01:34:13Z)
- Towards Understanding Domain Adapted Sentence Embeddings for Document Retrieval [11.695672855244744]
We domain adapt embeddings using telecom, health and science datasets for question answering.
We establish a systematic method to obtain thresholds for similarity scores for different embeddings.
We show that embeddings for domain-specific sentences have little overlap with those for domain-agnostic ones.
arXiv Detail & Related papers (2024-06-18T07:03:34Z)
- Exploiting Positional Bias for Query-Agnostic Generative Content in Search [24.600506147325717]
We show that non-relevant text can be injected into a document without adversely affecting its position in search results.
We find that contextualising the non-relevant text further reduces negative effects whilst likely circumventing existing content-filtering mechanisms.
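As an illustration of the injection test described above, here is a minimal sketch assuming a bi-encoder retriever; the model, query, documents, and injected text are all invented for the example and are not the paper's setup.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed retrieval model

query = "effects of caffeine on sleep quality"
docs = [
    "Caffeine consumed late in the day measurably reduces sleep quality.",
    "Moderate exercise improves cardiovascular health in adults.",
    "Blue light exposure in the evening delays sleep onset.",
]
injection = "Visit our store for great deals on mattresses today."  # non-relevant text

def rank_of(candidate_docs, target_idx):
    # Rank documents by cosine similarity to the query; return target's rank.
    q = model.encode(query)
    d = model.encode(candidate_docs)
    scores = d @ q / (np.linalg.norm(d, axis=1) * np.linalg.norm(q))
    order = np.argsort(-scores)
    return int(np.where(order == target_idx)[0][0]) + 1

print("rank before injection:", rank_of(docs, 0))
attacked = docs.copy()
attacked[0] = docs[0] + " " + injection  # append at the end, where bias is weakest
print("rank after injection:", rank_of(attacked, 0))
```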
arXiv Detail & Related papers (2024-05-01T12:12:59Z)
- Text Attribute Control via Closed-Loop Disentanglement [72.2786244367634]
We propose a novel approach to achieve a robust control of attributes while enhancing content preservation.
In this paper, we use a semi-supervised contrastive learning method to encourage the disentanglement of attributes in latent spaces.
We conducted experiments on three text datasets, including the Yelp Service review dataset, the Amazon Product review dataset, and the GoEmotions dataset.
arXiv Detail & Related papers (2023-12-01T01:26:38Z)
- Towards Debiasing Frame Length Bias in Text-Video Retrieval via Causal Intervention [72.12974259966592]
We present a unique and systematic study of a temporal bias due to frame length discrepancy between training and test sets of trimmed video clips.
We propose a causal debiasing approach and perform extensive experiments and ablation studies on the Epic-Kitchens-100, YouCook2, and MSR-VTT datasets.
arXiv Detail & Related papers (2023-09-17T15:58:27Z)
- Debiasing Stance Detection Models with Counterfactual Reasoning and Adversarial Bias Learning [15.68462203989933]
Stance detection models tend to rely on dataset bias in the text part as a shortcut.
We propose an adversarial bias learning module to model the bias more accurately.
arXiv Detail & Related papers (2022-12-20T16:20:56Z)
- AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can detect oversensitivity- and overstability-causing samples with high accuracy.
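One way to see the bag-of-words claim is to probe a scorer for invariance under word-order shuffling. The sketch below is hypothetical: `bow_scorer` is a deliberately order-insensitive stand-in, not the authors' model, used only to show what such a probe detects.

```python
import random
from collections import Counter

def bow_scorer(essay: str) -> float:
    # Stand-in scorer: counts "academic" vocabulary and ignores word order,
    # illustrating why shuffling leaves a bag-of-words score unchanged.
    vocab = {"therefore", "however", "evidence", "analysis", "conclusion"}
    words = essay.lower().split()
    counts = Counter(words)
    return sum(counts[w] for w in vocab) / max(len(words), 1)

essay = "The evidence supports the analysis and therefore the conclusion follows"
shuffled_scores = []
for _ in range(5):
    words = essay.split()
    random.shuffle(words)
    shuffled_scores.append(bow_scorer(" ".join(words)))

# An order-insensitive (overstable) scorer shows zero variance under shuffling.
print("original:", bow_scorer(essay), "shuffled:", shuffled_scores)
```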
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
- The Sensitivity of Word Embeddings-based Author Detection Models to Semantic-preserving Adversarial Perturbations [3.7552532139404797]
Authorship analysis is an important subject in the field of natural language processing.
This paper explores the limitations and sensitivity of established approaches to adversarial manipulations of inputs.
arXiv Detail & Related papers (2021-02-23T19:55:45Z)
- Weakly-Supervised Aspect-Based Sentiment Analysis via Joint Aspect-Sentiment Topic Embedding [71.2260967797055]
We propose a weakly-supervised approach for aspect-based sentiment analysis.
We learn <sentiment, aspect> joint topic embeddings in the word embedding space.
We then use neural models to generalize the word-level discriminative information.
arXiv Detail & Related papers (2020-10-13T21:33:24Z)
- An Experimental Study of The Effects of Position Bias on Emotion Cause Extraction [8.43954669406248]
We show that a simple random selection approach toward Emotion Cause Extraction achieves similar performance compared to the baselines.
An imbalance of emotional cause location exists in the benchmark, with a majority of cause clauses immediately preceding the central emotion clause.
We conclude that it is this innate bias in the benchmark that drives the high accuracy of deep learning models on ECE.
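The reported imbalance implies a strong position-only heuristic. Below is a minimal sketch of such a baseline; the clause texts and the `emotion_idx` interface are invented for illustration and do not reflect any specific ECE benchmark format.

```python
def position_baseline(clauses, emotion_idx):
    # Position-only heuristic: predict the clause immediately preceding
    # the emotion clause as the cause, exploiting the dataset imbalance.
    return max(emotion_idx - 1, 0)

# Toy example: the cause (index 1) directly precedes the emotion clause (index 2).
clauses = [
    "Tom studied hard all semester",
    "he failed the final exam",
    "and he felt deeply disappointed",
]
print("predicted cause:", clauses[position_baseline(clauses, emotion_idx=2)])
```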
arXiv Detail & Related papers (2020-07-16T08:02:36Z)
- Rethinking Positional Encoding in Language Pre-training [111.2320727291926]
We show that in absolute positional encoding, the addition operation applied on positional embeddings and word embeddings brings mixed correlations.
We propose a new positional encoding method called Transformer with Untied Positional Encoding (TUPE).
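A minimal sketch of the untied attention computation at the heart of this idea, assuming a single head and omitting TUPE's separate [CLS] handling; the matrices and dimensions are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 8, 16

x = rng.normal(size=(seq_len, d))  # word embeddings
p = rng.normal(size=(seq_len, d))  # absolute positional embeddings

Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # content projections
Uq, Uk = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # separate position projections

# Untied attention logits: a content-content term plus a position-position
# term, instead of projecting the sum (x + p), which mixes the correlations.
logits = ((x @ Wq) @ (x @ Wk).T + (p @ Uq) @ (p @ Uk).T) / np.sqrt(2 * d)

attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)  # softmax over keys
print(attn.shape)  # (8, 8) attention matrix
```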
arXiv Detail & Related papers (2020-06-28T13:11:02Z)