The Medium Is Not the Message: Deconfounding Document Embeddings via Linear Concept Erasure
- URL: http://arxiv.org/abs/2507.01234v3
- Date: Wed, 24 Sep 2025 07:40:38 GMT
- Title: The Medium Is Not the Message: Deconfounding Document Embeddings via Linear Concept Erasure
- Authors: Yu Fan, Yang Tian, Shauli Ravfogel, Mrinmaya Sachan, Elliott Ash, Alexander Hoyle,
- Abstract summary: Embedding-based similarity metrics can be influenced by spurious attributes like the text's source or language.<n>This paper shows that a debiasing algorithm that removes information about observed confounders from the encoder representations substantially reduces these biases at a minimal computational cost.
- Score: 98.71456610527598
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Embedding-based similarity metrics between text sequences can be influenced not just by the content dimensions we most care about, but can also be biased by spurious attributes like the text's source or language. These document confounders cause problems for many applications, but especially those that need to pool texts from different corpora. This paper shows that a debiasing algorithm that removes information about observed confounders from the encoder representations substantially reduces these biases at a minimal computational cost. Document similarity and clustering metrics improve across every embedding variant and task we evaluate -- often dramatically. Interestingly, performance on out-of-distribution benchmarks is not impacted, indicating that the embeddings are not otherwise degraded.
Related papers
- SMILE: A Composite Lexical-Semantic Metric for Question-Answering Evaluation [55.26111461168754]
We introduce SMILE: Semantic Metric Integrating Lexical Exactness, a novel approach that combines sentence-level semantic understanding with keyword-level semantic understanding and easy keyword matching.<n>It is highly correlated with human judgments and computationally lightweight, bridging the gap between lexical and semantic evaluation.
arXiv Detail & Related papers (2025-11-21T17:30:18Z) - Quantifying Positional Biases in Text Embedding Models [9.735115681462707]
We investigate the impact of content position and input size on text embeddings.<n>Our experiments reveal that embedding models, irrespective of their positional encoding mechanisms, disproportionately prioritize the beginning of an input.
arXiv Detail & Related papers (2024-12-13T09:52:25Z) - Using Similarity to Evaluate Factual Consistency in Summaries [2.7595794227140056]
Abstractive summarisers generate fluent summaries, but the factuality of the generated text is not guaranteed.
We propose a new zero-shot factuality evaluation metric, Sentence-BERTScore (SBERTScore), which compares sentences between the summary and the source document.
Our experiments indicate that each technique has different strengths, with SBERTScore particularly effective in identifying correct summaries.
arXiv Detail & Related papers (2024-09-23T15:02:38Z) - Evaluating Text Classification Robustness to Part-of-Speech Adversarial Examples [0.6445605125467574]
Adversarial examples are inputs that are designed to trick the decision making process, and are intended to be imperceptible to humans.
For text-based classification systems, changes to the input, a string of text, are always perceptible.
To improve the quality of text-based adversarial examples, we need to know what elements of the input text are worth focusing on.
arXiv Detail & Related papers (2024-08-15T18:33:54Z) - Revisiting Evaluation Metrics for Semantic Segmentation: Optimization
and Evaluation of Fine-grained Intersection over Union [113.20223082664681]
We propose the use of fine-grained mIoUs along with corresponding worst-case metrics.
These fine-grained metrics offer less bias towards large objects, richer statistical information, and valuable insights into model and dataset auditing.
Our benchmark study highlights the necessity of not basing evaluations on a single metric and confirms that fine-grained mIoUs reduce the bias towards large objects.
arXiv Detail & Related papers (2023-10-30T03:45:15Z) - Improving Long Context Document-Level Machine Translation [51.359400776242786]
Document-level context for neural machine translation (NMT) is crucial to improve translation consistency and cohesion.
Many works have been published on the topic of document-level NMT, but most restrict the system to just local context.
We propose a constrained attention variant that focuses the attention on the most relevant parts of the sequence, while simultaneously reducing the memory consumption.
arXiv Detail & Related papers (2023-06-08T13:28:48Z) - Revisiting text decomposition methods for NLI-based factuality scoring
of summaries [9.044665059626958]
We show that fine-grained decomposition is not always a winning strategy for factuality scoring.
We also show that small changes to previously proposed entailment-based scoring methods can result in better performance.
arXiv Detail & Related papers (2022-11-30T09:54:37Z) - SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that system-level correlations of our proposed metric with a model-based matching function outperforms all competing metrics.
arXiv Detail & Related papers (2022-08-01T17:58:05Z) - Specialized Document Embeddings for Aspect-based Similarity of Research
Papers [4.661692753666685]
We treat aspect-based similarity as a classical vector similarity problem in aspect-specific embedding spaces.
We represent a document not as a single generic embedding but as multiple specialized embeddings.
Our approach mitigates potential risks arising from implicit biases by making them explicit.
arXiv Detail & Related papers (2022-03-28T07:35:26Z) - Representing Mixtures of Word Embeddings with Mixtures of Topic
Embeddings [46.324584649014284]
A topic model is often formulated as a generative model that explains how each word of a document is generated given a set of topics and document-specific topic proportions.
This paper introduces a new topic-modeling framework where each document is viewed as a set of word embedding vectors and each topic is modeled as an embedding vector in the same embedding space.
Embedding the words and topics in the same vector space, we define a method to measure the semantic difference between the embedding vectors of the words of a document and these of the topics, and optimize the topic embeddings to minimize the expected difference over all documents.
arXiv Detail & Related papers (2022-03-03T08:46:23Z) - Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions on their positions.
Experiments on Semantic Textual Similarity show NDD to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
arXiv Detail & Related papers (2021-10-04T03:59:15Z) - Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on the Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as -- or better -- than traditional approaches to problems arising in short text.
arXiv Detail & Related papers (2021-06-15T20:55:55Z) - Be More with Less: Hypergraph Attention Networks for Inductive Text
Classification [56.98218530073927]
Graph neural networks (GNNs) have received increasing attention in the research community and demonstrated their promising results on this canonical task.
Despite the success, their performance could be largely jeopardized in practice since they are unable to capture high-order interaction between words.
We propose a principled model -- hypergraph attention networks (HyperGAT) which can obtain more expressive power with less computational consumption for text representation learning.
arXiv Detail & Related papers (2020-11-01T00:21:59Z) - Weakly-Supervised Aspect-Based Sentiment Analysis via Joint
Aspect-Sentiment Topic Embedding [71.2260967797055]
We propose a weakly-supervised approach for aspect-based sentiment analysis.
We learn sentiment, aspect> joint topic embeddings in the word embedding space.
We then use neural models to generalize the word-level discriminative information.
arXiv Detail & Related papers (2020-10-13T21:33:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.