LLM-Assisted Content Conditional Debiasing for Fair Text Embedding
- URL: http://arxiv.org/abs/2402.14208v3
- Date: Mon, 24 Jun 2024 04:49:16 GMT
- Title: LLM-Assisted Content Conditional Debiasing for Fair Text Embedding
- Authors: Wenlong Deng, Blair Chen, Beidi Zhao, Chiyu Zhang, Xiaoxiao Li, Christos Thrampoulidis
- Abstract summary: This paper proposes a novel method for learning fair text embeddings.
We define a novel content-conditional equal distance (CCED) fairness for text embeddings.
We also introduce a content-conditional debiasing (CCD) loss to ensure that embeddings of texts with different sensitive attributes but identical content maintain the same distance from the embedding of their corresponding neutral text.
- Score: 37.92120550031469
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mitigating biases in machine learning models has become an increasing concern in Natural Language Processing (NLP), particularly in developing fair text embeddings, which are crucial yet challenging for real-world applications like search engines. In response, this paper proposes a novel method for learning fair text embeddings. First, we define a novel content-conditional equal distance (CCED) fairness for text embeddings, ensuring content-conditional independence between sensitive attributes and text embeddings. Building on CCED, we introduce a content-conditional debiasing (CCD) loss to ensure that embeddings of texts with different sensitive attributes but identical content maintain the same distance from the embedding of their corresponding neutral text. Additionally, we tackle the issue of insufficient training data by using Large Language Models (LLMs) with instructions to fairly augment texts into different sensitive groups. Our extensive evaluations show that our approach effectively enhances fairness while maintaining the utility of embeddings. Furthermore, our augmented dataset, combined with the CCED metric, serves as a new benchmark for evaluating fairness.
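As a rough illustration of the CCD idea, the sketch below penalizes unequal distances between attribute-specific embeddings and the embedding of their neutral counterpart. The function name, batch layout, and squared-deviation penalty are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def ccd_loss(group_embs: torch.Tensor, neutral_emb: torch.Tensor) -> torch.Tensor:
    """Content-conditional debiasing sketch (illustrative, not the paper's code).

    group_embs:  (G, D) embeddings of G sensitive-attribute rewrites of
                 the SAME content (e.g., male/female versions of a text).
    neutral_emb: (D,) embedding of the attribute-neutral version.

    Penalizes the spread of the G distances to the neutral embedding,
    so that every rewrite stays equidistant from it.
    """
    dists = torch.norm(group_embs - neutral_emb.unsqueeze(0), dim=1)  # (G,)
    return ((dists - dists.mean()) ** 2).mean()

# Toy usage: three attribute rewrites of one sentence, 768-d embeddings.
group = F.normalize(torch.randn(3, 768), dim=1)
neutral = F.normalize(torch.randn(768), dim=0)
print(ccd_loss(group, neutral))  # scalar; zero when all distances match
```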
Related papers
- Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in CLIP [46.53595526049201]
A text encoder within Vision-Language Models (VLMs) like CLIP plays a crucial role in translating textual input into an embedding space shared with images.
We propose a framework of Semantic Token Reweighting to build Interpretable text embeddings (SToRI).
SToRI refines the text encoding process in CLIP by differentially weighting semantic elements based on contextual importance.
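A minimal sketch of the reweighting idea, assuming per-token importance weights and simple weighted pooling; the actual SToRI weighting scheme may differ.

```python
import torch

def reweighted_text_embedding(token_embs: torch.Tensor,
                              weights: torch.Tensor) -> torch.Tensor:
    """Pool token embeddings with per-token importance weights.

    token_embs: (T, D) token embeddings from a text encoder such as CLIP's.
    weights:    (T,) non-negative importance scores; raising one token's
                weight steers the sentence embedding toward its meaning.
    """
    w = weights / weights.sum()                       # normalize to sum to 1
    return (w.unsqueeze(1) * token_embs).sum(dim=0)   # (D,) pooled embedding

# Toy usage: emphasize the third token of a five-token sentence.
toks = torch.randn(5, 512)
w = torch.tensor([1.0, 1.0, 3.0, 1.0, 1.0])
emb = reweighted_text_embedding(toks, w)
```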
arXiv Detail & Related papers (2024-10-11T02:42:13Z)
- Analysing Zero-Shot Readability-Controlled Sentence Simplification [54.09069745799918]
We investigate how different types of contextual information affect a model's ability to generate sentences with the desired readability.
Results show that all tested models struggle to simplify sentences, due both to model limitations and to characteristics of the source sentences.
Our experiments also highlight the need for better automatic evaluation metrics tailored to readability-controlled text simplification (RCTS).
arXiv Detail & Related papers (2024-09-30T12:36:25Z)
- An efficient text augmentation approach for contextualized Mandarin speech recognition [4.600045052545344]
Our study proposes to leverage extensive text-only datasets and contextualize pre-trained ASR models.
To contextualize a pre-trained CIF-based ASR model, we construct a codebook using limited speech-text data.
Our experiments on diverse Mandarin test sets demonstrate that our TA approach significantly boosts recognition performance.
arXiv Detail & Related papers (2024-06-14T11:53:14Z)
- Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
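A toy sketch of how the synthetic-data generation step might be orchestrated; the prompt template, JSON schema, and `call_llm` wrapper are hypothetical, not the paper's actual pipeline.

```python
import json

def build_synthesis_prompt(task: str, language: str) -> str:
    """Ask an LLM for one synthetic retrieval triple (template is illustrative)."""
    return (
        f"Brainstorm one '{task}' text-embedding example in {language}. "
        'Reply as JSON: {"query": ..., "positive": ..., "negative": ...}'
    )

def parse_triple(llm_reply: str) -> tuple[str, str, str]:
    """Turn the model's JSON reply into a (query, positive, negative) triple."""
    d = json.loads(llm_reply)
    return d["query"], d["positive"], d["negative"]

# call_llm(prompt) stands in for any chat-completion API (hypothetical):
# triples = [parse_triple(call_llm(build_synthesis_prompt(t, lang)))
#            for t in tasks for lang in languages]
```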
arXiv Detail & Related papers (2023-12-31T02:13:18Z)
- Constructing Vec-tionaries to Extract Message Features from Texts: A Case Study of Moral Appeals [5.336592570916432]
We present an approach to construct vec-tionary measurement tools that boost validated dictionaries with word embeddings.
A vec-tionary can produce additional metrics to capture the ambivalence of a message feature beyond its strength in texts.
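One way to read the vec-tionary idea in code, assuming the dictionary is distilled into a single embedding-space axis (a simplification, not the paper's construction); strength and ambivalence then fall out as the mean and spread of per-word projections.

```python
import numpy as np

def vectionary_scores(word_vecs: np.ndarray, axis: np.ndarray):
    """Score one document against a vec-tionary axis.

    word_vecs: (N, D) embeddings of the document's words.
    axis:      (D,) direction distilled from a validated dictionary,
               e.g. the mean embedding of its seed words (an assumption).

    Returns (strength, ambivalence): the mean projection onto the axis
    and the spread of projections, the latter capturing mixed signals.
    """
    axis = axis / np.linalg.norm(axis)
    proj = word_vecs @ axis            # per-word alignment with the axis
    return float(proj.mean()), float(proj.std())

# Toy usage with random 300-d vectors standing in for a 20-word document.
strength, ambivalence = vectionary_scores(np.random.randn(20, 300),
                                          np.random.randn(300))
```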
arXiv Detail & Related papers (2023-12-10T20:37:29Z)
- Text Attribute Control via Closed-Loop Disentanglement [72.2786244367634]
We propose a novel approach that achieves robust control of attributes while enhancing content preservation.
In this paper, we use a semi-supervised contrastive learning method to encourage the disentanglement of attributes in latent spaces.
We conducted experiments on three text datasets, including the Yelp Service review dataset, the Amazon Product review dataset, and the GoEmotions dataset.
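A bare-bones sketch of the contrastive ingredient described above, reduced here to a supervised contrastive loss on the attribute slice of the latent code; the closed-loop mechanism and the paper's exact objective are omitted.

```python
import torch
import torch.nn.functional as F

def attribute_contrastive_loss(z_attr: torch.Tensor,
                               labels: torch.Tensor,
                               tau: float = 0.1) -> torch.Tensor:
    """Supervised contrastive loss on the attribute part of a latent code.

    z_attr: (B, D) attribute latents; labels: (B,) attribute ids.
    Pulls same-attribute latents together and pushes others apart, which
    concentrates attribute information in this subspace. Assumes every
    attribute value appears at least twice in the batch.
    """
    z = F.normalize(z_attr, dim=1)
    sim = (z @ z.t()) / tau                            # (B, B) similarities
    eye = torch.eye(len(z), dtype=torch.bool)
    sim = sim.masked_fill(eye, float("-inf"))          # exclude self-pairs
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    return -log_prob[pos].mean()

# Toy usage: six latents covering three attribute values, twice each.
print(attribute_contrastive_loss(torch.randn(6, 32),
                                 torch.tensor([0, 0, 1, 1, 2, 2])))
```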
arXiv Detail & Related papers (2023-12-01T01:26:38Z)
- Demonstrations Are All You Need: Advancing Offensive Content Paraphrasing using In-Context Learning [10.897468059705238]
Supervised paraphrasers rely heavily on large quantities of labelled data to help preserve meaning and intent.
In this paper, we aim to assist practitioners in developing usable paraphrasers by exploring In-Context Learning (ICL) with large language models (LLMs).
Our study focuses on key factors such as the number and order of demonstrations, the exclusion of the prompt instruction, and the reduction in measured toxicity.
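A minimal sketch of assembling such an ICL prompt, with the demonstration count, their order, and the instruction toggle exposed as exactly the knobs the study varies; the template wording is illustrative.

```python
def build_icl_prompt(demos: list[tuple[str, str]], query: str,
                     include_instruction: bool = True) -> str:
    """Assemble an in-context-learning prompt for detoxifying paraphrase.

    demos: (offensive_text, polite_paraphrase) pairs; reordering or
    truncating this list reproduces the study's demonstration factors.
    """
    parts = []
    if include_instruction:
        parts.append("Paraphrase the text politely while preserving its meaning.")
    for src, tgt in demos:
        parts.append(f"Text: {src}\nParaphrase: {tgt}")
    parts.append(f"Text: {query}\nParaphrase:")
    return "\n\n".join(parts)

# Toy usage with two placeholder demonstrations.
prompt = build_icl_prompt([("<offensive 1>", "<polite 1>"),
                           ("<offensive 2>", "<polite 2>")],
                          "<text to paraphrase>")
```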
arXiv Detail & Related papers (2023-10-16T16:18:55Z)
- Conditional Supervised Contrastive Learning for Fair Text Classification [59.813422435604025]
We study learning fair representations that satisfy a notion of fairness known as equalized odds for text classification via contrastive learning.
Specifically, we first theoretically analyze the connections between learning representations with a fairness constraint and conditional supervised contrastive objectives.
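A hedged sketch of one plausible instantiation: within each class, pairs drawn from different sensitive groups are treated as positives, so that given the label the representation carries no group signal. This pairing rule is an illustrative reading, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def fair_supcon_loss(z: torch.Tensor, y: torch.Tensor, a: torch.Tensor,
                     tau: float = 0.1) -> torch.Tensor:
    """Conditional supervised contrastive sketch aimed at equalized odds.

    z: (B, D) representations; y: (B,) class labels; a: (B,) sensitive
    attributes. Positives are same-class examples from a DIFFERENT group.
    Assumes every class in the batch spans more than one group.
    """
    z = F.normalize(z, dim=1)
    sim = (z @ z.t()) / tau
    eye = torch.eye(len(z), dtype=torch.bool)
    sim = sim.masked_fill(eye, float("-inf"))          # exclude self-pairs
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    pos = (y.unsqueeze(0) == y.unsqueeze(1)) & (a.unsqueeze(0) != a.unsqueeze(1))
    return -log_prob[pos].mean()

# Toy usage: four examples, two classes, two groups per class.
print(fair_supcon_loss(torch.randn(4, 16),
                       torch.tensor([0, 0, 1, 1]),
                       torch.tensor([0, 1, 0, 1])))
```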
arXiv Detail & Related papers (2022-05-23T17:38:30Z)
- Improving Disentangled Text Representation Learning with Information-Theoretic Guidance [99.68851329919858]
The discrete nature of natural language makes disentangling textual representations more challenging.
Inspired by information theory, we propose a novel method that effectively yields disentangled representations of text.
Experiments on both conditional text generation and text-style transfer demonstrate the high quality of our disentangled representation.
arXiv Detail & Related papers (2020-06-01T03:36:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.