Related papers: Paying Alignment Tax with Contrastive Learning

URL: http://arxiv.org/abs/2505.19327v1
Date: Sun, 25 May 2025 21:26:18 GMT
Title: Paying Alignment Tax with Contrastive Learning
Authors: Buse Sibel Korkmaz, Rahul Nair, Elizabeth M. Daly, Antonio del Rio Chanona
Abstract summary: Current debiasing approaches often result in a degradation in model capabilities such as factual accuracy and knowledge retention. We propose a contrastive learning framework that learns through carefully constructed positive and negative examples.
Score: 6.232983467016873
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Current debiasing approaches often result in a degradation in model capabilities such as factual accuracy and knowledge retention. Through systematic evaluation across multiple benchmarks, we demonstrate that existing debiasing methods face fundamental trade-offs, particularly in smaller models, leading to reduced truthfulness, knowledge loss, or unintelligible outputs. To address these limitations, we propose a contrastive learning framework that learns through carefully constructed positive and negative examples. Our approach introduces contrast computation and dynamic loss scaling to balance bias mitigation with faithfulness preservation. Experimental results across multiple model scales demonstrate that our method achieves substantial improvements in both toxicity reduction and faithfulness preservation. Most importantly, we show that our framework is the first to consistently improve both metrics simultaneously, avoiding the capability degradation characteristic of existing approaches. These results suggest that explicit modeling of both positive and negative examples through contrastive learning could be a promising direction for reducing the alignment tax in language model debiasing.

Related papers

An Attention-based Framework for Fair Contrastive Learning [2.1605931466490795]
We propose a new method for fair contrastive learning that employs an attention mechanism to model bias-causing interactions. Our attention mechanism avoids bias-causing samples that confound the model and focuses on bias-reducing samples that help learn semantically meaningful representations.
arXiv Detail & Related papers (2024-11-22T07:11:35Z)
Dissecting Representation Misalignment in Contrastive Learning via Influence Function [15.28417468377201]
We introduce the Extended Influence Function for Contrastive Loss (ECIF), an influence function crafted for contrastive loss. ECIF considers both positive and negative samples and provides a closed-form approximation of contrastive learning models. Building upon ECIF, we develop a series of algorithms for data evaluation, misalignment detection, and misprediction trace-back tasks.
arXiv Detail & Related papers (2024-11-18T15:45:41Z)
Learning Confidence Bounds for Classification with Imbalanced Data [42.690254618937196]
We propose a novel framework that leverages learning theory and concentration inequalities to overcome the shortcomings of traditional solutions. Our method can effectively adapt to the varying degrees of imbalance across different classes, resulting in more robust and reliable classification outcomes.
arXiv Detail & Related papers (2024-07-16T16:02:27Z)
Utilizing Adversarial Examples for Bias Mitigation and Accuracy Enhancement [3.0820287240219795]
We propose a novel approach to mitigate biases in computer vision models by utilizing counterfactual generation and fine-tuning. Our approach leverages a curriculum learning framework combined with a fine-grained adversarial loss to fine-tune the model using adversarial examples. We validate our approach through both qualitative and quantitative assessments, demonstrating improved bias mitigation and accuracy compared to existing methods.
arXiv Detail & Related papers (2024-04-18T00:41:32Z)
Time-Series Contrastive Learning against False Negatives and Class Imbalance [17.43801009251228]
We conduct a theoretical analysis and find that existing methods have overlooked two fundamental issues: false negatives and class imbalance inherent in the InfoNCE loss-based framework. We introduce a straightforward modification grounded in the SimCLR framework that is universally applicable to models engaged in the instance discrimination task. We perform semi-supervised consistency classification and enhance the representative ability of minority classes.
arXiv Detail & Related papers (2023-12-19T08:38:03Z)
Towards Calibrated Robust Fine-Tuning of Vision-Language Models [97.19901765814431]
This work proposes a robust fine-tuning method that improves both OOD accuracy and confidence calibration simultaneously in vision language models. We show that both OOD classification and OOD calibration errors have a shared upper bound consisting of two terms of ID data. Based on this insight, we design a novel framework that conducts fine-tuning with a constrained multimodal contrastive loss enforcing a larger smallest singular value.
arXiv Detail & Related papers (2023-11-03T05:41:25Z)
Unmasking Bias in Diffusion Model Training [40.90066994983719]
Denoising diffusion models have emerged as a dominant approach for image generation. They still suffer from slow convergence in training and color shift issues in sampling. In this paper, we identify that these obstacles can be largely attributed to bias and suboptimality inherent in the default training paradigm.
arXiv Detail & Related papers (2023-10-12T16:04:41Z)
Contrastive Learning for Fair Representations [50.95604482330149]
Trained classification models can unintentionally lead to biased representations and predictions. Existing debiasing methods for classification models, such as adversarial training, are often expensive to train and difficult to optimise. We propose a method for mitigating bias by incorporating contrastive learning, in which instances sharing the same class label are encouraged to have similar representations.
arXiv Detail & Related papers (2021-09-22T10:47:51Z)
Incremental False Negative Detection for Contrastive Learning [95.68120675114878]
We introduce a novel incremental false negative detection for self-supervised contrastive learning. During contrastive learning, we discuss two strategies to explicitly remove the detected false negatives. Our proposed method outperforms other self-supervised contrastive learning frameworks on multiple benchmarks within a limited compute.
arXiv Detail & Related papers (2021-06-07T15:29:14Z)
Solving Inefficiency of Self-supervised Representation Learning [87.30876679780532]
Existing contrastive learning methods suffer from very low learning efficiency. Under-clustering and over-clustering problems are major obstacles to learning efficiency. We propose a novel self-supervised learning framework using a median triplet loss.
arXiv Detail & Related papers (2021-04-18T07:47:10Z)
Counterfactual Representation Learning with Balancing Weights [74.67296491574318]
Key to causal inference with observational data is achieving balance in predictive features associated with each treatment type. Recent literature has explored representation learning to achieve this goal. We develop an algorithm for flexible, scalable and accurate estimation of causal effects.
arXiv Detail & Related papers (2020-10-23T19:06:03Z)
Learning the Truth From Only One Side of the Story [58.65439277460011]
We focus on generalized linear models and show that without adjusting for this sampling bias, the model may converge suboptimally or even fail to converge to the optimal solution. We propose an adaptive approach that comes with theoretical guarantees and show that it outperforms several existing methods empirically.
arXiv Detail & Related papers (2020-06-08T18:20:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.