Flexible text generation for counterfactual fairness probing
- URL: http://arxiv.org/abs/2206.13757v1
- Date: Tue, 28 Jun 2022 05:07:20 GMT
- Title: Flexible text generation for counterfactual fairness probing
- Authors: Zee Fryer, Vera Axelrod, Ben Packer, Alex Beutel, Jilin Chen, Kellie
Webster
- Abstract summary: A common approach for testing fairness issues in text-based classifiers is through the use of counterfactuals.
Existing counterfactual generation methods rely on wordlists or templates, producing simple counterfactuals that don't take into account grammar, context, or subtle sensitive attribute references.
In this paper, we introduce a task for generating counterfactuals that overcomes these shortcomings, and demonstrate how large language models (LLMs) can be leveraged to make progress on this task.
- Score: 8.262741696221143
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A common approach for testing fairness issues in text-based classifiers is
through the use of counterfactuals: does the classifier output change if a
sensitive attribute in the input is changed? Existing counterfactual generation
methods typically rely on wordlists or templates, producing simple
counterfactuals that don't take into account grammar, context, or subtle
sensitive attribute references, and could miss issues that the wordlist
creators had not considered. In this paper, we introduce a task for generating
counterfactuals that overcomes these shortcomings, and demonstrate how large
language models (LLMs) can be leveraged to make progress on this task. We show
that this LLM-based method can produce complex counterfactuals that existing
methods cannot, comparing the performance of various counterfactual generation
methods on the Civil Comments dataset and showing their value in evaluating a
toxicity classifier.
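As a concrete illustration, the wordlist-based baseline that the abstract critiques can be sketched in a few lines. The term pairs and the toy classifier below are illustrative assumptions, not artifacts from the paper:

```python
import re

# Paired sensitive-attribute terms: the simple wordlist approach the
# abstract describes. The pairs and the classifier are toy assumptions.
COUNTERFACTUAL_PAIRS = {"he": "she", "his": "her", "man": "woman"}

def generate_counterfactual(text: str) -> str:
    """Swap each wordlist term for its paired term at word boundaries."""
    pattern = r"\b(" + "|".join(COUNTERFACTUAL_PAIRS) + r")\b"
    return re.sub(pattern,
                  lambda m: COUNTERFACTUAL_PAIRS[m.group(0).lower()],
                  text, flags=re.IGNORECASE)

def probe_fairness(classifier, text: str) -> bool:
    """True if the classifier's output changes when the sensitive
    attribute in the input is changed -- the core counterfactual test."""
    return classifier(text) != classifier(generate_counterfactual(text))

# Toy toxicity classifier with a deliberate gender-dependent bug.
def toy_classifier(text: str) -> str:
    return "toxic" if "woman" in text.lower() else "ok"

print(probe_fairness(toy_classifier, "That man is loud."))  # True: unfair
```

Such one-for-one swaps ignore grammar, context, and indirect attribute references (e.g. names or pronoun chains), which is precisely the limitation motivating the paper's LLM-based generation task.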
Related papers
- A Comparative Analysis of Counterfactual Explanation Methods for Text Classifiers [0.0]
We evaluate five methods for generating counterfactual explanations for a BERT text classifier.
Established white-box substitution-based methods are effective at generating valid counterfactuals that change the classifier's output.
Newer methods based on large language models (LLMs) excel at producing natural and linguistically plausible text counterfactuals.
arXiv Detail & Related papers (2024-11-04T22:01:52Z) - Localizing Factual Inconsistencies in Attributable Text Generation [91.981439746404]
We introduce QASemConsistency, a new formalism for localizing factual inconsistencies in attributable text generation.
We first demonstrate the effectiveness of the QASemConsistency methodology for human annotation.
We then implement several methods for automatically detecting localized factual inconsistencies.
arXiv Detail & Related papers (2024-10-09T22:53:48Z) - Learning Context-aware Classifier for Semantic Segmentation [88.88198210948426]
In this paper, contextual hints are exploited via learning a context-aware classifier.
Our method is model-agnostic and can be easily applied to generic segmentation models.
With only negligible additional parameters and +2% inference time, decent performance gain has been achieved on both small and large models.
arXiv Detail & Related papers (2023-03-21T07:00:35Z) - Explaining Image Classifiers Using Contrastive Counterfactuals in
Generative Latent Spaces [12.514483749037998]
We introduce a novel method to generate causal and yet interpretable counterfactual explanations for image classifiers.
We use this framework to obtain contrastive and causal sufficiency and necessity scores as global explanations for black-box classifiers.
arXiv Detail & Related papers (2022-06-10T17:54:46Z) - Classifiers are Better Experts for Controllable Text Generation [63.17266060165098]
We show that the proposed method significantly outperforms the recent PPLM, GeDi, and DExperts approaches on perplexity (PPL) and on the sentiment accuracy of generated texts as measured by an external classifier.
At the same time, it is easier to implement and tune, and has significantly fewer restrictions and requirements.
arXiv Detail & Related papers (2022-05-15T12:58:35Z) - Comparing Text Representations: A Theory-Driven Approach [2.893558866535708]
We adapt general tools from computational learning theory to fit the specific characteristics of text datasets.
We present a method to evaluate the compatibility between representations and tasks.
This method provides a calibrated, quantitative measure of the difficulty of a classification-based NLP task.
arXiv Detail & Related papers (2021-09-15T17:48:19Z) - Experiments with adversarial attacks on text genres [0.0]
Neural models based on pre-trained transformers, such as BERT or XLM-RoBERTa, demonstrate SOTA results in many NLP tasks.
We show that embedding-based algorithms, which can replace some of the "most significant" words with similar words, can influence model predictions in a significant proportion of cases.
arXiv Detail & Related papers (2021-07-05T19:37:59Z) - Evaluating Factuality in Generation with Dependency-level Entailment [57.5316011554622]
We propose a new formulation of entailment that decomposes it at the level of dependency arcs.
We show that our dependency arc entailment model trained on this data can identify factual inconsistencies in paraphrasing and summarization better than sentence-level methods.
arXiv Detail & Related papers (2020-10-12T06:43:10Z) - Contextualized Perturbation for Textual Adversarial Attack [56.370304308573274]
Adversarial examples expose the vulnerabilities of natural language processing (NLP) models.
This paper presents CLARE, a ContextuaLized AdversaRial Example generation model that produces fluent and grammatical outputs.
arXiv Detail & Related papers (2020-09-16T06:53:15Z) - Interpretable Entity Representations through Large-Scale Typing [61.4277527871572]
We present an approach to creating entity representations that are human readable and achieve high performance out of the box.
Our representations are vectors whose values correspond to posterior probabilities over fine-grained entity types.
We show that it is possible to reduce the size of our type set in a learning-based way for particular domains.
arXiv Detail & Related papers (2020-04-30T23:58:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.