Expert-Guided Extinction of Toxic Tokens for Debiased Generation
- URL: http://arxiv.org/abs/2405.19299v1
- Date: Wed, 29 May 2024 17:26:52 GMT
- Title: Expert-Guided Extinction of Toxic Tokens for Debiased Generation
- Authors: Xueyao Sun, Kaize Shi, Haoran Tang, Guandong Xu, Qing Li,
- Abstract summary: Large language models (LLMs) can elicit social bias during generations, especially when inference with toxic prompts.
We propose the Expert-Guided Extinction of Toxic Tokens for Debiased Generation (EXPOSED) to eliminate the undesired harmful outputs.
- Score: 16.99272541576084
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) can elicit social bias during generations, especially when inference with toxic prompts. Controlling the sensitive attributes in generation encounters challenges in data distribution, generalizability, and efficiency. Specifically, fine-tuning and retrieval demand extensive unbiased corpus, while direct prompting requires meticulously curated instructions for correcting the output in multiple rounds of thoughts but poses challenges on memory and inference latency. In this work, we propose the Expert-Guided Extinction of Toxic Tokens for Debiased Generation (EXPOSED) to eliminate the undesired harmful outputs for LLMs without the aforementioned requirements. EXPOSED constructs a debiasing expert based on the abundant toxic corpus to expose and elicit the potentially dangerous tokens. It then processes the output to the LLMs and constructs a fair distribution by suppressing and attenuating the toxic tokens. EXPOSED is evaluated on fairness benchmarks over three LLM families. Extensive experiments demonstrate that compared with other baselines, the proposed EXPOSED significantly reduces the potential social bias while balancing fairness and generation performance.
Related papers
- LIDAO: Towards Limited Interventions for Debiasing (Large) Language Models [19.18522268167047]
Large language models (LLMs) have achieved impressive performance on various natural language generation tasks.
However, they suffer from generating negative and harmful contents that are biased against certain demographic groups.
We propose LIDAO, a framework to debias a (L)LM at a better fluency provably.
arXiv Detail & Related papers (2024-06-01T20:12:54Z) - A Causal Explainable Guardrails for Large Language Models [29.441292837667415]
Large Language Models (LLMs) have shown impressive performance in natural language tasks, but their outputs can exhibit undesirable attributes or biases.
Existing methods for steering LLMs towards desired attributes often assume unbiased representations and rely solely on steering prompts.
We propose LLMGuardaril, a novel framework that incorporates causal analysis and adversarial learning to obtain unbiased steering representations.
arXiv Detail & Related papers (2024-05-07T09:55:05Z) - Understanding Privacy Risks of Embeddings Induced by Large Language Models [75.96257812857554]
Large language models show early signs of artificial general intelligence but struggle with hallucinations.
One promising solution is to store external knowledge as embeddings, aiding LLMs in retrieval-augmented generation.
Recent studies experimentally showed that the original text can be partially reconstructed from text embeddings by pre-trained language models.
arXiv Detail & Related papers (2024-04-25T13:10:48Z) - Take Care of Your Prompt Bias! Investigating and Mitigating Prompt Bias in Factual Knowledge Extraction [56.17020601803071]
Recent research shows that pre-trained language models (PLMs) suffer from "prompt bias" in factual knowledge extraction.
This paper aims to improve the reliability of existing benchmarks by thoroughly investigating and mitigating prompt bias.
arXiv Detail & Related papers (2024-03-15T02:04:35Z) - Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification [116.77055746066375]
Large language models (LLMs) are notorious for hallucinating, i.e., producing erroneous claims in their output.
We propose a novel fact-checking and hallucination detection pipeline based on token-level uncertainty quantification.
arXiv Detail & Related papers (2024-03-07T17:44:17Z) - Self-Debiasing Large Language Models: Zero-Shot Recognition and
Reduction of Stereotypes [73.12947922129261]
We leverage the zero-shot capabilities of large language models to reduce stereotyping.
We show that self-debiasing can significantly reduce the degree of stereotyping across nine different social groups.
We hope this work opens inquiry into other zero-shot techniques for bias mitigation.
arXiv Detail & Related papers (2024-02-03T01:40:11Z) - Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z) - Stochastic Parrots Looking for Stochastic Parrots: LLMs are Easy to
Fine-Tune and Hard to Detect with other LLMs [6.295207672539996]
We show that an attacker with access to detectors' reference human texts and output can fully frustrate the detector training.
We warn against the temptation to transpose the conclusions obtained in RNN-driven text GANs to LLMs.
These results have critical implications for the detection and prevention of malicious use of generative language models.
arXiv Detail & Related papers (2023-04-18T13:05:01Z) - Unified Detoxifying and Debiasing in Language Generation via
Inference-time Adaptive Optimization [32.50246008433889]
Pre-trained language models (PLMs) have prospered in various natural language generation (NLG) tasks due to their ability to generate fairly fluent text.
These models are observed to capture and reproduce harmful contents in training corpora, typically toxic language and social biases, raising severe moral issues.
We propose the first unified framework of detoxifying and debiasing called UDDIA, which jointly formalizes these two problems as rectifying the output space.
arXiv Detail & Related papers (2022-10-10T08:45:25Z) - Unsupervised Learning of Debiased Representations with Pseudo-Attributes [85.5691102676175]
We propose a simple but effective debiasing technique in an unsupervised manner.
We perform clustering on the feature embedding space and identify pseudoattributes by taking advantage of the clustering results.
We then employ a novel cluster-based reweighting scheme for learning debiased representation.
arXiv Detail & Related papers (2021-08-06T05:20:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.