Self-Regularization with Sparse Autoencoders for Controllable LLM-based Classification
- URL: http://arxiv.org/abs/2502.14133v3
- Date: Sun, 27 Jul 2025 20:44:09 GMT
- Title: Self-Regularization with Sparse Autoencoders for Controllable LLM-based Classification
- Authors: Xuansheng Wu, Wenhao Yu, Xiaoming Zhai, Ninghao Liu
- Abstract summary: We propose a novel framework to identify and regularize unintended features in large language model (LLM) latent spaces. We evaluate the proposed framework on three real-world tasks, including toxic chat detection, reward modeling, and disease diagnosis.
- Score: 29.74457390987092
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Modern text classification methods heavily rely on contextual embeddings from large language models (LLMs). Compared to human-engineered features, these embeddings provide automatic and effective representations for training classification models. However, they also introduce a challenge: we lose the ability to manually remove unintended features, such as sensitive or task-irrelevant features, to guarantee regulatory compliance or improve the generalizability of classification models. This limitation arises because LLM embeddings are opaque and difficult to interpret. In this paper, we propose a novel framework to identify and regularize unintended features in the LLM latent space. Specifically, we first pre-train a sparse autoencoder (SAE) to extract interpretable features from LLM latent spaces. To ensure the SAE captures task-specific features, we further fine-tune it on task-specific datasets. When training the classification model, we apply a simple and effective regularizer that minimizes the similarity between the classifier weights and the identified unintended features, removing their impact on classification. We evaluate the proposed framework on three real-world tasks, including toxic chat detection, reward modeling, and disease diagnosis. Results show that the proposed self-regularization framework improves the classifier's generalizability by regularizing features that are not semantically correlated with the task. This work pioneers controllable text classification on LLM latent spaces by leveraging interpreted features to address generalizability, fairness, and privacy challenges. The code and data are publicly available at https://github.com/JacksonWuxs/Controllable_LLM_Classifier.
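The regularizer described above lends itself to a compact sketch. The following is a minimal illustration, not the authors' implementation (see the linked repository for that): the linear classifier over SAE activations, the cosine-similarity penalty, and the strength `lam` are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def self_regularized_loss(logits, labels, classifier_weight,
                          unintended_dirs, lam=0.1):
    """Cross-entropy plus a penalty on the alignment between classifier
    weights and identified unintended SAE feature directions.

    logits:            (batch, num_classes) classifier outputs
    labels:            (batch,) gold class indices
    classifier_weight: (num_classes, dim) linear classifier weights
    unintended_dirs:   (k, dim) directions of unintended features
    lam:               regularization strength (illustrative default)
    """
    ce = F.cross_entropy(logits, labels)
    w = F.normalize(classifier_weight, dim=-1)
    u = F.normalize(unintended_dirs, dim=-1)
    # Penalize cosine alignment so the classifier cannot rely on the
    # unintended directions.
    penalty = (w @ u.T).abs().mean()
    return ce + lam * penalty

# Toy usage: 16-dim SAE activations, 2 classes, 3 unintended directions.
clf = torch.nn.Linear(16, 2)
feats, labels = torch.randn(4, 16), torch.tensor([0, 1, 1, 0])
unintended = torch.randn(3, 16)
self_regularized_loss(clf(feats), labels, clf.weight, unintended).backward()
```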
Related papers
- Attribute-formed Class-specific Concept Space: Endowing Language Bottleneck Model with Better Interpretability and Scalability [54.420663939897686]
We propose the Attribute-formed Language Bottleneck Model (ALBM) to achieve interpretable image recognition.
ALBM organizes concepts in the attribute-formed class-specific space, where concepts are descriptions of specific attributes for specific classes.
To further improve interpretability, we propose Visual Attribute Prompt Learning (VAPL) to extract visual features on fine-grained attributes.
arXiv Detail & Related papers (2025-03-26T07:59:04Z) - Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs). We find that fine-tuning text embedding models on LLM-generated texts yields excellent classification accuracy. We leverage LLMs as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
arXiv Detail & Related papers (2025-02-17T18:59:02Z) - Disentangling CLIP Features for Enhanced Localized Understanding [58.73850193789384]
We propose Unmix-CLIP, a novel framework designed to reduce mutual feature information (MFI) and improve feature disentanglement. For the COCO-14 dataset, Unmix-CLIP reduces feature similarity by 24.9%.
arXiv Detail & Related papers (2025-02-05T08:20:31Z) - Predicting the Performance of Black-box LLMs through Self-Queries [60.87193950962585]
As large language models (LLMs) are increasingly relied on in AI systems, predicting when they make mistakes is crucial. In this paper, we extract features of LLMs in a black-box manner by using follow-up prompts and taking the probabilities of different responses as representations. We demonstrate that training a linear model on these low-dimensional representations produces reliable predictors of model performance at the instance level.
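A rough, hedged illustration of that black-box recipe: elicit answer probabilities with a few follow-up prompts, stack them into a low-dimensional feature vector, and fit a linear model. The follow-up prompts are invented, and `p_yes` is a mock standing in for a real API's token probabilities.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

FOLLOWUPS = [
    "Are you confident in your answer? Reply yes or no.",
    "Would another expert agree with your answer? Reply yes or no.",
]

def p_yes(question: str, answer: str, followup: str) -> float:
    """Probability the model replies 'yes' (mocked; in practice read it
    from the black-box API's token log-probabilities)."""
    seed = abs(hash((question, answer, followup))) % 2**32
    return float(np.random.default_rng(seed).uniform())

def features(question: str, answer: str) -> np.ndarray:
    return np.array([p_yes(question, answer, f) for f in FOLLOWUPS])

qa = [("2+2?", "4"), ("2+3?", "6"), ("3*3?", "9"), ("5-1?", "3")]
correct = np.array([1, 0, 1, 0])  # was each original answer right?
X = np.stack([features(q, a) for q, a in qa])
predictor = LogisticRegression().fit(X, correct)  # instance-level predictor
```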
arXiv Detail & Related papers (2025-01-02T22:26:54Z) - Collaborative Feature-Logits Contrastive Learning for Open-Set Semi-Supervised Object Detection [75.02249869573994]
In open-set scenarios, the unlabeled dataset contains both in-distribution (ID) classes and out-of-distribution (OOD) classes. Applying semi-supervised detectors in such settings can lead to misclassifying OOD classes as ID classes. We propose a simple yet effective method, termed Collaborative Feature-Logits Detector (CFL-Detector).
arXiv Detail & Related papers (2024-11-20T02:57:35Z) - Exploring Iterative Controllable Summarization with Large Language Models [22.80433394369022]
Large language models (LLMs) have demonstrated remarkable performance in abstractive summarization tasks.
Our findings show that LLMs struggle more with numerical attributes than with linguistic attributes.
We propose a guide-to-explain framework (GTE) for controllable summarization.
arXiv Detail & Related papers (2024-11-19T12:36:02Z) - LLM-based feature generation from text for interpretable machine learning [0.0]
Existing text representations such as embeddings and bag-of-words are not suitable for rule learning due to their high dimensionality and absent or questionable feature-level interpretability.
This article explores whether large language models (LLMs) could address this by extracting a small number of interpretable features from text.
arXiv Detail & Related papers (2024-09-11T09:29:28Z) - AutoML-guided Fusion of Entity and LLM-based Representations for Document Classification [43.56253799373878]
This work demonstrates that injecting embedded information from knowledge bases can augment the performance of contemporary Large Language Model (LLM)-based representations for the task of text classification.
By considering automated machine learning (AutoML) with the fused representation space, we demonstrate it is possible to improve classification accuracy even if we use low-dimensional projections of the original representation space.
arXiv Detail & Related papers (2024-08-19T08:41:40Z) - Co-training for Low Resource Scientific Natural Language Inference [65.37685198688538]
We propose a novel co-training method that assigns weights to the distantly supervised labels based on the training dynamics of the classifiers.
By assigning importance weights instead of filtering out examples based on an arbitrary threshold on the predicted confidence, we maximize the usage of automatically labeled data.
The proposed method obtains an improvement of 1.5% in Macro F1 over the distant supervision baseline, and substantial improvements over several other strong SSL baselines.
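One minimal way to realize the weighting idea (a sketch under assumptions, not the paper's exact recipe) is an importance-weighted cross-entropy, where each distantly supervised example's loss is scaled by a weight derived from training dynamics:

```python
import torch
import torch.nn.functional as F

def weighted_ce(logits, labels, weights):
    """Importance-weighted cross-entropy: down-weight unreliable distant
    labels instead of filtering them at a hard confidence threshold."""
    per_example = F.cross_entropy(logits, labels, reduction="none")
    return (weights * per_example).sum() / weights.sum()

# Toy usage; in the paper's setting the weights would come from the
# classifiers' training dynamics rather than being fixed constants.
logits = torch.randn(4, 3, requires_grad=True)
labels = torch.tensor([0, 2, 1, 1])
weights = torch.tensor([0.9, 0.2, 0.7, 0.5])
weighted_ce(logits, labels, weights).backward()
```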
arXiv Detail & Related papers (2024-06-20T18:35:47Z) - SEER-ZSL: Semantic Encoder-Enhanced Representations for Generalized Zero-Shot Learning [0.6792605600335813]
Zero-Shot Learning (ZSL) presents the challenge of identifying categories not seen during training. We introduce Semantic Encoder-Enhanced Representations for Zero-Shot Learning (SEER-ZSL). First, we distill meaningful semantic information using a probabilistic encoder, enhancing semantic consistency and robustness. Second, we distill the visual space by exploiting the learned data distribution through an adversarially trained generator. Third, we align the distilled information, enabling a mapping of unseen categories onto the true data manifold.
arXiv Detail & Related papers (2023-12-20T15:18:51Z) - Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations [29.32704733570445]
We introduce Llama Guard, an input-output safeguard model geared towards Human-AI conversation use cases.
Llama Guard incorporates a safety risk taxonomy, a valuable tool for categorizing a specific set of safety risks.
Llama Guard demonstrates strong performance on existing benchmarks such as the OpenAI Moderation Evaluation dataset and ToxicChat.
arXiv Detail & Related papers (2023-12-07T19:40:50Z) - SPIN: Sparsifying and Integrating Internal Neurons in Large Language Models for Text Classification [6.227343685358882]
We present SPIN, a model-agnostic framework that sparsifies and integrates internal neurons from intermediate layers of Large Language Models for text classification.
SPIN significantly improves text classification accuracy, efficiency, and interpretability.
arXiv Detail & Related papers (2023-11-27T16:28:20Z) - Token Prediction as Implicit Classification to Identify LLM-Generated Text [37.89852204279844]
This paper introduces a novel approach for identifying the possible large language models (LLMs) involved in text generation.
Instead of adding an additional classification layer to a base LM, we reframe the classification task as a next-token prediction task.
We utilize the Text-to-Text Transfer Transformer (T5) model as the backbone for our experiments.
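A hedged sketch of this reframing with Hugging Face Transformers: the classifier is trained with the ordinary seq2seq language-modeling loss on a label word rather than through a dedicated classification head. The prompt prefix and label vocabulary here are assumptions for illustration.

```python
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tok = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

text = "The mitochondria is the powerhouse of the cell."
label_word = "machine"  # vs. "human"; the label vocabulary is illustrative

inputs = tok("classify origin: " + text, return_tensors="pt")
targets = tok(label_word, return_tensors="pt").input_ids
loss = model(**inputs, labels=targets).loss  # standard next-token LM loss
loss.backward()  # fine-tunes the LM to emit the label tokens

# At inference, compare the likelihoods of the candidate label words
# (or simply generate) instead of reading a classification head.
```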
arXiv Detail & Related papers (2023-11-15T06:33:52Z) - Waffling around for Performance: Visual Classification with Random Words and Broad Concepts [121.60918966567657]
WaffleCLIP is a framework for zero-shot visual classification which simply replaces LLM-generated descriptors with random character and word descriptors.
We conduct an extensive experimental study on the impact and shortcomings of additional semantics introduced with LLM-generated descriptors.
arXiv Detail & Related papers (2023-06-12T17:59:48Z) - Learning Common Rationale to Improve Self-Supervised Representation for Fine-Grained Visual Recognition Problems [61.11799513362704]
We propose learning an additional screening mechanism to identify discriminative clues commonly seen across instances and classes.
We show that a common rationale detector can be learned by simply exploiting the GradCAM induced from the SSL objective.
arXiv Detail & Related papers (2023-03-03T02:07:40Z) - Task-Specific Embeddings for Ante-Hoc Explainable Text Classification [6.671252951387647]
We propose an alternative training objective in which we learn task-specific embeddings of text.
Our proposed objective learns embeddings such that all texts that share the same target class label should be close together.
We present extensive experiments which show that the benefits of ante-hoc explainability and incremental learning come at no cost in overall classification accuracy.
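One plausible instantiation of that objective (a generic supervised contrastive sketch, not necessarily the paper's exact loss) pulls same-label embeddings together and pushes different labels apart:

```python
import torch
import torch.nn.functional as F

def same_class_proximity_loss(emb, labels, margin=0.5):
    """Pull same-label embeddings together, push different labels apart.
    Assumes the batch contains at least one same-class and one
    cross-class pair."""
    emb = F.normalize(emb, dim=-1)
    sim = emb @ emb.T                                   # pairwise cosine sims
    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # label-match matrix
    eye = torch.eye(len(labels), dtype=torch.bool)
    pos = sim[same & ~eye]                              # same class, no self-pairs
    neg = sim[~same]
    return F.relu(margin - pos.mean() + neg.mean())

emb = torch.randn(6, 32, requires_grad=True)            # text encoder outputs
labels = torch.tensor([0, 0, 1, 1, 2, 2])
same_class_proximity_loss(emb, labels).backward()
```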
arXiv Detail & Related papers (2022-11-30T19:56:25Z) - PatchMix Augmentation to Identify Causal Features in Few-shot Learning [55.64873998196191]
Few-shot learning aims to transfer knowledge learned from base categories with sufficient labelled data to novel categories with scarce known information.
We propose a novel data augmentation strategy dubbed as PatchMix that can break this spurious dependency.
We show that such an augmentation mechanism, different from existing ones, is able to identify the causal features.
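Details of PatchMix vary, but a minimal patch-level mixing sketch (an assumption, not taken from the paper) replaces a random patch of each image with the corresponding patch from a random donor image, cutting the image-level spurious dependency:

```python
import torch

def patchmix(images: torch.Tensor, patch: int = 8) -> torch.Tensor:
    """Replace one random patch in every image with the corresponding
    patch from a randomly chosen donor image in the batch."""
    b, c, h, w = images.shape
    donors = torch.randperm(b)
    top = torch.randint(0, h - patch + 1, (1,)).item()
    left = torch.randint(0, w - patch + 1, (1,)).item()
    mixed = images.clone()
    mixed[:, :, top:top + patch, left:left + patch] = \
        images[donors, :, top:top + patch, left:left + patch]
    return mixed

batch = torch.randn(16, 3, 32, 32)   # e.g., miniImageNet-style crops
augmented = patchmix(batch)
```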
arXiv Detail & Related papers (2022-11-29T08:41:29Z) - Learning Debiased and Disentangled Representations for Semantic Segmentation [52.35766945827972]
We propose a model-agnostic training scheme for semantic segmentation.
By randomly eliminating certain class information in each training iteration, we effectively reduce feature dependencies among classes.
Models trained with our approach demonstrate strong results on multiple semantic segmentation benchmarks.
arXiv Detail & Related papers (2021-10-31T16:15:09Z) - SCARF: Self-Supervised Contrastive Learning using Random Feature Corruption [72.35532598131176]
We propose SCARF, a technique for contrastive learning, where views are formed by corrupting a random subset of features.
We show that SCARF complements existing strategies and outperforms alternatives like autoencoders.
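A minimal sketch of SCARF-style view construction: corrupt a random subset of each row's features by resampling from the feature's empirical marginal, approximated here with random rows of the same batch. The 0.6 corruption rate is an assumed hyperparameter.

```python
import torch

def scarf_view(x: torch.Tensor, rate: float = 0.6) -> torch.Tensor:
    """Corrupt a random subset of each row's features by resampling them
    from the feature's empirical marginal (random rows of the batch)."""
    b, d = x.shape
    mask = torch.rand(b, d) < rate             # which features to corrupt
    donor_rows = torch.randint(0, b, (b, d))   # random donor per cell
    donors = x[donor_rows, torch.arange(d)]    # marginal resampling
    return torch.where(mask, donors, x)

x = torch.randn(8, 5)                          # a batch of tabular rows
view1, view2 = scarf_view(x), scarf_view(x)    # pair for the contrastive loss
```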
arXiv Detail & Related papers (2021-06-29T08:08:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.