Related papers: B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability

B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability

URL: http://arxiv.org/abs/2502.12992v2
Date: Mon, 14 Jul 2025 13:11:13 GMT
Title: B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability
Authors: Yifan Wang, Sukrut Rao, Ji-Ung Lee, Mayank Jobanputra, Vera Demberg,
Abstract summary: We introduce B-cos language models (LMs) empowered for natural language processing (NLP) tasks.<n>Our approach directly transforms pre-trained language models into B-cos LMs by combining B-cos conversion and task fine-tuning.<n>Our automatic and human evaluation results demonstrate that B-cos LMs produce more faithful and human interpretable explanations than post-hoc methods.
Score: 21.480463138209483
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Post-hoc explanation methods for black-box models often struggle with faithfulness and human interpretability due to the lack of explainability in current neural architectures. Meanwhile, B-cos networks have been introduced to improve model explainability by proposing an architecture that removes bias terms and promotes input-weight alignment. Although B-cos networks have shown success in building explainable systems, their application has so far been limited to computer vision models and their associated training pipelines. In this work, we introduce B-cos LMs, i.e., B-cos language models (LMs) empowered for natural language processing (NLP) tasks. Our approach directly transforms pre-trained language models into B-cos LMs by combining B-cos conversion and task fine-tuning, improving efficiency compared to previous methods. Our automatic and human evaluation results demonstrate that B-cos LMs produce more faithful and human interpretable explanations than post-hoc methods, while maintaining task performance comparable to conventional fine-tuning. Our in-depth analysis explores how B-cos LMs differ from conventionally fine-tuned models in their learning processes and explanation patterns. Finally, we are also the first to explore the transformation of decoder-only models to B-cos LMs for generation tasks.

Related papers

Language Model Meets Prototypes: Towards Interpretable Text Classification Models through Prototypical Networks [1.1711824752079485]
dissertation focuses on developing intrinsically interpretable models when using LMs as encoders.<n>I developed a novel white-box multi-head graph attention-based prototype network.<n>I am working on extending the attention-based prototype network with contrastive learning to redesign an interpretable graph neural network.
arXiv Detail & Related papers (2024-12-04T22:59:35Z)
Can bidirectional encoder become the ultimate winner for downstream applications of foundation models? [1.8120356834558644]
Foundational models have the characteristics of pre-training, transfer learning, and self-supervised learning.<n>BERT broke through the limitation of only using one-way methods for language modeling in pre-training by using a masked language model.<n>This article analyzes one-way and bidirectional models based on GPT and BERT and compares their differences based on the purpose of the model.
arXiv Detail & Related papers (2024-11-27T03:31:14Z)
A Bayesian Approach to Data Point Selection [24.98069363998565]
Data point selection (DPS) is becoming a critical topic in deep learning. Existing approaches to DPS are predominantly based on a bi-level optimisation (BLO) formulation. We propose a novel Bayesian approach to DPS.
arXiv Detail & Related papers (2024-11-06T09:04:13Z)
B-cosification: Transforming Deep Neural Networks to be Inherently Interpretable [53.848005910548565]
'B-cosification' is a novel approach to transform existing pre-trained models to become inherently interpretable.<n>We find that B-cosification can yield models that are on par with B-cos models trained from scratch in terms of interpretability.
arXiv Detail & Related papers (2024-11-01T16:28:11Z)
Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning [5.487210426671288]
In this work, we demonstrate that the reasoning abilities of small-scale LMs can be enhanced through self-training. We also show that the conventional self-training can be further augmented by a preference learning algorithm called Direct Preference Optimization.
arXiv Detail & Related papers (2024-07-25T17:59:16Z)
Explaining Text Similarity in Transformer Models [52.571158418102584]
Recent advances in explainable AI have made it possible to mitigate limitations by leveraging improved explanations for Transformers. We use BiLRP, an extension developed for computing second-order explanations in bilinear similarity models, to investigate which feature interactions drive similarity in NLP models. Our findings contribute to a deeper understanding of different semantic similarity tasks and models, highlighting how novel explainable AI methods enable in-depth analyses and corpus-level insights.
arXiv Detail & Related papers (2024-05-10T17:11:31Z)
Overcoming Knowledge Barriers: Online Imitation Learning from Visual Observation with Pretrained World Models [8.77288940968713]
This study identifies two primary obstacles: the Embodiment Knowledge Barrier (EKB) and the Demonstration Knowledge Barrier (DKB) The EKB emerges due to the pretrained models' limitations in handling novel observations, which leads to inaccurate action inference. The DKB stems from the reliance on limited demonstration datasets, restricting the model's adaptability across diverse scenarios. We propose separate solutions to overcome each barrier and apply them to Action Inference by Maximising Evidence (AIME)
arXiv Detail & Related papers (2024-04-29T17:33:52Z)
Black-Box Tuning of Vision-Language Models with Effective Gradient Approximation [71.21346469382821]
We introduce collaborative black-box tuning (CBBT) for both textual prompt optimization and output feature adaptation for black-box models. CBBT is extensively evaluated on eleven downstream benchmarks and achieves remarkable improvements compared to existing black-box VL adaptation methods.
arXiv Detail & Related papers (2023-12-26T06:31:28Z)
An Emulator for Fine-Tuning Large Language Models using Small Language Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales. We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training. Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
arXiv Detail & Related papers (2023-10-19T17:57:16Z)
On Conditional and Compositional Language Model Differentiable Prompting [75.76546041094436]
Prompts have been shown to be an effective method to adapt a frozen Pretrained Language Model (PLM) to perform well on downstream tasks. We propose a new model, Prompt Production System (PRopS), which learns to transform task instructions or input metadata, into continuous prompts.
arXiv Detail & Related papers (2023-07-04T02:47:42Z)
Concept-aware Training Improves In-context Learning Ability of Language Models [0.0]
Many recent language models (LMs) of Transformers family exhibit so-called in-context learning (ICL) ability. We propose a method to create LMs able to better utilize the in-context information. We measure that data sampling of Concept-aware Training consistently improves models' reasoning ability.
arXiv Detail & Related papers (2023-05-23T07:44:52Z)
LAP: An Attention-Based Module for Concept Based Self-Interpretation and Knowledge Injection in Convolutional Neural Networks [2.8948274245812327]
We propose a new attention-based pooling layer, called Local Attention Pooling (LAP), that accomplishes self-interpretability. LAP is easily pluggable into any convolutional neural network, even the already trained ones. LAP offers more valid human-understandable and faithful-to-the-model interpretations than the commonly used white-box explainer methods.
arXiv Detail & Related papers (2022-01-27T21:10:20Z)
Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews. We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)
Interactively Generating Explanations for Transformer Language Models [14.306470205426526]
Transformer language models are state-of-the-art in a multitude of NLP tasks. Recent methods aim to provide interpretability and explainability to black-box models. We emphasize using prototype networks directly incorporated into the model architecture.
arXiv Detail & Related papers (2021-09-02T11:34:29Z)
Joint Energy-based Model Training for Better Calibrated Natural Language Understanding Models [61.768082640087]
We explore joint energy-based model (EBM) training during the finetuning of pretrained text encoders for natural language understanding tasks. Experiments show that EBM training can help the model reach a better calibration that is competitive to strong baselines.
arXiv Detail & Related papers (2021-01-18T01:41:31Z)
Leap-Of-Thought: Teaching Pre-Trained Models to Systematically Reason Over Implicit Knowledge [96.92252296244233]
Large pre-trained language models (LMs) acquire some reasoning capacity, but this ability is difficult to control. We show that LMs can be trained to reliably perform systematic reasoning combining both implicit, pre-trained knowledge and explicit natural language statements. Our work paves a path towards open-domain systems that constantly improve by interacting with users who can instantly correct a model by adding simple natural language statements.
arXiv Detail & Related papers (2020-06-11T17:02:20Z)
oLMpics -- On what Language Model Pre-training Captures [84.60594612120173]
We propose eight reasoning tasks, which require operations such as comparison, conjunction, and composition. A fundamental challenge is to understand whether the performance of a LM on a task should be attributed to the pre-trained representations or to the process of fine-tuning on the task data.
arXiv Detail & Related papers (2019-12-31T12:11:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.