Explaining Hate Speech Classification with Model Agnostic Methods
- URL: http://arxiv.org/abs/2306.00021v1
- Date: Tue, 30 May 2023 19:52:56 GMT
- Title: Explaining Hate Speech Classification with Model Agnostic Methods
- Authors: Durgesh Nandini and Ute Schmid
- Abstract summary: The research goal of this paper is to bridge the gap between hate speech prediction and the explanations generated by the system to support its decision.
This has been achieved by first predicting the classification of a text and then providing a posthoc, model agnostic and surrogate interpretability approach.
- Score: 0.9990687944474738
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There have been remarkable breakthroughs in Machine Learning and Artificial
Intelligence, notably in the areas of Natural Language Processing and Deep
Learning. Additionally, hate speech detection in dialogues has been gaining
popularity among Natural Language Processing researchers with the increased use
of social media. However, as evidenced by the recent trends, the need for the
dimensions of explainability and interpretability in AI models has been deeply
realised. Taking note of the factors above, the research goal of this paper is
to bridge the gap between hate speech prediction and the explanations generated
by the system to support its decision. This has been achieved by first
predicting the classification of a text and then providing a posthoc, model
agnostic and surrogate interpretability approach for explainability and to
prevent model bias. The bidirectional transformer model BERT has been used for
prediction because of its state of the art efficiency over other Machine
Learning models. The model agnostic algorithm LIME generates explanations for
the output of a trained classifier and predicts the features that influence the
model decision. The predictions generated from the model were evaluated
manually, and after thorough evaluation, we observed that the model performs
efficiently in predicting and explaining its prediction. Lastly, we suggest
further directions for the expansion of the provided research work.
Related papers
- Counterfactuals As a Means for Evaluating Faithfulness of Attribution Methods in Autoregressive Language Models [6.394084132117747]
We propose a technique that leverages counterfactual generation to evaluate the faithfulness of attribution methods for autoregressive language models.
Our technique generates fluent, in-distribution counterfactuals, making the evaluation protocol more reliable.
arXiv Detail & Related papers (2024-08-21T00:17:59Z) - Explaining Text Similarity in Transformer Models [52.571158418102584]
Recent advances in explainable AI have made it possible to mitigate limitations by leveraging improved explanations for Transformers.
We use BiLRP, an extension developed for computing second-order explanations in bilinear similarity models, to investigate which feature interactions drive similarity in NLP models.
Our findings contribute to a deeper understanding of different semantic similarity tasks and models, highlighting how novel explainable AI methods enable in-depth analyses and corpus-level insights.
arXiv Detail & Related papers (2024-05-10T17:11:31Z) - Explainability for Large Language Models: A Survey [59.67574757137078]
Large language models (LLMs) have demonstrated impressive capabilities in natural language processing.
This paper introduces a taxonomy of explainability techniques and provides a structured overview of methods for explaining Transformer-based language models.
arXiv Detail & Related papers (2023-09-02T22:14:26Z) - Rationalizing Predictions by Adversarial Information Calibration [65.19407304154177]
We train two models jointly: one is a typical neural model that solves the task at hand in an accurate but black-box manner, and the other is a selector-predictor model that additionally produces a rationale for its prediction.
We use an adversarial technique to calibrate the information extracted by the two models such that the difference between them is an indicator of the missed or over-selected features.
arXiv Detail & Related papers (2023-01-15T03:13:09Z) - Pathologies of Pre-trained Language Models in Few-shot Fine-tuning [50.3686606679048]
We show that pre-trained language models with few examples show strong prediction bias across labels.
Although few-shot fine-tuning can mitigate the prediction bias, our analysis shows models gain performance improvement by capturing non-task-related features.
These observations alert that pursuing model performance with fewer examples may incur pathological prediction behavior.
arXiv Detail & Related papers (2022-04-17T15:55:18Z) - Explain, Edit, and Understand: Rethinking User Study Design for
Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z) - Tree-based local explanations of machine learning model predictions,
AraucanaXAI [2.9660372210786563]
A tradeoff between performance and intelligibility is often to be faced, especially in high-stakes applications like medicine.
We propose a novel methodological approach for generating explanations of the predictions of a generic ML model.
arXiv Detail & Related papers (2021-10-15T17:39:19Z) - On the Lack of Robust Interpretability of Neural Text Classifiers [14.685352584216757]
We assess the robustness of interpretations of neural text classifiers based on pretrained Transformer encoders.
Both tests show surprising deviations from expected behavior, raising questions about the extent of insights that practitioners may draw from interpretations.
arXiv Detail & Related papers (2021-06-08T18:31:02Z) - Beyond Trivial Counterfactual Explanations with Diverse Valuable
Explanations [64.85696493596821]
In computer vision applications, generative counterfactual methods indicate how to perturb a model's input to change its prediction.
We propose a counterfactual method that learns a perturbation in a disentangled latent space that is constrained using a diversity-enforcing loss.
Our model improves the success rate of producing high-quality valuable explanations when compared to previous state-of-the-art methods.
arXiv Detail & Related papers (2021-03-18T12:57:34Z) - Explain and Predict, and then Predict Again [6.865156063241553]
We propose ExPred, that uses multi-task learning in the explanation generation phase effectively trading-off explanation and prediction losses.
We conduct an extensive evaluation of our approach on three diverse language datasets.
arXiv Detail & Related papers (2021-01-11T19:36:52Z) - CausaLM: Causal Model Explanation Through Counterfactual Language Models [33.29636213961804]
CausaLM is a framework for producing causal model explanations using counterfactual language representation models.
We show that language representation models such as BERT can effectively learn a counterfactual representation for a given concept of interest.
A byproduct of our method is a language representation model that is unaffected by the tested concept.
arXiv Detail & Related papers (2020-05-27T15:06:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.