NLX-GPT: A Model for Natural Language Explanations in Vision and
Vision-Language Tasks
- URL: http://arxiv.org/abs/2203.05081v1
- Date: Wed, 9 Mar 2022 22:57:15 GMT
- Title: NLX-GPT: A Model for Natural Language Explanations in Vision and
Vision-Language Tasks
- Authors: Fawaz Sammani, Tanmoy Mukherjee, Nikos Deligiannis
- Abstract summary: Natural language explanation (NLE) models aim at explaining the decision-making process of a black box system.
We introduce NLX-GPT, a general, compact and faithful language model that can simultaneously predict an answer and explain it.
- We then address the problem of evaluating explanations, which are often generic, data-biased, and can come in several forms.
- Score: 18.13793282306575
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Natural language explanation (NLE) models aim at explaining the
decision-making process of a black box system via generating natural language
sentences which are human-friendly, high-level and fine-grained. Current NLE
models explain the decision-making process of a vision or vision-language model
(a.k.a., task model), e.g., a VQA model, via a language model (a.k.a.,
explanation model), e.g., GPT. Besides the additional memory resources and
inference time required by the task model, the task and explanation models are
completely independent, which disassociates the explanation from the reasoning
process used to predict the answer. We introduce NLX-GPT, a general, compact
and faithful language model that can simultaneously predict an answer and
explain it. We first conduct pre-training on large scale data of image-caption
pairs for general understanding of images, and then formulate the answer as a
text prediction task along with the explanation. Without region proposals or a
task model, our resulting overall framework attains better evaluation scores,
contains far fewer parameters, and is 15$\times$ faster than the current
state-of-the-art model. We then address the problem of evaluating explanations,
which are often generic, data-biased, and can come in several forms. We
therefore design two new evaluation measures: (1) explain-predict and (2) retrieval-based
attack, a self-evaluation framework that requires no labels. Code is at:
https://github.com/fawazsammani/nlxgpt.
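The key idea above, predicting the answer and its explanation as a single text sequence with one GPT-style decoder, can be illustrated with a minimal sketch. This is not the authors' implementation: the target template is a hypothetical example, and the visual encoder that conditions the decoder on image features is omitted; only the joint answer-plus-explanation language-modelling objective is shown.

```python
# Minimal sketch (not the NLX-GPT code): a single causal decoder is trained to
# generate the answer and its explanation as one text sequence, so the answer
# and its explanation come from the same reasoning pass. The real model also
# conditions on image features from a visual encoder, which is omitted here.
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

question = "what is the man doing?"
answer = "surfing"
explanation = "he is riding a wave on a surfboard"

# Hypothetical target template: the answer is formulated as text prediction,
# followed directly by the explanation.
target = f"question: {question} answer: {answer} because {explanation}"

inputs = tokenizer(target, return_tensors="pt")
# Standard causal language-modelling loss; labels are shifted inside the model.
outputs = model(**inputs, labels=inputs["input_ids"])
print(float(outputs.loss))
```

At inference time, the same decoder would be prompted with the question and left to generate the answer followed by its explanation in one pass, which is what ties the explanation to the reasoning that produced the answer.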
Related papers
- GOFA: A Generative One-For-All Model for Joint Graph Language Modeling [38.267339613261996]
We propose GOFA, a novel generative graph language model, for joint graph language modeling.
GOFA is pre-trained on newly proposed graph-level next-word prediction, question-answering, and structural tasks.
The model is evaluated on various downstream tasks, demonstrating a strong ability to solve structural and contextual problems in zero-shot scenarios.
arXiv Detail & Related papers (2024-07-12T22:23:51Z)
- Zero-shot Translation of Attention Patterns in VQA Models to Natural Language [65.94419474119162]
ZS-A2T is a framework that translates the transformer attention of a given model into natural language without requiring any training.
We consider this in the context of Visual Question Answering (VQA).
Our framework does not require any training and allows the drop-in replacement of different guiding sources.
arXiv Detail & Related papers (2023-11-08T22:18:53Z)
- Zero-shot Visual Question Answering with Language Model Feedback [83.65140324876536]
We propose LAMOC, a language-model-guided captioning approach for knowledge-based visual question answering (VQA).
Our approach uses the captions generated by a captioning model as context for an answer prediction model, which is a pre-trained language model (PLM).
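The caption-as-context pipeline summarized above can be outlined roughly as follows. The checkpoints and the prompt format are placeholders chosen for illustration, not LAMOC's actual configuration, and the feedback from the answer-prediction model that guides the captioning model is not reflected in this outline.

```python
# Rough sketch of a caption-as-context VQA pipeline in the spirit of LAMOC
# (not the authors' code): a captioning model describes the image, and a
# pre-trained language model answers the question given that caption.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
answerer = pipeline("text2text-generation", model="google/flan-t5-base")

def answer_question(image_path: str, question: str) -> str:
    # The generated caption stands in for direct visual access by the PLM.
    caption = captioner(image_path)[0]["generated_text"]
    prompt = f"Context: {caption}\nQuestion: {question}\nAnswer:"
    return answerer(prompt, max_new_tokens=16)[0]["generated_text"]

# Usage (the image path is a placeholder):
# answer_question("example.jpg", "What sport is being played?")
```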
arXiv Detail & Related papers (2023-05-26T15:04:20Z)
- A Unified Understanding of Deep NLP Models for Text Classification [88.35418976241057]
We have developed a visual analysis tool, DeepNLPVis, to enable a unified understanding of NLP models for text classification.
The key idea is a mutual-information-based measure, which quantitatively explains how each layer of a model maintains the information of the input words in a sample.
A multi-level visualization, which consists of a corpus-level, a sample-level, and a word-level visualization, supports the analysis from the overall training set to individual samples.
arXiv Detail & Related papers (2022-06-19T08:55:07Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality (CLIP image representations and the scaling of language models) do not consistently improve multimodal self-rationalization for tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- e-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks [52.918087305406296]
We introduce e-ViL, a benchmark for evaluating explainable vision-language tasks.
We also introduce e-SNLI-VE, the largest existing dataset with NLEs.
We propose a new model that combines UNITER, which learns joint embeddings of images and text, and GPT-2, a pre-trained language model.
arXiv Detail & Related papers (2021-05-08T18:46:33Z)
- Self-Explaining Structures Improve NLP Models [25.292847674586614]
We propose a simple yet general and effective self-explaining framework for deep learning models in NLP.
We show that interpretability does not come at the cost of performance: a neural model with self-explaining features achieves better performance than its counterpart without them.
arXiv Detail & Related papers (2020-12-03T09:32:05Z)
- LRTA: A Transparent Neural-Symbolic Reasoning Framework with Modular Supervision for Visual Question Answering [4.602329567377897]
We propose a transparent neural-symbolic reasoning framework for visual question answering.
It solves the problem step by step, as humans do, and provides a human-readable justification at each step.
Our experiments on the GQA dataset show that LRTA outperforms the state-of-the-art model by a large margin.
arXiv Detail & Related papers (2020-11-21T06:39:42Z)
- Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language? [86.60613602337246]
We introduce a leakage-adjusted simulatability (LAS) metric for evaluating NL explanations.
LAS measures how well explanations help an observer predict a model's output, while controlling for how explanations can directly leak the output.
We frame explanation generation as a multi-agent game and optimize explanations for simulatability while penalizing label leakage.
arXiv Detail & Related papers (2020-10-08T16:59:07Z)
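As a rough illustration of the LAS idea, the score can be read as the simulator's accuracy gain from seeing the explanation, macro-averaged over explanations that do and do not leak the model's output. The sketch below is a simplified, hypothetical computation with made-up field names, not the metric's reference implementation.

```python
# Simplified sketch of a leakage-adjusted simulatability (LAS) style score:
# how much an explanation helps a simulator predict the model's output,
# averaged separately over leaking and non-leaking explanations.
from statistics import mean

def las_score(records):
    """records: list of dicts with boolean fields
       'sim_correct_with_expl' - simulator matches the model given input + explanation
       'sim_correct_without'   - simulator matches the model given the input only
       'leaks'                 - the explanation alone reveals the model's output
    """
    groups = {True: [], False: []}
    for r in records:
        gain = int(r["sim_correct_with_expl"]) - int(r["sim_correct_without"])
        groups[r["leaks"]].append(gain)
    # Macro-average over the leaking and non-leaking bins so explanations that
    # simply restate the answer cannot inflate the score.
    subset_means = [mean(g) for g in groups.values() if g]
    return mean(subset_means)

# Toy usage with hypothetical simulator outcomes.
records = [
    {"sim_correct_with_expl": True, "sim_correct_without": False, "leaks": False},
    {"sim_correct_with_expl": True, "sim_correct_without": True,  "leaks": True},
]
print(las_score(records))  # 0.5
```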