NLX-GPT: A Model for Natural Language Explanations in Vision and
Vision-Language Tasks
- URL: http://arxiv.org/abs/2203.05081v1
- Date: Wed, 9 Mar 2022 22:57:15 GMT
- Title: NLX-GPT: A Model for Natural Language Explanations in Vision and
Vision-Language Tasks
- Authors: Fawaz Sammani, Tanmoy Mukherjee, Nikos Deligiannis
- Abstract summary: Natural language explanation (NLE) models aim at explaining the decision-making process of a black box system.
We introduce NLX-GPT, a general, compact and faithful language model that can simultaneously predict an answer and explain it.
- We then address the problem of evaluating explanations, which are often generic, data-biased, and can come in several forms.
- Score: 18.13793282306575
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Natural language explanation (NLE) models aim at explaining the
decision-making process of a black box system via generating natural language
sentences which are human-friendly, high-level and fine-grained. Current NLE
models explain the decision-making process of a vision or vision-language model
(a.k.a., task model), e.g., a VQA model, via a language model (a.k.a.,
explanation model), e.g., GPT. Besides the additional memory resources and
inference time required by the task model, the task and explanation models are
completely independent, which disassociates the explanation from the reasoning
process used to predict the answer. We introduce NLX-GPT, a general, compact
and faithful language model that can simultaneously predict an answer and
explain it. We first conduct pre-training on large scale data of image-caption
pairs for general understanding of images, and then formulate the answer as a
text prediction task along with the explanation. Without region proposals or a
task model, our resulting overall framework attains better evaluation scores,
contains far fewer parameters, and is 15$\times$ faster than the current
state-of-the-art model. We then address the problem of evaluating explanations,
which are often generic, data-biased, and can come in several forms. We
therefore design two new evaluation measures: (1) explain-predict and (2) retrieval-based
attack, a self-evaluation framework that requires no labels. Code is at:
https://github.com/fawazsammani/nlxgpt.
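The key idea above, predicting the answer and its explanation as a single text sequence with one GPT-style decoder, can be illustrated with a minimal sketch. This is not the authors' implementation: the target template is a hypothetical example, and the visual encoder that conditions the decoder on image features is omitted; only the joint answer-plus-explanation language-modelling objective is shown.

```python
# Minimal sketch (not the NLX-GPT code): a single causal decoder is trained to
# generate the answer and its explanation as one text sequence, so the answer
# and its explanation come from the same reasoning pass. The real model also
# conditions on image features from a visual encoder, which is omitted here.
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

question = "what is the man doing?"
answer = "surfing"
explanation = "he is riding a wave on a surfboard"

# Hypothetical target template: the answer is formulated as text prediction,
# followed directly by the explanation.
target = f"question: {question} answer: {answer} because {explanation}"

inputs = tokenizer(target, return_tensors="pt")
# Standard causal language-modelling loss; labels are shifted inside the model.
outputs = model(**inputs, labels=inputs["input_ids"])
print(float(outputs.loss))
```

At inference time, the same decoder would be prompted with the question and left to generate the answer followed by its explanation in one pass, which is what ties the explanation to the reasoning that produced the answer.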
Related papers
- GOFA: A Generative One-For-All Model for Joint Graph Language Modeling [38.267339613261996]
We propose GOFA, a novel generative graph language model, for joint graph language modeling.
GOFA is pre-trained on newly proposed graph-level next-word prediction, question-answering, and structural tasks.
The model is evaluated on various downstream tasks, demonstrating a strong ability to solve structural and contextual problems in zero-shot scenarios.
arXiv Detail & Related papers (2024-07-12T22:23:51Z)
- Zero-shot Translation of Attention Patterns in VQA Models to Natural Language [65.94419474119162]
ZS-A2T is a framework that translates the transformer attention of a given model into natural language without requiring any training.
We consider this in the context of Visual Question Answering (VQA).
Our framework does not require any training and allows the drop-in replacement of different guiding sources.
arXiv Detail & Related papers (2023-11-08T22:18:53Z)
- Zero-shot Visual Question Answering with Language Model Feedback [83.65140324876536]
We propose LAMOC, a language-model-guided captioning approach for knowledge-based visual question answering (VQA).
Our approach uses the captions generated by a captioning model as context for an answer prediction model, which is a pre-trained language model (PLM).
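The caption-as-context pipeline summarized above can be outlined roughly as follows. The checkpoints and the prompt format are placeholders chosen for illustration, not LAMOC's actual configuration, and the feedback from the answer-prediction model that guides the captioning model is not reflected in this outline.

```python
# Rough sketch of a caption-as-context VQA pipeline in the spirit of LAMOC
# (not the authors' code): a captioning model describes the image, and a
# pre-trained language model answers the question given that caption.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
answerer = pipeline("text2text-generation", model="google/flan-t5-base")

def answer_question(image_path: str, question: str) -> str:
    # The generated caption stands in for direct visual access by the PLM.
    caption = captioner(image_path)[0]["generated_text"]
    prompt = f"Context: {caption}\nQuestion: {question}\nAnswer:"
    return answerer(prompt, max_new_tokens=16)[0]["generated_text"]

# Usage (the image path is a placeholder):
# answer_question("example.jpg", "What sport is being played?")
```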
arXiv Detail & Related papers (2023-05-26T15:04:20Z)
- A Unified Understanding of Deep NLP Models for Text Classification [88.35418976241057]
We have developed a visual analysis tool, DeepNLPVis, to enable a unified understanding of NLP models for text classification.
The key idea is a mutual-information-based measure, which quantitatively explains how each layer of a model maintains the information of the input words in a sample.
A multi-level visualization, which consists of a corpus-level, a sample-level, and a word-level visualization, supports the analysis from the overall training set to individual samples.
arXiv Detail & Related papers (2022-06-19T08:55:07Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality (CLIP image representations and the scaling of language models) do not consistently improve multimodal self-rationalization for tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- e-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks [52.918087305406296]
We introduce e-ViL, a benchmark for evaluating explainable vision-language tasks.
We also introduce e-SNLI-VE, the largest existing dataset with NLEs.
We propose a new model that combines UNITER, which learns joint embeddings of images and text, and GPT-2, a pre-trained language model.
arXiv Detail & Related papers (2021-05-08T18:46:33Z)
- Self-Explaining Structures Improve NLP Models [25.292847674586614]
We propose a simple yet general and effective self-explaining framework for deep learning models in NLP.
We show that interpretability does not come at the cost of performance: a neural model with self-explaining features achieves better performance than its counterpart without them.
arXiv Detail & Related papers (2020-12-03T09:32:05Z)
- LRTA: A Transparent Neural-Symbolic Reasoning Framework with Modular Supervision for Visual Question Answering [4.602329567377897]
We propose a transparent neural-symbolic reasoning framework for visual question answering.
It solves the problem step by step, as humans do, and provides a human-readable justification at each step.
Our experiments on the GQA dataset show that LRTA outperforms the state-of-the-art model by a large margin.
arXiv Detail & Related papers (2020-11-21T06:39:42Z)
- Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language? [86.60613602337246]
We introduce a leakage-adjusted simulatability (LAS) metric for evaluating NL explanations.
LAS measures how well explanations help an observer predict a model's output, while controlling for how explanations can directly leak the output.
We frame explanation generation as a multi-agent game and optimize explanations for simulatability while penalizing label leakage.
arXiv Detail & Related papers (2020-10-08T16:59:07Z)
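As a rough illustration of the LAS idea, the score can be read as the simulator's accuracy gain from seeing the explanation, macro-averaged over explanations that do and do not leak the model's output. The sketch below is a simplified, hypothetical computation with made-up field names, not the metric's reference implementation.

```python
# Simplified sketch of a leakage-adjusted simulatability (LAS) style score:
# how much an explanation helps a simulator predict the model's output,
# averaged separately over leaking and non-leaking explanations.
from statistics import mean

def las_score(records):
    """records: list of dicts with boolean fields
       'sim_correct_with_expl' - simulator matches the model given input + explanation
       'sim_correct_without'   - simulator matches the model given the input only
       'leaks'                 - the explanation alone reveals the model's output
    """
    groups = {True: [], False: []}
    for r in records:
        gain = int(r["sim_correct_with_expl"]) - int(r["sim_correct_without"])
        groups[r["leaks"]].append(gain)
    # Macro-average over the leaking and non-leaking bins so explanations that
    # simply restate the answer cannot inflate the score.
    subset_means = [mean(g) for g in groups.values() if g]
    return mean(subset_means)

# Toy usage with hypothetical simulator outcomes.
records = [
    {"sim_correct_with_expl": True, "sim_correct_without": False, "leaks": False},
    {"sim_correct_with_expl": True, "sim_correct_without": True,  "leaks": True},
]
print(las_score(records))  # 0.5
```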