e-ViL: A Dataset and Benchmark for Natural Language Explanations in
Vision-Language Tasks
- URL: http://arxiv.org/abs/2105.03761v1
- Date: Sat, 8 May 2021 18:46:33 GMT
- Title: e-ViL: A Dataset and Benchmark for Natural Language Explanations in
Vision-Language Tasks
- Authors: Maxime Kayser, Oana-Maria Camburu, Leonard Salewski, Cornelius Emde,
Virginie Do, Zeynep Akata, Thomas Lukasiewicz
- Abstract summary: We introduce e-ViL, a benchmark for evaluating explainable vision-language tasks.
We also introduce e-SNLI-VE, the largest existing VL dataset with NLEs.
We propose a new model that combines UNITER, which learns joint embeddings of images and text, and GPT-2, a pre-trained language model.
- Score: 52.918087305406296
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, an increasing number of works have introduced models capable of
generating natural language explanations (NLEs) for their predictions on
vision-language (VL) tasks. Such models are appealing because they can provide
human-friendly and comprehensive explanations. However, there is still a lack
of unified evaluation approaches for the explanations generated by these
models. Moreover, there are currently only a few datasets of NLEs for VL tasks.
In this work, we introduce e-ViL, a benchmark for explainable vision-language
tasks that establishes a unified evaluation framework and provides the first
comprehensive comparison of existing approaches that generate NLEs for VL
tasks. e-ViL spans four models and three datasets. Both automatic metrics and
human evaluation are used to assess model-generated explanations. We also
introduce e-SNLI-VE, the largest existing VL dataset with NLEs (over 430k
instances). Finally, we propose a new model that combines UNITER, which learns
joint embeddings of images and text, and GPT-2, a pre-trained language model
that is well-suited for text generation. It surpasses the previous
state-of-the-art by a large margin across all datasets.
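As a rough illustration of how such a combination could work, the sketch below conditions GPT-2 on joint image-text embeddings through a learned projection. It is not the paper's implementation: the placeholder encoder output, the projection layer, the "because" prompt, and all hyperparameters are assumptions, with the Hugging Face transformers library standing in for the actual training setup.
```python
# Minimal sketch (not the paper's model): condition GPT-2 on joint
# image-text embeddings from a UNITER-style encoder to generate an NLE.
# The encoder output is faked below; projection and prompt are assumptions.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class ExplainableVLModel(nn.Module):
    def __init__(self, vl_hidden=768, gpt2_name="gpt2"):
        super().__init__()
        self.tokenizer = GPT2Tokenizer.from_pretrained(gpt2_name)
        self.decoder = GPT2LMHeadModel.from_pretrained(gpt2_name)
        # Map the vision-language encoder's embeddings into GPT-2's space.
        self.project = nn.Linear(vl_hidden, self.decoder.config.n_embd)

    def generate_explanation(self, joint_embeds, prompt="because", max_new_tokens=30):
        # joint_embeds: (batch, seq, vl_hidden) output of a VL encoder such as UNITER.
        prefix = self.project(joint_embeds)                       # (B, S, n_embd)
        prompt_ids = self.tokenizer(prompt, return_tensors="pt").input_ids
        prompt_embeds = self.decoder.transformer.wte(prompt_ids)  # (1, P, n_embd)
        prompt_embeds = prompt_embeds.expand(prefix.size(0), -1, -1)
        inputs_embeds = torch.cat([prefix, prompt_embeds], dim=1)
        out_ids = self.decoder.generate(
            inputs_embeds=inputs_embeds,
            max_new_tokens=max_new_tokens,
            pad_token_id=self.tokenizer.eos_token_id,
        )
        return self.tokenizer.batch_decode(out_ids, skip_special_tokens=True)

# Usage with random features standing in for the encoder output:
model = ExplainableVLModel()
fake_joint = torch.randn(1, 20, 768)
print(model.generate_explanation(fake_joint))
```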
Related papers
- Automated Data Visualization from Natural Language via Large Language Models: An Exploratory Study [41.84915013818794]
The Natural Language to Visualization (NL2Vis) task aims to transform natural-language descriptions into visual representations for a grounded table.
Many deep learning-based approaches have been developed for NL2Vis, but challenges persist in visualizing data sourced from unseen databases or spanning multiple tables.
Taking inspiration from the remarkable generation capabilities of Large Language Models (LLMs), this paper conducts an empirical study to evaluate their potential in generating visualizations.
arXiv Detail & Related papers (2024-04-26T03:25:35Z) - An Open Dataset and Model for Language Identification [84.15194457400253]
We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages.
We make both the model and the dataset available to the research community.
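For reference, the macro-averaged F1 reported above weights every language equally: compute F1 per language, then take the unweighted mean. A minimal sketch with scikit-learn on toy labels (assumed, for illustration only; not the paper's evaluation code):
```python
# Toy illustration of macro-averaged F1 and per-class false positive rate
# for a language-identification task.
from sklearn.metrics import f1_score, confusion_matrix

y_true = ["eng", "deu", "fra", "eng", "fra", "deu"]  # assumed toy labels
y_pred = ["eng", "deu", "eng", "eng", "fra", "fra"]

# Macro-average F1: F1 per language, then the unweighted mean.
macro_f1 = f1_score(y_true, y_pred, average="macro")
print("macro F1:", macro_f1)

# False positive rate per language: FP / (FP + TN), from the confusion matrix.
labels = sorted(set(y_true))
cm = confusion_matrix(y_true, y_pred, labels=labels)
for i, lang in enumerate(labels):
    fp = cm[:, i].sum() - cm[i, i]
    tn = cm.sum() - cm[:, i].sum() - cm[i, :].sum() + cm[i, i]
    print(lang, "FPR:", fp / (fp + tn))
```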
arXiv Detail & Related papers (2023-05-23T08:43:42Z) - Going Beyond Nouns With Vision & Language Models Using Synthetic Data [43.87754926411406]
Large-scale pre-trained Vision & Language (VL) models have shown remarkable performance in many applications.
Recent works have uncovered a fundamental weakness of these models in understanding concepts that go beyond nouns, such as attributes and relations.
We investigate to which extent purely synthetic data could be leveraged to teach these models to overcome such shortcomings.
arXiv Detail & Related papers (2023-03-30T17:57:43Z) - Harnessing the Power of Multi-Task Pretraining for Ground-Truth Level
Natural Language Explanations [12.757277574843101]
Natural language explanations promise to offer intuitively understandable explanations of a neural network's decision process in complex vision-language tasks.
Current models offer impressive performance on task accuracy and explanation plausibility, but suffer from a range of issues.
We apply recent advances in large-scale multi-task pretraining of generative Transformer models to the problem of VL-NLE tasks.
Our approach outperforms recent models by a large margin, with human annotators preferring the generated explanations over the ground truth in two out of three evaluated datasets.
arXiv Detail & Related papers (2022-12-08T12:28:23Z) - VL-CheckList: Evaluating Pre-trained Vision-Language Models with
Objects, Attributes and Relations [28.322824790738768]
Vision-Language Pretraining models have successfully facilitated many cross-modal downstream tasks.
Most existing works evaluated their systems by comparing the fine-tuned downstream task performance.
Inspired by CheckList for testing natural language processing, we propose VL-CheckList, a novel framework for evaluating VL pretraining models on objects, attributes, and relations.
arXiv Detail & Related papers (2022-07-01T06:25:53Z) - ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented
Visual Models [102.63817106363597]
We build ELEVATER, the first benchmark to compare and evaluate pre-trained language-augmented visual models.
It consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge.
We will release our toolkit and evaluation platforms for the research community.
arXiv Detail & Related papers (2022-04-19T10:23:42Z) - Enabling Multimodal Generation on CLIP via Vision-Language Knowledge
Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
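As a generic illustration of the distillation mechanism (not the paper's VLKD objective, which aligns a vision-language model with a textual PLM), a standard soft-label distillation loss matches temperature-scaled teacher and student distributions:
```python
# Generic soft-label knowledge-distillation loss (illustration only):
# the student is trained to match the teacher's softened output distribution.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions and minimize KL(teacher || student),
    # scaled by T^2 as in standard distillation setups.
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

# Toy usage with random logits:
student = torch.randn(4, 100)
teacher = torch.randn(4, 100)
print(distillation_loss(student, teacher))
```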
arXiv Detail & Related papers (2022-03-12T09:33:37Z) - VinVL: Revisiting Visual Representations in Vision-Language Models [96.39332942534368]
We develop an improved object detection model to provide object-centric representations of images.
New visual features significantly improve the performance across all vision language (VL) tasks.
We will release the new object detection model to the public.
arXiv Detail & Related papers (2021-01-02T23:35:27Z) - Learning Contextual Representations for Semantic Parsing with
Generation-Augmented Pre-Training [86.91380874390778]
We present Generation-Augmented Pre-training (GAP), which jointly learns representations of natural language utterances and table schemas by leveraging generation models to generate pre-training data.
Based on experimental results, neural semantic parsers that leverage GAP obtain new state-of-the-art results on both the SPIDER and CRITERIA-TO-SQL benchmarks.
arXiv Detail & Related papers (2020-12-18T15:53:50Z)