Harnessing the Power of Multi-Task Pretraining for Ground-Truth Level
Natural Language Explanations
- URL: http://arxiv.org/abs/2212.04231v2
- Date: Wed, 29 Mar 2023 08:48:35 GMT
- Title: Harnessing the Power of Multi-Task Pretraining for Ground-Truth Level
Natural Language Explanations
- Authors: Björn Plüster, Jakob Ambsdorf, Lukas Braach, Jae Hee Lee, Stefan Wermter
- Abstract summary: Natural language explanations promise to offer intuitively understandable explanations of a neural network's decision process in complex vision-language tasks.
Current models offer impressive performance on task accuracy and explanation plausibility, but suffer from a range of issues.
We apply recent advances in large-scale multi-task pretraining of generative Transformer models to the problem of VL-NLE tasks.
Our approach outperforms recent models by a large margin, with human annotators preferring the generated explanations over the ground truth in two out of three evaluated datasets.
- Score: 12.757277574843101
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Natural language explanations promise to offer intuitively understandable
explanations of a neural network's decision process in complex vision-language
tasks, as pursued in recent VL-NLE models. While current models offer
impressive performance on task accuracy and explanation plausibility, they
suffer from a range of issues: Some models feature a modular design where the
explanation generation module is poorly integrated with a separate module for
task-answer prediction, employ backbone models trained on limited sets of
tasks, or incorporate ad hoc solutions to increase performance on single
datasets. We propose to evade these limitations by applying recent advances in
large-scale multi-task pretraining of generative Transformer models to the
problem of VL-NLE tasks. Our approach outperforms recent models by a large
margin, with human annotators preferring the generated explanations over the
ground truth in two out of three evaluated datasets. As a novel challenge in
VL-NLE research, we propose the problem of multi-task VL-NLE and show that
jointly training on multiple tasks can increase the explanation quality. We
discuss the ethical implications of high-quality NLE generation and other
issues in recent VL-NLE research.
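As a rough illustration of the joint answer-and-explanation generation and multi-task training described above, the sketch below formats examples from several VL-NLE datasets into a single text-to-text scheme and samples a task-mixed batch. The dataset names, field layout, "answer because explanation" template, and uniform task sampling are illustrative assumptions for this sketch, not the paper's exact preprocessing or mixing strategy.

```python
import random
from dataclasses import dataclass

# Hypothetical unified example format; the paper's actual preprocessing may differ.
@dataclass
class VLNLEExample:
    image_id: str      # reference to the image input (handled by the vision encoder)
    question: str      # task question / hypothesis
    answer: str        # ground-truth task answer
    explanation: str   # ground-truth natural language explanation
    task: str          # source dataset, e.g. "vqa-x", "e-snli-ve", "vcr"

def to_seq2seq_pair(ex: VLNLEExample) -> tuple[str, str]:
    """Format one example as an (input text, target text) pair for a
    generative vision-language Transformer; the image is passed separately."""
    source = f"[{ex.task}] question: {ex.question}"
    # Answer and explanation are generated jointly in a single output sequence,
    # so one decoder handles both (no separate task-answer prediction module).
    target = f"{ex.answer} because {ex.explanation}"
    return source, target

def sample_multitask_batch(datasets: dict[str, list[VLNLEExample]],
                           batch_size: int,
                           rng: random.Random) -> list[tuple[str, str]]:
    """Sample a batch uniformly over tasks so joint training sees every
    VL-NLE dataset in each step (one simple mixing strategy)."""
    tasks = list(datasets)
    batch = []
    for _ in range(batch_size):
        task = rng.choice(tasks)
        ex = rng.choice(datasets[task])
        batch.append(to_seq2seq_pair(ex))
    return batch

if __name__ == "__main__":
    rng = random.Random(0)
    data = {
        "vqa-x": [VLNLEExample("img_1", "What sport is shown?", "tennis",
                               "the player is holding a racket on a court", "vqa-x")],
        "e-snli-ve": [VLNLEExample("img_2", "The dog is sleeping.", "contradiction",
                                   "the dog is running across the grass", "e-snli-ve")],
    }
    for src, tgt in sample_multitask_batch(data, 4, rng):
        print(src, "->", tgt)
```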
Related papers
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, reflecting their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z) - SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning [12.450293825734313]
Large language models (LLMs) famously exhibit emergent in-context learning (ICL).
This study introduces a benchmark VL-ICL Bench for multimodal in-context learning.
We evaluate the abilities of state-of-the-art VLLMs against this benchmark suite.
arXiv Detail & Related papers (2024-03-19T21:31:56Z) - Evaluating the Capabilities of Multi-modal Reasoning Models with
Synthetic Task Data [0.0]
We leverage advances in high-resolution text-to-image generation to develop a framework for generating evaluation data for multi-modal reasoning tasks.
We apply this framework to generate context-dependent anomaly data, creating a synthetic dataset on a challenging task.
We demonstrate that while the task is tractable, the model performs significantly worse on the context-dependent anomaly detection task than on standard VQA tasks.
arXiv Detail & Related papers (2023-06-01T20:56:34Z) - Diffusion Model is an Effective Planner and Data Synthesizer for
Multi-Task Reinforcement Learning [101.66860222415512]
Multi-Task Diffusion Model (MTDiff) is a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis.
For generative planning, we find MTDiff outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D.
arXiv Detail & Related papers (2023-05-29T05:20:38Z) - Enabling Multimodal Generation on CLIP via Vision-Language Knowledge
Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z) - e-ViL: A Dataset and Benchmark for Natural Language Explanations in
Vision-Language Tasks [52.918087305406296]
We introduce e-ViL, a benchmark for evaluating natural language explanations in vision-language tasks.
We also introduce e-SNLI-VE, the largest existing dataset with NLEs.
We propose a new model that combines UNITER, which learns joint embeddings of images and text, and GPT-2, a pre-trained language model (a minimal sketch of this combination follows below).
arXiv Detail & Related papers (2021-05-08T18:46:33Z)
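The e-ViL entry above describes coupling a joint image-text encoder (UNITER) with a GPT-2 decoder for explanation generation. The snippet below is a minimal sketch of that kind of conditioning, with small toy modules standing in for the pretrained UNITER and GPT-2 components: a pooled joint embedding is projected into the decoder's embedding space and prepended as a prefix before the explanation tokens. All dimensions, module sizes, and the single-vector prefix are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class PrefixConditionedDecoder(nn.Module):
    """Toy stand-in for the UNITER + GPT-2 combination described in e-ViL:
    a joint vision-language embedding is projected into the decoder's token
    embedding space and used as a prefix that conditions explanation generation.
    Sizes are illustrative, not the paper's."""

    def __init__(self, vocab_size: int = 1000, joint_dim: int = 768,
                 d_model: int = 256, n_layers: int = 2, n_heads: int = 4):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.prefix_proj = nn.Linear(joint_dim, d_model)  # UNITER-like embedding -> decoder space
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)  # run autoregressively via a causal mask
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, joint_embedding: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # joint_embedding: (batch, joint_dim) pooled image-text representation
        # tokens:          (batch, seq_len) explanation token ids (teacher forcing)
        prefix = self.prefix_proj(joint_embedding).unsqueeze(1)   # (batch, 1, d_model)
        x = torch.cat([prefix, self.token_emb(tokens)], dim=1)    # prepend the multimodal prefix
        seq_len = x.size(1)
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        h = self.decoder(x, mask=causal)
        # Position j of the output attends to the prefix and tokens[:, :j],
        # so lm_head(h[:, :-1]) gives next-token logits aligned with `tokens`.
        return self.lm_head(h[:, :-1, :])


if __name__ == "__main__":
    model = PrefixConditionedDecoder()
    joint = torch.randn(2, 768)             # pretend UNITER output for 2 image-question pairs
    toks = torch.randint(0, 1000, (2, 12))  # pretend explanation token ids
    logits = model(joint, toks)             # (2, 12, 1000)
    loss = nn.functional.cross_entropy(logits.reshape(-1, 1000), toks.reshape(-1))
    print(logits.shape, loss.item())
```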