DOMINO: A Dual-System for Multi-step Visual Language Reasoning
- URL: http://arxiv.org/abs/2310.02804v1
- Date: Wed, 4 Oct 2023 13:29:47 GMT
- Title: DOMINO: A Dual-System for Multi-step Visual Language Reasoning
- Authors: Peifang Wang and Olga Golovneva and Armen Aghajanyan and Xiang Ren and
Muhao Chen and Asli Celikyilmaz and Maryam Fazel-Zarandi
- Abstract summary: We propose a dual-system for multi-step multimodal reasoning, which consists of a "System-1" step for visual information extraction and a "System-2" step for deliberate reasoning.
Our method with a pre-trained System-2 module performs competitively compared to prior work on in- and out-of-distribution data.
- Score: 76.69157235928594
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual language reasoning requires a system to extract text or numbers from
information-dense images like charts or plots and perform logical or arithmetic
reasoning to arrive at an answer. To tackle this task, existing work relies on
either (1) an end-to-end vision-language model trained on a large amount of
data, or (2) a two-stage pipeline where a captioning model converts the image
into text that is further read by another large language model to deduce the
answer. However, the former approach forces the model to answer a complex
question with one single step, and the latter approach is prone to inaccurate
or distracting information in the converted text that can confuse the language
model. In this work, we propose a dual-system for multi-step multimodal
reasoning, which consists of a "System-1" step for visual information
extraction and a "System-2" step for deliberate reasoning. Given an input,
System-2 breaks down the question into atomic sub-steps, each guiding System-1
to extract the information required for reasoning from the image. Experiments
on chart and plot datasets show that our method with a pre-trained System-2
module performs competitively compared to prior work on in- and
out-of-distribution data. By fine-tuning the System-2 module (LLaMA-2 70B) on
only a small amount of data on multi-step reasoning, the accuracy of our method
is further improved and surpasses the best fully-supervised end-to-end approach
by 5.7% and a pipeline approach with FlanPaLM (540B) by 7.5% on a challenging
dataset with human-authored questions.
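Below is a minimal sketch of the dual-system loop described in the abstract, assuming generic text-in/text-out and image-plus-query interfaces for System-2 and System-1; the function names, prompts, and stopping rule are illustrative stand-ins, not DOMINO's actual implementation.

    from typing import Any, Callable, List

    # Minimal sketch of the System-2 / System-1 loop: the reasoner proposes
    # atomic lookups, the visual extractor answers each one from the image,
    # and the reasoner produces the final answer from the collected facts.
    def dual_system_answer(
        question: str,
        image: Any,
        system2: Callable[[str], str],       # deliberate reasoner (an LLM)
        system1: Callable[[Any, str], str],  # visual extractor (a VQA-style model)
        max_steps: int = 8,
    ) -> str:
        facts: List[str] = []                # values read off the image so far
        for _ in range(max_steps):
            prompt = (
                f"Question: {question}\n"
                f"Facts so far: {'; '.join(facts) or 'none'}\n"
                "Reply with the next atomic lookup to perform on the image, "
                "or 'FINAL: <answer>' if the facts already suffice."
            )
            step = system2(prompt).strip()   # System-2 decides the next sub-step
            if step.upper().startswith("FINAL:"):
                return step.split(":", 1)[1].strip()
            facts.append(f"{step} -> {system1(image, step)}")  # System-1 reads the chart
        return system2(f"Question: {question}\nFacts: {'; '.join(facts)}\nAnswer:")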
Related papers
- Read and Think: An Efficient Step-wise Multimodal Language Model for Document Understanding and Reasoning [0.0]
Existing document understanding models tend to directly generate answers as a single word or phrase.
We use Multi-modal Large Language Models (MLLMs) to generate step-wise question-and-answer pairs for document images.
We then use the generated high-quality data to train a humanized document understanding and reasoning model, dubbed DocAssistant.
arXiv Detail & Related papers (2024-02-26T01:17:50Z)
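The entry above describes a data-generation recipe; the following is a rough sketch of one way such step-wise question-and-answer pairs could be collected, assuming a hypothetical MLLM callable and prompt rather than the paper's actual pipeline.

    import json
    from typing import Any, Callable, Iterable

    # Illustrative sketch: prompt an MLLM for step-wise QA pairs per document
    # image and dump well-formed generations as JSONL training data.
    STEPWISE_PROMPT = (
        "Look at this document and write a question about it, the reasoning "
        "steps, and the final answer, as JSON with keys "
        "'question', 'steps', 'answer'."
    )

    def build_training_set(
        images: Iterable[Any],
        mllm: Callable[[Any, str], str],   # (image, prompt) -> model text output
        out_path: str = "stepwise_qa.jsonl",
    ) -> None:
        with open(out_path, "w", encoding="utf-8") as f:
            for image_id, image in enumerate(images):
                raw = mllm(image, STEPWISE_PROMPT)
                try:
                    record = json.loads(raw)          # keep only parseable outputs
                except json.JSONDecodeError:
                    continue
                if not isinstance(record, dict):
                    continue
                record["image_id"] = image_id
                f.write(json.dumps(record, ensure_ascii=False) + "\n")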
- Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models [69.59125732317972]
We propose a simple yet effective Retrieving-to-Answer (R2A) framework for VideoQA.
R2A first retrieves a set of semantically similar texts from a generic text corpus using a pre-trained multi-modal model.
With both the question and the retrieved texts, an LLM can be directly used to yield the desired answer.
arXiv Detail & Related papers (2023-06-15T20:56:20Z)
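A simplified sketch of the retrieve-then-answer idea from the entry above, assuming placeholder encoders that map the video and the corpus texts into a shared embedding space and a generic LLM callable; the paper's actual models and prompt format may differ.

    import numpy as np
    from typing import Any, Callable, Sequence

    # Embed the video and corpus texts with a shared multi-modal encoder,
    # retrieve the k most similar texts by cosine similarity, then let an LLM
    # answer from the question plus the retrieved texts.
    def retrieve_to_answer(
        video: Any,
        question: str,
        corpus: Sequence[str],
        encode_video: Callable[[Any], np.ndarray],            # video -> (d,) embedding
        encode_texts: Callable[[Sequence[str]], np.ndarray],  # texts -> (n, d) embeddings
        llm: Callable[[str], str],
        k: int = 5,
    ) -> str:
        v = encode_video(video)
        t = encode_texts(corpus)
        sims = t @ v / (np.linalg.norm(t, axis=1) * np.linalg.norm(v) + 1e-8)
        retrieved = [corpus[i] for i in np.argsort(-sims)[:k]]
        prompt = ("Descriptions:\n" + "\n".join(retrieved)
                  + f"\n\nQuestion: {question}\nAnswer:")
        return llm(prompt)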
- Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation [26.933683814025475]
We introduce two novel multimodal datasets: the synthetic CLEVR-ATVC dataset (620K) and the manually pictured Fruit-ATVC dataset (50K).
These datasets incorporate both visual and text-based inputs and outputs.
To make multimodal systems accountable when rejecting human requests, similar to language-based ChatGPT conversations, we introduce specific rules as supervisory signals within the datasets.
arXiv Detail & Related papers (2023-03-10T15:35:11Z)
- DePlot: One-shot visual language reasoning by plot-to-table translation [50.28850068391312]
This paper presents the first one-shot solution to visual language reasoning.
A modality conversion module, named DePlot, translates the image of a plot or chart into a linearized table.
The output of DePlot can then be directly used to prompt a pretrained large language model.
arXiv Detail & Related papers (2022-12-20T18:20:50Z)
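A rough sketch of the two-stage pipeline from the entry above; `deplot` and `llm` are placeholder callables standing in for the plot-to-table model and a pretrained large language model, and the prompt wording is illustrative.

    from typing import Any, Callable

    # Plot-to-table pipeline: convert the chart image to a linearized table,
    # then prompt an LLM with the table and the question.
    def deplot_reason(
        chart_image: Any,
        question: str,
        deplot: Callable[[Any], str],   # image -> linearized table text
        llm: Callable[[str], str],
    ) -> str:
        table = deplot(chart_image)
        prompt = (
            "Read the following table and answer the question.\n"
            f"Table:\n{table}\n"
            f"Question: {question}\n"
            "Answer:"
        )
        return llm(prompt)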
- Deep Bidirectional Language-Knowledge Graph Pretraining [159.9645181522436]
DRAGON is a self-supervised approach to pretraining a deeply joint language-knowledge foundation model from text and KG at scale.
Our model takes pairs of text segments and relevant KG subgraphs as input and bidirectionally fuses information from both modalities.
arXiv Detail & Related papers (2022-10-17T18:02:52Z)
- NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks [18.13793282306575]
Natural language explanation (NLE) models aim at explaining the decision-making process of a black box system.
We introduce NLX-GPT, a general, compact and faithful language model that can simultaneously predict an answer and explain it.
We then address the problem of evaluating the explanations, which are often generic, data-biased, and can come in several forms.
arXiv Detail & Related papers (2022-03-09T22:57:15Z)
- XRJL-HKUST at SemEval-2021 Task 4: WordNet-Enhanced Dual Multi-head Co-Attention for Reading Comprehension of Abstract Meaning [6.55600662108243]
This paper presents our system submitted to SemEval-2021 Task 4: Reading Comprehension of Abstract Meaning.
Our system uses a large pre-trained language model as the encoder and an additional dual multi-head co-attention layer to strengthen the relationship between passages and question-answer pairs.
Our system, called WordNet-enhanced DUal Multi-head Co-Attention (WN-DUMA), achieves 86.67% and 89.99% accuracy on the official blind test set of subtask 1 and subtask 2 respectively.
arXiv Detail & Related papers (2021-03-30T06:22:58Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
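The entry above names Dynamic Blocking without describing it; as a rough illustration of the general copy-blocking idea (our simplified reading, not the paper's exact algorithm), a decoding constraint can forbid the model from reproducing source bigrams verbatim.

    from typing import Dict, List, Set

    # Whenever the decoder has just emitted a token that occurs in the source,
    # the token that immediately follows it in the source is disallowed, so
    # source bigrams cannot be copied verbatim into the paraphrase.
    def build_block_map(source_tokens: List[str]) -> Dict[str, Set[str]]:
        blocked: Dict[str, Set[str]] = {}
        for prev, nxt in zip(source_tokens, source_tokens[1:]):
            blocked.setdefault(prev, set()).add(nxt)
        return blocked

    def allowed_next(prev_token: str, vocab: Set[str],
                     blocked: Dict[str, Set[str]]) -> Set[str]:
        # Candidate tokens the decoder may emit after `prev_token`.
        return vocab - blocked.get(prev_token, set())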
- Language Models as Few-Shot Learner for Task-Oriented Dialogue Systems [74.8759568242933]
Task-oriented dialogue systems use four connected modules: Natural Language Understanding (NLU), Dialogue State Tracking (DST), Dialogue Policy (DP), and Natural Language Generation (NLG).
A research challenge is to learn each module with as few samples as possible, given the high cost of data collection.
We evaluate the few-shot priming ability of language models on the NLU, DP, and NLG tasks.
arXiv Detail & Related papers (2020-08-14T08:23:21Z)
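A minimal sketch of few-shot priming for the NLU step from the entry above, with made-up examples and label format; the paper's actual prompts, tasks, and evaluation setup differ in detail.

    from typing import Callable, List, Tuple

    # A handful of labeled examples are concatenated in front of the new
    # utterance and the language model completes the pattern.
    EXAMPLES: List[Tuple[str, str]] = [
        ("book a table for two at 7pm",
         "intent=book_restaurant; slots=people:2, time:7pm"),
        ("play some jazz music",
         "intent=play_music; slots=genre:jazz"),
    ]

    def nlu_prompt(utterance: str) -> str:
        shots = "\n".join(f"Utterance: {u}\nLabels: {y}" for u, y in EXAMPLES)
        return f"{shots}\nUtterance: {utterance}\nLabels:"

    def parse_with_lm(utterance: str, lm: Callable[[str], str]) -> str:
        return lm(nlu_prompt(utterance)).strip()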