DePlot: One-shot visual language reasoning by plot-to-table translation
- URL: http://arxiv.org/abs/2212.10505v2
- Date: Tue, 23 May 2023 18:28:39 GMT
- Title: DePlot: One-shot visual language reasoning by plot-to-table translation
- Authors: Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine
Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier,
Yasemin Altun
- Abstract summary: This paper presents the first one-shot solution to visual language reasoning.
A modality conversion module, named DePlot, translates the image of a plot or chart into a linearized table.
The output of DePlot can then be directly used to prompt a pretrained large language model.
- Score: 50.28850068391312
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual language such as charts and plots is ubiquitous in the human world.
Comprehending plots and charts requires strong reasoning skills. Prior
state-of-the-art (SOTA) models require at least tens of thousands of training
examples, and their reasoning capabilities remain limited, especially on
complex human-written queries. This paper presents the first one-shot solution
to visual language reasoning. We decompose the challenge of visual language
reasoning into two steps: (1) plot-to-text translation, and (2) reasoning over
the translated text. The key to this method is a modality conversion module,
named DePlot, which translates the image of a plot or chart into a linearized
table. The output of DePlot can then be directly used to prompt a pretrained
large language model (LLM), exploiting the few-shot reasoning capabilities of
LLMs. To obtain DePlot, we standardize the plot-to-table task by establishing
unified task formats and metrics, and train DePlot end-to-end on this task.
DePlot can then be used off-the-shelf together with LLMs in a plug-and-play
fashion. Compared with a SOTA model finetuned on more than 28k data points,
DePlot+LLM with just one-shot prompting achieves a 24.0% improvement over
finetuned SOTA on human-written queries from the task of chart QA.
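To make the plug-and-play usage concrete, below is a minimal sketch of the two-step pipeline, assuming the publicly released google/deplot checkpoint and the Pix2Struct API in Hugging Face transformers; the image path, the question, and the query_llm helper are hypothetical placeholders, and the paper's actual prompts prepend a full one-shot exemplar with intermediate reasoning steps rather than a bare instruction.

```python
# Minimal DePlot + LLM pipeline sketch: plot-to-table, then prompting an LLM.
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

# Step 1: plot-to-table translation with DePlot.
# Assumes the released "google/deplot" checkpoint on the Hugging Face Hub.
processor = Pix2StructProcessor.from_pretrained("google/deplot")
model = Pix2StructForConditionalGeneration.from_pretrained("google/deplot")

image = Image.open("chart.png")  # placeholder path to a chart or plot image
inputs = processor(
    images=image,
    text="Generate underlying data table of the figure below:",  # header prompt
    return_tensors="pt",
)
table_ids = model.generate(**inputs, max_new_tokens=512)
linearized_table = processor.decode(table_ids[0], skip_special_tokens=True)

# Step 2: prompt a pretrained LLM with the linearized table plus the question.
# In the paper, a one-shot exemplar (table, question, reasoning, answer) is
# prepended; here a bare instruction stands in for brevity.
question = "Which category grew the most between 2020 and 2021?"  # example query
prompt = (
    "Read the table below and answer the question.\n\n"
    f"{linearized_table}\n\n"
    f"Question: {question}\nAnswer:"
)
# answer = query_llm(prompt)  # query_llm is a hypothetical LLM client
```

Because the intermediate table is plain text, any sufficiently capable LLM can be swapped into the second step without retraining DePlot.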
Related papers
- On Pre-training of Multimodal Language Models Customized for Chart Understanding [83.99377088129282]
This paper explores the training processes necessary to improve MLLMs' comprehension of charts.
We introduce CHOPINLLM, an MLLM tailored for in-depth chart comprehension.
arXiv Detail & Related papers (2024-07-19T17:58:36Z)
- SIMPLOT: Enhancing Chart Question Answering by Distilling Essentials [15.522722875552892]
We introduce SIMPLOT, a method designed to extract only the elements necessary for chart reasoning.
Our model enables accurate chart reasoning without the need for additional annotations or datasets.
arXiv Detail & Related papers (2024-02-22T14:04:22Z)
- GraphTranslator: Aligning Graph Model to Large Language Model for Open-ended Tasks [44.02825843494608]
Large language models (LLMs) like ChatGPT exhibit powerful zero-shot and instruction-following capabilities.
GraphTranslator aims to leverage graph models (GMs) to handle pre-defined tasks effectively.
By translating node representations into tokens, GraphTranslator empowers an LLM to make predictions based on language instructions.
arXiv Detail & Related papers (2024-02-11T13:24:13Z)
- DOMINO: A Dual-System for Multi-step Visual Language Reasoning [76.69157235928594]
We propose a dual-system for multi-step multimodal reasoning, which consists of a "System-1" step for visual information extraction and a "System-2" step for deliberate reasoning.
Our method with a pre-trained System-2 module performs competitively compared to prior work on in- and out-of-distribution data.
arXiv Detail & Related papers (2023-10-04T13:29:47Z)
- GenPlot: Increasing the Scale and Diversity of Chart Derendering Data [0.0]
OCR-free chart-to-text translation has achieved state-of-the-art results on visual language tasks.
We propose GenPlot, a plot generator that can generate billions of additional plots for chart derendering using synthetic data.
arXiv Detail & Related papers (2023-06-20T17:25:53Z)
- ChartReader: A Unified Framework for Chart Derendering and Comprehension without Heuristic Rules [89.75395046894809]
We present ChartReader, a unified framework that seamlessly integrates chart derendering and comprehension tasks.
Our approach includes a transformer-based chart component detection module and an extended pre-trained vision-language model for chart-to-X tasks.
Our proposed framework can significantly reduce the manual effort involved in chart analysis, providing a step towards a universal chart understanding model.
arXiv Detail & Related papers (2023-04-05T00:25:27Z)
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation [86.4572981982407]
We propose BLIP, a new vision-language framework which transfers flexibly to both vision-language understanding and generation tasks.
BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones.
BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.
arXiv Detail & Related papers (2022-01-28T12:49:48Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
In addition, we propose a KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z)