ViperGPT: Visual Inference via Python Execution for Reasoning
- URL: http://arxiv.org/abs/2303.08128v1
- Date: Tue, 14 Mar 2023 17:57:47 GMT
- Title: ViperGPT: Visual Inference via Python Execution for Reasoning
- Authors: Dídac Surís and Sachit Menon and Carl Vondrick
- Abstract summary: We introduce ViperGPT, a framework that leverages code-generation models to compose vision-and-language models into subroutines that produce a result for any query.
This simple approach requires no further training, and achieves state-of-the-art results across various complex visual tasks.
- Score: 23.56704214763551
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Answering visual queries is a complex task that requires both visual
processing and reasoning. End-to-end models, the dominant approach for this
task, do not explicitly differentiate between the two, limiting
interpretability and generalization. Learning modular programs presents a
promising alternative, but has proven challenging due to the difficulty of
learning both the programs and modules simultaneously. We introduce ViperGPT, a
framework that leverages code-generation models to compose vision-and-language
models into subroutines to produce a result for any query. ViperGPT utilizes a
provided API to access the available modules, and composes them by generating
Python code that is later executed. This simple approach requires no further
training, and achieves state-of-the-art results across various complex visual
tasks.
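
The pattern the abstract describes can be illustrated with a minimal, self-contained sketch (this is not the official ViperGPT code): an API of vision-and-language modules is documented for a code-generation model, the model returns a Python program for the query, and that program is executed. The names below (ImagePatch, find, simple_query, execute_command) follow the style of API the paper describes, but their bodies are placeholders standing in for real pretrained models such as an open-vocabulary detector and a visual question-answering model, and generate_program is a stand-in for prompting a code LLM.

    from dataclasses import dataclass
    from typing import List


    @dataclass
    class ImagePatch:
        """A crop of the input image exposing module-backed methods (placeholder)."""
        image: object
        left: int = 0
        lower: int = 0
        right: int = 640
        upper: int = 480

        def find(self, object_name: str) -> List["ImagePatch"]:
            # Placeholder: a real implementation would call an open-vocabulary
            # object detector and return one ImagePatch per detected instance.
            return []

        def simple_query(self, question: str) -> str:
            # Placeholder: a real implementation would call a pretrained
            # visual question-answering / captioning model on this crop.
            return ""


    def generate_program(query: str, api_spec: str) -> str:
        # Placeholder for the code-generation step: the real system prompts a
        # code LLM with the API specification plus the query and receives Python
        # source defining execute_command(image). Hard-coded here for illustration.
        return (
            "def execute_command(image) -> str:\n"
            "    patch = ImagePatch(image)\n"
            "    muffins = patch.find('muffin')\n"
            "    return str(len(muffins))\n"
        )


    def answer_query(image, query: str) -> str:
        """Generate a program for the query, then execute it against the modules."""
        program = generate_program(query, api_spec="<API docstrings shown to the LLM>")
        namespace = {"ImagePatch": ImagePatch}
        exec(program, namespace)  # defines execute_command in `namespace`
        return namespace["execute_command"](image)


    if __name__ == "__main__":
        # With the placeholder detector this prints "0"; with real modules it
        # would count the muffins in the supplied image.
        print(answer_query(image=None, query="How many muffins are in this picture?"))

The point of the sketch is that all composition logic lives in generated code that can be read and executed step by step, which is what gives the approach its interpretability while requiring no further training.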
Related papers
- De-fine: Decomposing and Refining Visual Programs with Auto-Feedback [75.62712247421146]
De-fine is a training-free framework that decomposes complex tasks into simpler subtasks and refines programs through auto-feedback.
Our experiments across various visual tasks show that De-fine creates more robust programs.
arXiv Detail & Related papers (2023-11-21T06:24:09Z)
- Analyzing Modular Approaches for Visual Question Decomposition [38.73070270272822]
Modular neural networks without additional training have recently been shown to surpass end-to-end neural networks on vision-language tasks.
This paper asks where this additional performance comes from and how much is due to the (state-of-the-art, end-to-end) BLIP-2 model such systems subsume versus the additional symbolic components.
We find that ViperGPT's reported gains over BLIP-2 can be attributed to its selection of task-specific modules, and when we run ViperGPT using a more task-agnostic selection of modules, these gains go away.
arXiv Detail & Related papers (2023-11-10T22:14:26Z)
- GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs [64.49176353858792]
We propose generative neuro-symbolic visual reasoning by growing and reusing modules.
The proposed model performs competitively on standard tasks like visual question answering and referring expression comprehension.
It is able to adapt to new visual reasoning tasks by observing a few training examples and reusing modules.
arXiv Detail & Related papers (2023-11-08T18:59:05Z)
- MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning [65.60607895153692]
MiniGPT-v2 is a model that can be treated as a unified interface for handling a variety of vision-language tasks.
We propose using unique identifiers for different tasks when training the model.
Our results show that MiniGPT-v2 achieves strong performance on many visual question-answering and visual grounding benchmarks.
arXiv Detail & Related papers (2023-10-14T03:22:07Z)
- Modular Visual Question Answering via Code Generation [134.59005611826777]
We present a framework that formulates visual question answering as modular code generation.
Our approach requires no additional training and relies on pre-trained language models (LMs), visual models pre-trained on image-caption pairs, and fifty VQA examples used for in-context learning.
Our approach improves accuracy on the COVR dataset by at least 3% and on the GQA dataset by roughly 2% compared to the few-shot baseline that does not employ code generation.
arXiv Detail & Related papers (2023-06-08T17:45:14Z)
- Visual Programming: Compositional visual reasoning without training [24.729624386851388]
VISPROG is a neuro-symbolic approach to solving complex and compositional visual tasks.
It uses the in-context learning ability of large language models to generate Python-like modular programs.
arXiv Detail & Related papers (2022-11-18T18:50:09Z)
- Flamingo: a Visual Language Model for Few-Shot Learning [95.88782798074314]
We introduce Flamingo, a family of Visual Language Models (VLMs) that can be rapidly adapted to novel tasks using only a handful of annotated examples.
Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora.
We demonstrate that a single Flamingo model can achieve a new state of the art for few-shot learning.
arXiv Detail & Related papers (2022-04-29T16:29:01Z)
- Zero Experience Required: Plug & Play Modular Transfer Learning for Semantic Visual Navigation [97.17517060585875]
We present a unified approach to visual navigation using a novel modular transfer learning model.
Our model can effectively leverage its experience from one source task and apply it to multiple target tasks.
Our approach learns faster, generalizes better, and outperforms SoTA models by a significant margin.
arXiv Detail & Related papers (2022-02-05T00:07:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.