Visual Programming: Compositional visual reasoning without training
- URL: http://arxiv.org/abs/2211.11559v1
- Date: Fri, 18 Nov 2022 18:50:09 GMT
- Title: Visual Programming: Compositional visual reasoning without training
- Authors: Tanmay Gupta and Aniruddha Kembhavi
- Abstract summary: VISPROG is a neuro-symbolic approach to solving complex and compositional visual tasks.
It uses the in-context learning ability of large language models to generate python-like modular programs.
- Score: 24.729624386851388
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present VISPROG, a neuro-symbolic approach to solving complex and
compositional visual tasks given natural language instructions. VISPROG avoids
the need for any task-specific training. Instead, it uses the in-context
learning ability of large language models to generate python-like modular
programs, which are then executed to get both the solution and a comprehensive
and interpretable rationale. Each line of the generated program may invoke one
of several off-the-shelf computer vision models, image processing routines, or
python functions to produce intermediate outputs that may be consumed by
subsequent parts of the program. We demonstrate the flexibility of VISPROG on 4
diverse tasks - compositional visual question answering, zero-shot reasoning on
image pairs, factual knowledge object tagging, and language-guided image
editing. We believe neuro-symbolic approaches like VISPROG are an exciting
avenue to easily and effectively expand the scope of AI systems to serve the
long tail of complex tasks that people may wish to perform.
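To make the execution model concrete, here is a minimal sketch of the idea the abstract describes, under stated assumptions: an LLM (not shown) emits a short modular program, and a simple interpreter runs it line by line, wiring each module's output into later steps. The module names, the program syntax, and the stub implementations below are illustrative assumptions, not VISPROG's exact API.

```python
# Minimal sketch of a VISPROG-style interpreter. A real system would back
# each stub "module" with an off-the-shelf vision model or image routine.
import re

def LOC(image, object):          # would run an object detector
    return (10, 10, 50, 50)      # dummy bounding box

def CROP(image, box):            # would crop the image to the box
    return f"{image}[crop {box}]"

def VQA(image, question):        # would query a VQA model
    return "brown"

def RESULT(var):                 # marks the final output
    return var

MODULES = {"LOC": LOC, "CROP": CROP, "VQA": VQA, "RESULT": RESULT}

def execute(program, inputs):
    """Run a generated program; every intermediate output is stored in an
    environment keyed by variable name, which doubles as the rationale."""
    env = dict(inputs)
    for line in program.strip().splitlines():
        out, call = line.split("=", 1)               # OUT = MODULE(k=v, ...)
        name, args = re.match(r"(\w+)\((.*)\)", call.strip()).groups()
        kwargs = {}
        for pair in args.split(","):
            key, val = (s.strip() for s in pair.split("="))
            # Quoted values are literals; bare names look up earlier outputs.
            kwargs[key] = val.strip("'") if val.startswith("'") else env[val]
        env[out.strip()] = MODULES[name](**kwargs)
    return env

program = """
BOX0=LOC(image=IMAGE,object='dog')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What color is the dog?')
FINAL=RESULT(var=ANSWER0)
"""
env = execute(program, {"IMAGE": "input.jpg"})
print(env["FINAL"])  # -> "brown"; env retains every intermediate step
```

Because every intermediate variable remains in the environment after execution, the full chain of module outputs serves as the comprehensive, interpretable rationale the abstract refers to.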
Related papers
- VoxelPrompt: A Vision-Language Agent for Grounded Medical Image Analysis [9.937830036053871]
VoxelPrompt tackles diverse radiological tasks through joint modeling of natural language, image volumes, and analytical metrics.
We show that VoxelPrompt can delineate hundreds of anatomical and pathological features, measure many complex morphological properties, and perform open-language analysis of lesion characteristics.
arXiv Detail & Related papers (2024-10-10T22:11:43Z)
- PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions [66.92809850624118]
PixWizard is an image-to-image visual assistant designed for image generation, manipulation, and translation based on free-form language instructions.
We cast a variety of vision tasks into a unified image-text-to-image generation framework and curate an Omni Pixel-to-Pixel Instruction-Tuning dataset.
Our experiments demonstrate that PixWizard not only shows impressive generative and understanding abilities for images with diverse resolutions but also exhibits promising generalization capabilities with unseen tasks and human instructions.
arXiv Detail & Related papers (2024-09-23T17:59:46Z)
- Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement [93.73648674743097]
Visual program synthesis is a promising approach to exploit the reasoning abilities of large language models for compositional computer vision tasks.
Previous work has used few-shot prompting with frozen LLMs to synthesize visual programs.
No dataset of visual programs exists for training, and such a dataset cannot be easily acquired through crowdsourcing.
arXiv Detail & Related papers (2024-04-06T13:25:00Z)
- De-fine: Decomposing and Refining Visual Programs with Auto-Feedback [75.62712247421146]
De-fine is a training-free framework that decomposes complex tasks into simpler subtasks and refines programs through auto-feedback.
Our experiments across various visual tasks show that De-fine creates more robust programs.
arXiv Detail & Related papers (2023-11-21T06:24:09Z)
- InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists [66.85125112199898]
We develop a unified language interface for computer vision tasks that abstracts away task-specific design choices.
Our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models.
arXiv Detail & Related papers (2023-09-30T14:26:43Z)
- Image Manipulation via Multi-Hop Instructions -- A New Dataset and Weakly-Supervised Neuro-Symbolic Approach [31.380435286215757]
We are interested in image manipulation via natural language text.
Our system, referred to as NeuroSIM, can perform complex multi-hop reasoning over multi-object scenes.
arXiv Detail & Related papers (2023-05-23T17:59:10Z)
- ViperGPT: Visual Inference via Python Execution for Reasoning [23.56704214763551]
We introduce ViperGPT, a framework that composes vision-and-language models into subroutines to produce a result for any query.
This simple approach requires no further training and achieves state-of-the-art results across various complex visual tasks; see the sketch after this list for the flavor of such generated programs.
arXiv Detail & Related papers (2023-03-14T17:57:47Z)
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistive signals for general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted from existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
- Images Speak in Images: A Generalist Painter for In-Context Visual Learning [98.78475432114595]
In-context learning allows the model to rapidly adapt to various tasks with only a handful of prompts and examples.
It is unclear how to define the general-purpose task prompts that the vision model can understand and transfer to out-of-domain tasks.
We present Painter, a generalist model that redefines the output of core vision tasks as images and specifies task prompts as images as well.
arXiv Detail & Related papers (2022-12-05T18:59:50Z)
- Learning compositional programs with arguments and sampling [12.790055619773565]
We train a machine learning model to discover a program that satisfies specific requirements.
We extend a state-of-the-art model, AlphaNPI, by learning to generate functions that can accept arguments.
arXiv Detail & Related papers (2021-09-01T21:27:41Z)
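As a counterpoint to the VISPROG sketch above, here is a hypothetical illustration of the ViperGPT style referenced in the list: instead of a custom interpreter, the LLM emits ordinary Python against a patch-query API, and the program is executed directly. The ImagePatch class, its methods, and the stub behaviors below are illustrative assumptions, not ViperGPT's exact interface.

```python
# Illustrative only: a hand-rolled stand-in for an ImagePatch-style API that
# a ViperGPT-like framework would back with real vision-and-language models,
# plus the kind of plain-Python program an LLM might generate against it.
class ImagePatch:
    def __init__(self, image):
        self.image = image  # stand-in for pixel data

    def find(self, name):
        """Would call an open-vocabulary detector; stubbed here."""
        return [ImagePatch(f"{self.image}/{name}{i}") for i in range(2)]

    def exists(self, name):
        return len(self.find(name)) > 0

    def simple_query(self, question):
        """Would call a visual question answering model; stubbed here."""
        return "red"

# What a generated program for the query
# "What color is the cup next to the laptop?" might look like:
def execute_query(image):
    patch = ImagePatch(image)
    if not patch.exists("cup"):
        return "no cup found"
    cup = patch.find("cup")[0]
    return cup.simple_query("What color is the cup?")

print(execute_query("photo.jpg"))  # -> "red" with these stubs
```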
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.