Visual Programming: Compositional visual reasoning without training
- URL: http://arxiv.org/abs/2211.11559v1
- Date: Fri, 18 Nov 2022 18:50:09 GMT
- Title: Visual Programming: Compositional visual reasoning without training
- Authors: Tanmay Gupta and Aniruddha Kembhavi
- Abstract summary: VISPROG is a neuro-symbolic approach to solving complex and compositional visual tasks.
It uses the in-context learning ability of large language models to generate python-like modular programs.
- Score: 24.729624386851388
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present VISPROG, a neuro-symbolic approach to solving complex and
compositional visual tasks given natural language instructions. VISPROG avoids
the need for any task-specific training. Instead, it uses the in-context
learning ability of large language models to generate python-like modular
programs, which are then executed to get both the solution and a comprehensive
and interpretable rationale. Each line of the generated program may invoke one
of several off-the-shelf computer vision models, image processing routines, or
python functions to produce intermediate outputs that may be consumed by
subsequent parts of the program. We demonstrate the flexibility of VISPROG on 4
diverse tasks - compositional visual question answering, zero-shot reasoning on
image pairs, factual knowledge object tagging, and language-guided image
editing. We believe neuro-symbolic approaches like VISPROG are an exciting
avenue to easily and effectively expand the scope of AI systems to serve the
long tail of complex tasks that people may wish to perform.
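To make the execution model concrete, here is a minimal sketch of the idea the abstract describes, under stated assumptions: an LLM (not shown) emits a short modular program, and a simple interpreter runs it line by line, wiring each module's output into later steps. The module names, the program syntax, and the stub implementations below are illustrative assumptions, not VISPROG's exact API.

```python
# Minimal sketch of a VISPROG-style interpreter. A real system would back
# each stub "module" with an off-the-shelf vision model or image routine.
import re

def LOC(image, object):          # would run an object detector
    return (10, 10, 50, 50)      # dummy bounding box

def CROP(image, box):            # would crop the image to the box
    return f"{image}[crop {box}]"

def VQA(image, question):        # would query a VQA model
    return "brown"

def RESULT(var):                 # marks the final output
    return var

MODULES = {"LOC": LOC, "CROP": CROP, "VQA": VQA, "RESULT": RESULT}

def execute(program, inputs):
    """Run a generated program; every intermediate output is stored in an
    environment keyed by variable name, which doubles as the rationale."""
    env = dict(inputs)
    for line in program.strip().splitlines():
        out, call = line.split("=", 1)               # OUT = MODULE(k=v, ...)
        name, args = re.match(r"(\w+)\((.*)\)", call.strip()).groups()
        kwargs = {}
        for pair in args.split(","):
            key, val = (s.strip() for s in pair.split("="))
            # Quoted values are literals; bare names look up earlier outputs.
            kwargs[key] = val.strip("'") if val.startswith("'") else env[val]
        env[out.strip()] = MODULES[name](**kwargs)
    return env

program = """
BOX0=LOC(image=IMAGE,object='dog')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What color is the dog?')
FINAL=RESULT(var=ANSWER0)
"""
env = execute(program, {"IMAGE": "input.jpg"})
print(env["FINAL"])  # -> "brown"; env retains every intermediate step
```

Because every intermediate variable remains in the environment after execution, the full chain of module outputs serves as the comprehensive, interpretable rationale the abstract refers to.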
Related papers
- VoxelPrompt: A Vision-Language Agent for Grounded Medical Image Analysis [9.937830036053871]
VoxelPrompt tackles diverse radiological tasks through joint modeling of natural language, image volumes, and analytical metrics.
We show that VoxelPrompt can delineate hundreds of anatomical and pathological features, measure many complex morphological properties, and perform open-language analysis of lesion characteristics.
arXiv Detail & Related papers (2024-10-10T22:11:43Z)
- PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions [66.92809850624118]
PixWizard is an image-to-image visual assistant designed for image generation, manipulation, and translation based on free-form language instructions.
We cast a variety of vision tasks into a unified image-text-to-image generation framework and curate an Omni Pixel-to-Pixel Instruction-Tuning dataset.
Our experiments demonstrate that PixWizard not only shows impressive generative and understanding abilities for images with diverse resolutions but also exhibits promising generalization capabilities with unseen tasks and human instructions.
arXiv Detail & Related papers (2024-09-23T17:59:46Z)
- Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement [93.73648674743097]
Visual program synthesis is a promising approach to exploit the reasoning abilities of large language models for compositional computer vision tasks.
Previous work has used few-shot prompting with frozen LLMs to synthesize visual programs.
No dataset of visual programs exists for training, and such a dataset cannot be easily acquired through crowdsourcing.
arXiv Detail & Related papers (2024-04-06T13:25:00Z)
- De-fine: Decomposing and Refining Visual Programs with Auto-Feedback [75.62712247421146]
De-fine is a training-free framework that decomposes complex tasks into simpler subtasks and refines programs through auto-feedback.
Our experiments across various visual tasks show that De-fine creates more robust programs.
arXiv Detail & Related papers (2023-11-21T06:24:09Z)
- InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists [66.85125112199898]
We develop a unified language interface for computer vision tasks that abstracts away task-specific design choices.
Our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models.
arXiv Detail & Related papers (2023-09-30T14:26:43Z)
- Image Manipulation via Multi-Hop Instructions -- A New Dataset and Weakly-Supervised Neuro-Symbolic Approach [31.380435286215757]
We are interested in image manipulation via natural language text.
Our system, referred to as NeuroSIM, can perform complex multi-hop reasoning over multi-object scenes.
arXiv Detail & Related papers (2023-05-23T17:59:10Z)
- ViperGPT: Visual Inference via Python Execution for Reasoning [23.56704214763551]
We introduce ViperGPT, a framework that composes vision-and-language models into subroutines to produce a result for any query.
This simple approach requires no further training and achieves state-of-the-art results across various complex visual tasks; see the sketch after this list for the flavor of such generated programs.
arXiv Detail & Related papers (2023-03-14T17:57:47Z)
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistive signals for general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted from existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
- Images Speak in Images: A Generalist Painter for In-Context Visual Learning [98.78475432114595]
In-context learning allows the model to rapidly adapt to various tasks with only a handful of prompts and examples.
It is unclear how to define the general-purpose task prompts that the vision model can understand and transfer to out-of-domain tasks.
We present Painter, a generalist model that redefines the output of core vision tasks as images and specifies task prompts as images as well.
arXiv Detail & Related papers (2022-12-05T18:59:50Z)
- Learning compositional programs with arguments and sampling [12.790055619773565]
We train a machine learning model to discover a program that satisfies specific requirements.
We extend a state-of-the-art model, AlphaNPI, by learning to generate functions that can accept arguments.
arXiv Detail & Related papers (2021-09-01T21:27:41Z)
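As a counterpoint to the VISPROG sketch above, here is a hypothetical illustration of the ViperGPT style referenced in the list: instead of a custom interpreter, the LLM emits ordinary Python against a patch-query API, and the program is executed directly. The ImagePatch class, its methods, and the stub behaviors below are illustrative assumptions, not ViperGPT's exact interface.

```python
# Illustrative only: a hand-rolled stand-in for an ImagePatch-style API that
# a ViperGPT-like framework would back with real vision-and-language models,
# plus the kind of plain-Python program an LLM might generate against it.
class ImagePatch:
    def __init__(self, image):
        self.image = image  # stand-in for pixel data

    def find(self, name):
        """Would call an open-vocabulary detector; stubbed here."""
        return [ImagePatch(f"{self.image}/{name}{i}") for i in range(2)]

    def exists(self, name):
        return len(self.find(name)) > 0

    def simple_query(self, question):
        """Would call a visual question answering model; stubbed here."""
        return "red"

# What a generated program for the query
# "What color is the cup next to the laptop?" might look like:
def execute_query(image):
    patch = ImagePatch(image)
    if not patch.exists("cup"):
        return "no cup found"
    cup = patch.find("cup")[0]
    return cup.simple_query("What color is the cup?")

print(execute_query("photo.jpg"))  # -> "red" with these stubs
```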
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.