De-fine: Decomposing and Refining Visual Programs with Auto-Feedback
- URL: http://arxiv.org/abs/2311.12890v2
- Date: Sat, 25 Nov 2023 09:34:39 GMT
- Title: De-fine: Decomposing and Refining Visual Programs with Auto-Feedback
- Authors: Minghe Gao, Juncheng Li, Hao Fei, Liang Pang, Wei Ji, Guoming Wang,
Wenqiao Zhang, Siliang Tang, Yueting Zhuang
- Abstract summary: We introduce De-fine, a framework that decomposes complex tasks into simpler subtasks and refines programs through auto-feedback.
Our experiments across various visual tasks show that De-fine creates more accurate and robust programs, setting new benchmarks in the field.
- Score: 81.08213203440634
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual programming, a modular and generalizable paradigm, integrates
different modules and Python operators to solve various vision-language tasks.
Unlike end-to-end models that need task-specific data, it excels at
performing visual processing and reasoning in an unsupervised manner. Current
visual programming methods generate a program for each task in a single pass
and lack the ability to evaluate and optimize it based on feedback,
which consequently limits their effectiveness for complex,
multi-step problems. Drawing inspiration from Benders decomposition, we
introduce De-fine, a general framework that automatically decomposes complex
tasks into simpler subtasks and refines programs through auto-feedback. This
model-agnostic approach can improve logical reasoning performance by
integrating the strengths of multiple models. Our experiments across various
visual tasks show that De-fine creates more accurate and robust programs,
setting new benchmarks in the field.
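The abstract describes two steps: decomposing a task into subtasks, and refining a generated program using automatic feedback. The following is only a minimal sketch of such a generate, execute, and refine loop, not the authors' implementation. Every name in it (`refine_with_feedback`, `toy_decompose`, `toy_generate`, `toy_execute`) is hypothetical, and the toy functions stand in for what would in practice be language-model and vision-module calls.

```python
from typing import Callable, List, Tuple

def refine_with_feedback(
    task: str,
    decompose: Callable[[str], List[str]],
    generate: Callable[[str, List[str], str], str],
    execute: Callable[[str, List[str]], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Hypothetical loop: decompose once, then generate and refine until
    execution succeeds or the round budget is exhausted."""
    subtasks = decompose(task)
    feedback = ""   # empty on the first pass
    program = ""
    for round_idx in range(1, max_rounds + 1):
        program = generate(task, subtasks, feedback)
        ok, feedback = execute(program, subtasks)
        if ok:
            return program, round_idx
    return program, max_rounds

# Toy stand-ins (assumptions for illustration only).
def toy_decompose(task: str) -> List[str]:
    # Split a compound instruction on " and ".
    return [part.strip() for part in task.split(" and ")]

def toy_generate(task: str, subtasks: List[str], feedback: str) -> str:
    # The first draft deliberately omits the last subtask; once any
    # feedback is available, the revised draft covers every subtask.
    covered = subtasks if feedback else subtasks[:-1]
    return "\n".join(f"step({s!r})" for s in covered)

def toy_execute(program: str, subtasks: List[str]) -> Tuple[bool, str]:
    # "Execution" here only checks that each subtask has a step.
    missing = [s for s in subtasks if f"step({s!r})" not in program]
    if missing:
        return False, "missing subtasks: " + ", ".join(missing)
    return True, ""

program, rounds = refine_with_feedback(
    "find the dog and count its legs",
    toy_decompose, toy_generate, toy_execute,
)
print(rounds)
print(program)
```

In this toy run the first draft fails its check, the feedback string is passed back to the generator, and the second draft passes. The single-pass methods the abstract criticizes would correspond to stopping after the first round.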
Related papers
- InsightSee: Advancing Multi-agent Vision-Language Models for Enhanced Visual Understanding [12.082379948480257]
This paper proposes InsightSee, a multi-agent framework to enhance vision-language models' capabilities in handling complex visual understanding scenarios.
The framework comprises a description agent, two reasoning agents, and a decision agent, which are integrated to refine the process of visual information interpretation.
The proposed framework outperforms state-of-the-art algorithms in 6 out of 9 benchmark tests, with a substantial advancement in multimodal understanding.
arXiv Detail & Related papers (2024-05-31T13:56:55Z)
- Modeling Output-Level Task Relatedness in Multi-Task Learning with Feedback Mechanism [7.479892725446205]
Multi-task learning (MTL) is a paradigm that simultaneously learns multiple tasks by sharing information at different levels.
We introduce a posteriori information into the model, considering that different tasks may produce correlated outputs with mutual influences.
We achieve this by incorporating a feedback mechanism into MTL models, where the output of one task serves as a hidden feature for another task.
arXiv Detail & Related papers (2024-04-01T03:27:34Z)
- Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers [54.83459025465947]
Even the largest models struggle with compositional reasoning, generalization, fine-grained spatial and temporal reasoning, and counting.
Visual reasoning with large language models (LLMs) as controllers can, in principle, address these limitations by decomposing the task and solving subtasks by orchestrating a set of (visual) tools.
We present a framework that mitigates these issues by introducing spatially and temporally abstract routines and by leveraging a small number of labeled examples to automatically generate in-context examples.
arXiv Detail & Related papers (2024-01-03T20:48:47Z)
- Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [83.85856356798531]
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks.
It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences.
We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
arXiv Detail & Related papers (2023-12-19T18:53:01Z)
- Tuning computer vision models with task rewards [88.45787930908102]
Misalignment between model predictions and intended usage can be detrimental for the deployment of computer vision models.
In natural language processing, this is often addressed using reinforcement learning techniques that align models with a task reward.
We adopt this approach and show its surprising effectiveness across multiple computer vision tasks, such as object detection, panoptic segmentation, colorization and image captioning.
arXiv Detail & Related papers (2023-02-16T11:49:48Z)
- Visual Programming: Compositional visual reasoning without training [24.729624386851388]
VISPROG is a neuro-symbolic approach to solving complex and compositional visual tasks.
It uses the in-context learning ability of large language models to generate Python-like modular programs.
arXiv Detail & Related papers (2022-11-18T18:50:09Z)
- Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks [73.63892022944198]
We present a generic perception architecture named Uni-Perceiver.
It processes a variety of modalities and tasks with unified modeling and shared parameters.
Results show that our pre-trained model without any tuning can achieve reasonable performance even on novel tasks.
arXiv Detail & Related papers (2021-12-02T18:59:50Z)
- Generative Modeling for Multi-task Visual Learning [40.96212750592383]
We consider a novel problem of learning a shared generative model that is useful across various visual perception tasks.
We propose a general multi-task oriented generative modeling framework, by coupling a discriminative multi-task network with a generative network.
Our framework consistently outperforms state-of-the-art multi-task approaches.
arXiv Detail & Related papers (2021-06-25T03:42:59Z)
- How to Design Sample and Computationally Efficient VQA Models [53.65668097847456]
We find that representing the text as probabilistic programs and images as object-level scene graphs best satisfies these desiderata.
We extend existing models to leverage these soft programs and scene graphs to train on question answer pairs in an end-to-end manner.
arXiv Detail & Related papers (2021-03-22T01:48:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.