A Continual Learning Paradigm for Non-differentiable Visual Programming
Frameworks on Visual Reasoning Tasks
- URL: http://arxiv.org/abs/2309.09809v2
- Date: Thu, 30 Nov 2023 09:31:59 GMT
- Title: A Continual Learning Paradigm for Non-differentiable Visual Programming
Frameworks on Visual Reasoning Tasks
- Authors: Wentao Wan, Nan Kang, Zeqing Wang, Zhuojie Yang, Liang Lin, Keze Wang
- Abstract summary: We propose a Continual Learning paradigm for VisProg across various visual reasoning tasks.
Our CLVP distills the capabilities of well-trained task-specific models into the visual sub-modules in a stepwise and anti-forgetting manner.
- Score: 51.053901491986025
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, the visual programming framework (VisProg) has emerged as a
significant framework for executing compositional visual tasks due to its
interpretability and flexibility. However, the performance of VisProg on
specific Visual Reasoning (VR) tasks is markedly inferior to that of
well-trained task-specific models, since its visual sub-modules have limited
generalization capabilities. Because VisProg is non-differentiable, it is
challenging to improve these visual sub-modules for a specific VR task while
maintaining their generalizability on unseen tasks. To overcome these
difficulties, we propose CLVP, a Continual Learning paradigm for VisProg across
various visual reasoning tasks. Specifically, our CLVP distills the
capabilities of well-trained task-specific models into the visual sub-modules
in a stepwise and anti-forgetting manner. This continually improves the
performance of VisProg on multiple visual tasks while preserving its
flexibility. Extensive and comprehensive experimental results demonstrate that
our CLVP obtains significant performance gains on specific VR benchmarks, i.e.,
GQA (+1.4%) and NLVRv2 (+5.6%), compared to the VisProg baseline, and also
maintains promising generalizability for VR on unseen and previously learned
tasks.
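To make the stepwise, anti-forgetting distillation concrete, the sketch below shows one possible training step. This is not the authors' implementation: it assumes PyTorch, a well-trained task-specific teacher model, and a frozen snapshot of the visual sub-module taken before the new task is learned, and it relies on the fact that each sub-module is itself differentiable even though the composed VisProg program is not. All names (distill_step, prev_snapshot, lam) are hypothetical.

```python
# Minimal sketch (not the paper's code): distill a task-specific teacher into a
# VisProg visual sub-module while penalizing drift from a pre-task snapshot,
# a simple anti-forgetting regularizer. Assumes PyTorch nn.Modules throughout.
import copy
import torch
import torch.nn.functional as F

def distill_step(sub_module, teacher, prev_snapshot, images, optimizer, lam=1.0):
    """One optimization step: match the teacher on the new task (capability
    transfer) while staying close to the sub-module's previous behaviour."""
    teacher.eval()
    prev_snapshot.eval()

    with torch.no_grad():
        teacher_out = teacher(images)     # target behaviour on the new task
        prev_out = prev_snapshot(images)  # behaviour before this task

    student_out = sub_module(images)

    # Distillation term: absorb the task-specific model's capability.
    distill_loss = F.mse_loss(student_out, teacher_out)
    # Anti-forgetting term: do not drift from the previous sub-module.
    retain_loss = F.mse_loss(student_out, prev_out)

    loss = distill_loss + lam * retain_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Before starting a new task, freeze a copy of the current sub-module:
# prev_snapshot = copy.deepcopy(sub_module).requires_grad_(False)
```

Here lam trades off fidelity to the new task against retention of earlier behaviour; other anti-forgetting mechanisms (e.g., replay buffers or parameter-importance penalties) could be substituted in the same loop.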
Related papers
- VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use [74.39058448757645]
We present VipAct, an agent framework that enhances vision-language models (VLMs) through specialized agent collaboration and tool use.
VipAct consists of an orchestrator agent, which manages task requirement analysis, planning, and coordination, along with specialized agents that handle specific tasks.
We evaluate VipAct on benchmarks featuring a diverse set of visual perception tasks, with experimental results demonstrating significant performance improvements.
arXiv Detail & Related papers (2024-10-21T18:10:26Z)
- From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning [26.21049702284394]
Large vision language models (VLMs) combine large language models with vision encoders, demonstrating promise across various tasks.
We introduce VITask, a novel framework that enhances the task-specific adaptability of VLMs by integrating task-specific models (TSMs).
arXiv Detail & Related papers (2024-10-09T01:24:04Z)
- @Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology [31.779074930032184]
Human-centered Assistive Technologies (ATs) for helping People with Visual Impairments (PVIs) are evolving into generalists, capable of performing multiple tasks simultaneously.
We first create a novel AT benchmark (@Bench) guided by a pre-design user study with PVIs.
In addition, we propose a novel AT model (@Model) that addresses all tasks simultaneously and can be expanded with more assistive functions for helping PVIs.
arXiv Detail & Related papers (2024-09-21T18:30:17Z)
- Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning [53.93074108238167]
We construct Vision-Flan, the most diverse publicly available visual instruction tuning dataset to date.
We propose a two-stage instruction tuning framework, in which VLMs are finetuned on Vision-Flan and further tuned on GPT-4 synthesized data.
We find this two-stage tuning framework significantly outperforms the traditional single-stage visual instruction tuning framework.
arXiv Detail & Related papers (2024-02-18T19:38:44Z)
- Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey [59.95153883166705]
Traditional computer vision generally solves each single task independently by a dedicated model with the task instruction implicitly designed in the model architecture.
Visual Instruction Tuning (VIT), which finetunes a large vision model with language as task instructions, has recently been intensively studied.
This work aims to provide a systematic review of visual instruction tuning, covering (1) the background that presents computer vision task paradigms and the development of VIT; (2) the foundations of VIT that introduce commonly used network architectures, visual instruction tuning frameworks and objectives, and evaluation setups and tasks; and (3) the commonly used datasets in visual instruction tuning and evaluation.
arXiv Detail & Related papers (2023-12-27T14:54:37Z)
- Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [83.85856356798531]
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks.
It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences.
We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
arXiv Detail & Related papers (2023-12-19T18:53:01Z)
- Task Formulation Matters When Learning Continually: A Case Study in Visual Question Answering [58.82325933356066]
Continual learning aims to train a model incrementally on a sequence of tasks without forgetting previous knowledge.
We present a detailed study of how different settings affect performance for Visual Question Answering.
arXiv Detail & Related papers (2022-09-30T19:12:58Z)