A Continual Learning Paradigm for Non-differentiable Visual Programming
Frameworks on Visual Reasoning Tasks
- URL: http://arxiv.org/abs/2309.09809v2
- Date: Thu, 30 Nov 2023 09:31:59 GMT
- Title: A Continual Learning Paradigm for Non-differentiable Visual Programming
Frameworks on Visual Reasoning Tasks
- Authors: Wentao Wan, Nan Kang, Zeqing Wang, Zhuojie Yang, Liang Lin, Keze Wang
- Abstract summary: We propose a Continual Learning paradigm for VisProg across various visual reasoning tasks.
Our CLVP distills the capabilities of well-trained task-specific models into the visual sub-modules in a stepwise and anti-forgetting manner.
- Score: 51.053901491986025
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, the visual programming framework (VisProg) has emerged as a
significant framework for executing compositional visual tasks due to its
interpretability and flexibility. However, the performance of VisProg on
specific Visual Reasoning (VR) tasks is markedly inferior to that of
well-trained task-specific models, since its visual sub-modules have limited
generalization capabilities. Because VisProg is non-differentiable, it is
challenging to improve these visual sub-modules for a specific VR task while
maintaining their generalizability on unseen tasks. To overcome these
difficulties, we propose CLVP, a Continual Learning paradigm for VisProg across
various visual reasoning tasks. Specifically, our CLVP distills the
capabilities of well-trained task-specific models into the visual sub-modules
in a stepwise and anti-forgetting manner. This continually improves the
performance of VisProg on multiple visual tasks while preserving its
flexibility. Extensive and comprehensive experimental results demonstrate that
our CLVP obtains significant performance gains on specific VR benchmarks, i.e.,
GQA (+1.4%) and NLVRv2 (+5.6%), compared to the VisProg baseline, and also
maintains promising generalizability for VR on unseen and previously learned
tasks.
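To make the stepwise, anti-forgetting distillation concrete, the sketch below shows one possible training step. This is not the authors' implementation: it assumes PyTorch, a well-trained task-specific teacher model, and a frozen snapshot of the visual sub-module taken before the new task is learned, and it relies on the fact that each sub-module is itself differentiable even though the composed VisProg program is not. All names (distill_step, prev_snapshot, lam) are hypothetical.

```python
# Minimal sketch (not the paper's code): distill a task-specific teacher into a
# VisProg visual sub-module while penalizing drift from a pre-task snapshot,
# a simple anti-forgetting regularizer. Assumes PyTorch nn.Modules throughout.
import copy
import torch
import torch.nn.functional as F

def distill_step(sub_module, teacher, prev_snapshot, images, optimizer, lam=1.0):
    """One optimization step: match the teacher on the new task (capability
    transfer) while staying close to the sub-module's previous behaviour."""
    teacher.eval()
    prev_snapshot.eval()

    with torch.no_grad():
        teacher_out = teacher(images)     # target behaviour on the new task
        prev_out = prev_snapshot(images)  # behaviour before this task

    student_out = sub_module(images)

    # Distillation term: absorb the task-specific model's capability.
    distill_loss = F.mse_loss(student_out, teacher_out)
    # Anti-forgetting term: do not drift from the previous sub-module.
    retain_loss = F.mse_loss(student_out, prev_out)

    loss = distill_loss + lam * retain_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Before starting a new task, freeze a copy of the current sub-module:
# prev_snapshot = copy.deepcopy(sub_module).requires_grad_(False)
```

Here lam trades off fidelity to the new task against retention of earlier behaviour; other anti-forgetting mechanisms (e.g., replay buffers or parameter-importance penalties) could be substituted in the same loop.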
Related papers
- VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use [74.39058448757645]
We present VipAct, an agent framework that enhances vision-language models (VLMs) through specialized agent collaboration and tool use.
VipAct consists of an orchestrator agent, which manages task requirement analysis, planning, and coordination, along with specialized agents that handle specific tasks.
We evaluate VipAct on benchmarks featuring a diverse set of visual perception tasks, with experimental results demonstrating significant performance improvements.
arXiv Detail & Related papers (2024-10-21T18:10:26Z)
- From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning [26.21049702284394]
Large vision language models (VLMs) combine large language models with vision encoders, demonstrating promise across various tasks.
We introduce VITask, a novel framework that enhances the task-specific adaptability of VLMs by integrating task-specific models (TSMs).
arXiv Detail & Related papers (2024-10-09T01:24:04Z)
- @Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology [31.779074930032184]
Human-centered Assistive Technologies (ATs) for helping People with Visual Impairments (PVIs) are evolving into generalists, capable of performing multiple tasks simultaneously.
We first create a novel AT benchmark (@Bench) guided by a pre-design user study with PVIs.
In addition, we propose a novel AT model (@Model) that addresses all tasks simultaneously and can be expanded with more assistive functions for helping PVIs.
arXiv Detail & Related papers (2024-09-21T18:30:17Z)
- Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning [53.93074108238167]
We construct Vision-Flan, the most diverse publicly available visual instruction tuning dataset to date.
We propose a two-stage instruction tuning framework, in which VLMs are finetuned on Vision-Flan and further tuned on GPT-4 synthesized data.
We find this two-stage tuning framework significantly outperforms the traditional single-stage visual instruction tuning framework.
arXiv Detail & Related papers (2024-02-18T19:38:44Z)
- Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey [59.95153883166705]
Traditional computer vision generally solves each single task independently by a dedicated model with the task instruction implicitly designed in the model architecture.
Visual Instruction Tuning (VIT), which finetunes a large vision model with language as task instructions, has recently been intensively studied.
This work aims to provide a systematic review of visual instruction tuning, covering (1) the background that presents computer vision task paradigms and the development of VIT; (2) the foundations of VIT that introduce commonly used network architectures, visual instruction tuning frameworks and objectives, and evaluation setups and tasks; and (3) the commonly used datasets in visual instruction tuning and evaluation.
arXiv Detail & Related papers (2023-12-27T14:54:37Z)
- Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [83.85856356798531]
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks.
It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences.
We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
arXiv Detail & Related papers (2023-12-19T18:53:01Z)
- Task Formulation Matters When Learning Continually: A Case Study in Visual Question Answering [58.82325933356066]
Continual learning aims to train a model incrementally on a sequence of tasks without forgetting previous knowledge.
We present a detailed study of how different settings affect performance for Visual Question Answering.
arXiv Detail & Related papers (2022-09-30T19:12:58Z)