A Stepwise Distillation Learning Strategy for Non-differentiable Visual Programming Frameworks on Visual Reasoning Tasks
- URL: http://arxiv.org/abs/2309.09809v3
- Date: Sat, 22 Feb 2025 10:43:20 GMT
- Title: A Stepwise Distillation Learning Strategy for Non-differentiable Visual Programming Frameworks on Visual Reasoning Tasks
- Authors: Wentao Wan, Nan Kang, Zeqing Wang, Zhuojie Yang, Liang Lin, Keze Wang
- Abstract summary: We propose SDVP, a Stepwise Distillation learning strategy for non-differentiable VProg across various VR tasks. Our SDVP stepwise distills the capabilities of existing, well-trained small task-specific models for the decomposed visual sub-tasks in VProg into the much larger VLMs invoked by the corresponding visual sub-modules.
- Score: 48.181520570707654
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Visual Programming (VProg) has emerged as a significant framework for visual reasoning (VR) tasks due to its interpretability and cross-task generality. However, even when invoking powerful pre-trained Vision-Language Models (VLMs) as visual sub-modules, the performance of VProg on specific VR tasks is markedly inferior to that of well-trained task-specific networks. Although invoking task-specific models can further enhance the performance of VProg on specific VR tasks, it greatly diminishes the cross-task generalization ability of VProg. Moreover, the non-differentiable nature of VProg prevents direct fine-tuning on specific VR tasks for further performance improvement. To address these issues, we propose SDVP, a Stepwise Distillation learning strategy for non-differentiable VProg across various VR tasks. Specifically, our SDVP stepwise distills the capabilities of existing, well-trained small task-specific models for the decomposed visual sub-tasks in VProg into the much larger VLMs invoked by the corresponding visual sub-modules. Moreover, distilling the knowledge of small task-specific models into the pre-trained larger VLMs, rather than replacing them, helps preserve the cross-task abilities of VProg. Extensive and comprehensive experimental results on different VProg frameworks demonstrate that our SDVP obtains significant performance gains on specific VR benchmarks, i.e., GQA (+2.4%) and NLVRv2 (+6.2%) for VisProg and GQA (+6.5%) and NLVRv2 (+4.0%) for ViperGPT, and also maintains promising performance for VProg on unseen and previous VR tasks.
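A minimal sketch of the core idea, sub-module-level knowledge distillation around a non-differentiable program, is given below in PyTorch. The names (`teacher`, `student`, `distill_step`) and the temperature-scaled KL objective are illustrative assumptions, not the authors' released implementation; the sketch only shows how a small task-specific teacher could supervise the larger VLM invoked by one visual sub-module while the surrounding program stays untouched.

```python
# Illustrative sketch only (assumed names and objective), not the authors' code.
import torch
import torch.nn.functional as F

def distill_step(student, teacher, batch, optimizer, temperature=2.0):
    """One distillation step for a single visual sub-module of a VProg program.

    `teacher` is a small, well-trained task-specific model for the decomposed
    visual sub-task; `student` is the larger VLM invoked by the corresponding
    sub-module. Both are assumed to return logits over a shared answer space.
    The non-differentiable program itself is never back-propagated through.
    """
    images, questions = batch                        # inputs routed to this sub-module
    with torch.no_grad():
        teacher_logits = teacher(images, questions)  # soft targets from the teacher
    student_logits = student(images, questions)

    # Temperature-scaled KL distillation loss (a common choice, assumed here).
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only the invoked VLM is updated, never replaced, the program-level interface of VProg, and with it its cross-task generality, is preserved.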
Related papers
- Learning to See and Act: Task-Aware View Planning for Robotic Manipulation [85.65102094981802]
Task-Aware View Planning (TAVP) is a framework designed to integrate active view planning with task-specific representation learning.
Our proposed TAVP model achieves superior performance over state-of-the-art fixed-view approaches.
arXiv Detail & Related papers (2025-08-07T09:21:20Z) - Beyond Task-Specific Reasoning: A Unified Conditional Generative Framework for Abstract Visual Reasoning [52.107043437362556]
Abstract visual reasoning (AVR) enables humans to quickly discover and generalize abstract rules to new scenarios.
This paper proposes a novel Unified Conditional Generative Solver (UCGS) to address multiple tasks in a unified framework.
UCGS exhibits the ability of zero-shot reasoning, enabling it to perform abstract reasoning on problems from unseen tasks in the testing phase.
arXiv Detail & Related papers (2025-07-15T21:54:51Z) - Test-Time Visual In-Context Tuning [85.62916644835902]
Visual in-context learning (VICL) allows the model to rapidly adapt to various tasks with only a handful of prompts and examples.
While effective, the existing VICL paradigm exhibits poor generalizability under distribution shifts.
We propose test-time Visual In-Context Tuning (VICT), a method that can adapt VICL models on the fly with a single test sample.
arXiv Detail & Related papers (2025-03-27T17:59:52Z) - ViSTa Dataset: Do vision-language models understand sequential tasks? [6.039062076849557]
Using vision-language models (VLMs) as reward models in reinforcement learning holds promise for reducing costs and improving safety.
We introduce ViSTa, a dataset for evaluating Vision-based understanding of Sequential Tasks.
ViSTa comprises over 4,000 videos with step-by-step descriptions in virtual home, Minecraft, and real-world environments.
arXiv Detail & Related papers (2024-11-20T11:19:22Z) - Vision Language Models are In-Context Value Learners [89.29486557646624]
We present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress.
Without any robot- or task-specific training, GVL can predict effective values in context, both zero-shot and few-shot, for more than 300 distinct real-world tasks.
arXiv Detail & Related papers (2024-11-07T09:17:50Z) - VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use [74.39058448757645]
We present VipAct, an agent framework that enhances vision-language models (VLMs) through specialized agent collaboration and tool use.
VipAct consists of an orchestrator agent, which manages task requirement analysis, planning, and coordination, along with specialized agents that handle specific tasks.
We evaluate VipAct on benchmarks featuring a diverse set of visual perception tasks, with experimental results demonstrating significant performance improvements.
arXiv Detail & Related papers (2024-10-21T18:10:26Z) - From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning [26.21049702284394]
Large vision language models (VLMs) combine large language models with vision encoders, demonstrating promise across various tasks.
We introduce VITask, a novel framework that enhances the task-specific adaptability of VLMs by integrating task-specific models (TSMs).
arXiv Detail & Related papers (2024-10-09T01:24:04Z) - @Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology [31.779074930032184]
Human-centered Assistive Technologies (ATs) for helping People with Visual Impairments (PVIs) are evolving into generalists, capable of performing multiple tasks simultaneously.
We first create a novel AT benchmark (@Bench) guided by a pre-design user study with PVIs.
Besides, we propose a novel AT model (@Model) that addresses all tasks simultaneously and can be expanded to more assistive functions for helping PVIs.
arXiv Detail & Related papers (2024-09-21T18:30:17Z) - Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning [53.93074108238167]
We construct Vision-Flan, the most diverse publicly available visual instruction tuning dataset to date.
We propose a two-stage instruction tuning framework, in which VLMs are finetuned on Vision-Flan and further tuned on GPT-4 synthesized data.
We find this two-stage tuning framework significantly outperforms the traditional single-stage visual instruction tuning framework.
arXiv Detail & Related papers (2024-02-18T19:38:44Z) - Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey [59.95153883166705]
Traditional computer vision generally solves each single task independently by a dedicated model with the task instruction implicitly designed in the model architecture.
Visual Instruction Tuning (VIT), which finetunes a large vision model with language as task instructions, has been intensively studied recently.
This work aims to provide a systematic review of visual instruction tuning, covering (1) the background that presents computer vision task paradigms and the development of VIT; (2) the foundations of VIT that introduce commonly used network architectures, visual instruction tuning frameworks and objectives, and evaluation setups and tasks; and (3) the commonly used datasets in visual instruction tuning and evaluation.
arXiv Detail & Related papers (2023-12-27T14:54:37Z) - Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [83.85856356798531]
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks.
It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences.
We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
arXiv Detail & Related papers (2023-12-19T18:53:01Z) - Hierarchical Side-Tuning for Vision Transformers [33.536948382414316]
Fine-tuning pre-trained Vision Transformers (ViTs) has showcased significant promise in enhancing visual recognition tasks.
Parameter-efficient transfer learning (PETL) has shown potential for achieving high performance with fewer parameter updates than full fine-tuning.
This paper introduces Hierarchical Side-Tuning (HST), an innovative PETL method facilitating the transfer of ViT models to diverse downstream tasks.
arXiv Detail & Related papers (2023-10-09T04:16:35Z) - Delving into Multimodal Prompting for Fine-grained Visual Classification [57.12570556836394]
Fine-grained visual classification (FGVC) involves categorizing fine subdivisions within a broader category.
Recent advancements in pre-trained vision-language models have demonstrated remarkable performance in various high-level vision tasks.
We propose a novel multimodal prompting solution, denoted as MP-FGVC, based on the Contrastive Language-Image Pre-training (CLIP) model.
arXiv Detail & Related papers (2023-09-16T07:30:52Z) - One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks [59.49639580525051]
Multimodal models are aimed at solving Vision and Language (V+L) tasks.
Current work assumes that a single pre-trained vision encoder (VE) can serve as a general-purpose encoder.
In this work, we focus on analysis and aim to understand whether the information stored within different VEs is complementary.
arXiv Detail & Related papers (2022-10-12T16:31:39Z) - Task Formulation Matters When Learning Continually: A Case Study in Visual Question Answering [58.82325933356066]
Continual learning aims to train a model incrementally on a sequence of tasks without forgetting previous knowledge.
We present a detailed study of how different settings affect performance for Visual Question Answering.
arXiv Detail & Related papers (2022-09-30T19:12:58Z)