Enhancing Visual Programming for Visual Reasoning via Probabilistic Graphs
- URL: http://arxiv.org/abs/2512.14257v1
- Date: Tue, 16 Dec 2025 10:07:40 GMT
- Title: Enhancing Visual Programming for Visual Reasoning via Probabilistic Graphs
- Authors: Wentao Wan, Kaiyu Wu, Qingyang Ma, Nan Kang, Yunjie Chen, Liang Lin, Keze Wang
- Abstract summary: We propose EVPG, a method to enhance Visual Programming for visual reasoning via Probabilistic Graphs. We show significant performance improvements for VP on three classical complex VR tasks: GQA, NLVRv2, and Open Images.
- Score: 47.54638493412879
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Visual Programming (VP) based on large language models (LLMs) has developed rapidly and demonstrated significant potential on complex Visual Reasoning (VR) tasks. Previous work on enhancing VP has focused primarily on improving the quality of LLM-generated visual programs, while neglecting to optimize the VP-invoked pre-trained models that serve as modules for the visual sub-tasks into which VP decomposes the target task. The difficulty is that only final labels for the target VR tasks are available, not labels for the sub-tasks. Moreover, the non-differentiable nature of VP prevents the direct use of efficient gradient-based optimization methods that could leverage the final labels for end-to-end learning of the entire VP framework. To overcome these issues, we propose EVPG, a method to Enhance Visual Programming for visual reasoning via Probabilistic Graphs. Specifically, we build a directed probabilistic graph from the variable dependency relationships that arise during VP execution, reconstructing the non-differentiable VP execution as differentiable exact probabilistic inference on this graph. This enables the VP framework to use the final labels for efficient, gradient-based, end-to-end supervised learning on the target VR tasks. Extensive and comprehensive experiments demonstrate the effectiveness and advantages of EVPG, showing significant performance improvements for VP on three classical complex VR tasks: GQA, NLVRv2, and Open Images.
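The core mechanism, turning non-differentiable program execution into exact inference on a probabilistic graph, can be illustrated with a toy example. Below is a minimal PyTorch sketch under illustrative assumptions (a fixed two-step "find, then answer" program with toy FindModule/AnswerModule stand-ins, not the paper's implementation); the paper's construction builds such graphs automatically from the variable dependencies of arbitrary generated programs.

```python
# Minimal PyTorch sketch of the idea behind EVPG (illustrative only, not
# the paper's implementation): keep each VP-invoked module's output as a
# probability distribution and marginalize exactly over the intermediate
# program variables, so the final answer probability is differentiable.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FindModule(nn.Module):
    """Toy stand-in for a VP-invoked grounding model: p(object | image)."""
    def __init__(self, feat_dim, n_objects):
        super().__init__()
        self.head = nn.Linear(feat_dim, n_objects)

    def forward(self, feats):                        # feats: (B, feat_dim)
        return torch.softmax(self.head(feats), -1)   # (B, n_objects)

class AnswerModule(nn.Module):
    """Toy stand-in for a VP-invoked VQA model: p(answer | image, object)."""
    def __init__(self, feat_dim, n_objects, n_answers):
        super().__init__()
        self.obj_emb = nn.Embedding(n_objects, feat_dim)
        self.head = nn.Linear(feat_dim, n_answers)

    def forward(self, feats, obj_idx):               # obj_idx: (B,) long
        h = feats + self.obj_emb(obj_idx)            # condition on the object
        return torch.softmax(self.head(h), -1)       # (B, n_answers)

def soft_execute(feats, find_mod, ans_mod, n_objects):
    """Differentiable execution of the two-step program find() -> answer().

    Hard VP execution would argmax p(object) and call AnswerModule once;
    that argmax blocks gradients. Exact marginalization over the
    intermediate variable, p(a) = sum_o p(o) * p(a | o), does not.
    """
    p_obj = find_mod(feats)                                          # (B, O)
    p_ans = torch.stack(
        [ans_mod(feats, torch.full((feats.size(0),), o, dtype=torch.long))
         for o in range(n_objects)],
        dim=1,
    )                                                                # (B, O, A)
    return torch.einsum("bo,boa->ba", p_obj, p_ans)                  # (B, A)

# Final-answer labels now drive ordinary cross-entropy training through
# both modules end to end, with no sub-task labels required.
feats = torch.randn(4, 64)
find_mod, ans_mod = FindModule(64, 5), AnswerModule(64, 5, 10)
labels = torch.randint(0, 10, (4,))
loss = F.nll_loss(torch.log(soft_execute(feats, find_mod, ans_mod, 5) + 1e-9), labels)
loss.backward()
```

Exact inference is cheap here because the toy program has a single intermediate variable; the paper's graph construction extends the same marginalization to the full dependency structure of a generated program.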
Related papers
- Enhancing Visual Prompting through Expanded Transformation Space and Overfitting Mitigation [0.9137554315375919]
Visual prompting (VP) has emerged as a promising parameter-efficient fine-tuning approach for adapting pre-trained vision models to downstream tasks.
We propose ACAVP (Affine, Color, and Additive Visual Prompting), which enhances VP's expressive power by introducing complementary transformation operations.
We show that ACAVP achieves state-of-the-art accuracy among VP methods, surpasses linear probing in average accuracy, and exhibits superior robustness to distribution shifts (a minimal sketch of input-space prompting follows this list).
arXiv Detail & Related papers (2025-10-09T06:08:15Z)
- Learning to See and Act: Task-Aware View Planning for Robotic Manipulation [88.37482534484627]
Task-Aware View Planning (TAVP) is a framework designed to integrate active view planning with task-specific representation learning.
Our proposed TAVP model achieves superior performance over state-of-the-art fixed-view approaches.
arXiv Detail & Related papers (2025-08-07T09:21:20Z)
- Event-Priori-Based Vision-Language Model for Efficient Visual Understanding [13.540340702321911]
The Event-Priori-Based Vision-Language Model (EP-VLM) uses motion priors derived from dynamic event vision to improve VLM inference efficiency.
arXiv Detail & Related papers (2025-06-09T10:45:35Z)
- AutoVP: An Automated Visual Prompting Framework and Benchmark [66.5618543577204]
Visual prompting (VP) is an emerging parameter-efficient fine-tuning approach for adapting pre-trained vision models to various downstream image-classification tasks.
We propose AutoVP, an end-to-end expandable framework for automating VP design choices, along with a benchmark of 12 downstream image-classification tasks.
Our experimental results show that AutoVP outperforms the best-known current VP methods by a substantial margin.
arXiv Detail & Related papers (2023-10-12T14:55:31Z)
- A Stepwise Distillation Learning Strategy for Non-differentiable Visual Programming Frameworks on Visual Reasoning Tasks [48.181520570707654]
We propose a Stepwise Distillation learning strategy (SDVP) for non-differentiable visual programming frameworks (VProg) across various VR tasks.
SDVP stepwise distills the capabilities of existing, well-trained small task-specific models for visual sub-tasks in VProg into the much larger VLMs invoked by the corresponding visual sub-modules (a minimal sketch of the distillation step follows this entry).
arXiv Detail & Related papers (2023-09-18T14:28:47Z)
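Since the SDVP entry above hinges on distilling small task-specific models into the VLM-backed sub-modules, a hedged sketch of that objective may help. The following is a minimal knowledge-distillation step under toy assumptions (linear stand-ins for the teacher and the student sub-module; the temperature-scaled KL loss is the standard distillation formulation, not necessarily the paper's exact objective):

```python
# Minimal sketch of one stepwise-distillation step (illustrative
# assumptions: toy linear teacher/student; SDVP targets the VLMs behind
# VProg's visual sub-modules).
import torch
import torch.nn as nn
import torch.nn.functional as F

def distill_step(student, teacher, sub_task_inputs, optimizer, T=2.0):
    """One distillation step for a single visual sub-task.

    The well-trained small model supplies soft targets, so no sub-task
    labels are needed -- only inputs routed to this sub-module by VProg.
    """
    with torch.no_grad():
        t_logits = teacher(sub_task_inputs)
    s_logits = student(sub_task_inputs)
    # Temperature-scaled soft-target KL divergence, the standard
    # knowledge-distillation loss.
    loss = F.kl_div(
        F.log_softmax(s_logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with linear stand-ins for the teacher and the student head.
teacher = nn.Linear(32, 8)   # small, well-trained task-specific model
student = nn.Linear(32, 8)   # stand-in for the invoked VLM sub-module
opt = torch.optim.SGD(student.parameters(), lr=0.1)
distill_step(student, teacher, torch.randn(16, 32), opt)
```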
- Rethinking Visual Prompt Learning as Masked Visual Token Modeling [106.71983630652323]
We propose Visual Prompt learning as masked visual Token Modeling (VPTM), which reformulates downstream visual classification as the pre-trained masked visual token prediction task.
VPTM is the first visual prompting method built on a generative pre-trained visual model, achieving consistency between pre-training and downstream classification through this task reformulation.
arXiv Detail & Related papers (2023-03-09T02:43:10Z)
- Understanding and Improving Visual Prompting: A Label-Mapping Perspective [63.89295305670113]
We revisit and advance visual prompting (VP), an input prompting technique for vision tasks.
We propose a new VP framework, termed ILM-VP, which automatically re-maps the source labels to the target labels.
Our proposal significantly outperforms state-of-the-art VP methods.
arXiv Detail & Related papers (2022-11-21T16:49:47Z)
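Note that several of the related entries above (ACAVP, AutoVP, VPTM, ILM-VP) use "VP" for input-space visual prompting, a different technique from the visual programming of the main paper. As referenced in the ACAVP entry, here is a minimal sketch of that technique under illustrative assumptions (a learnable additive pixel frame around each image, with the pre-trained classifier kept frozen; the frame layout and sizes are not any one paper's design):

```python
# Minimal sketch of input-space visual prompting: a learnable frame of
# pixels is added to each image while the pre-trained model stays frozen.
# Frame layout and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class PadPrompt(nn.Module):
    def __init__(self, image_size=224, pad=30):
        super().__init__()
        inner = image_size - 2 * pad
        self.inner = inner
        # Learnable border pixels; the image interior is left untouched.
        self.top = nn.Parameter(torch.zeros(3, pad, image_size))
        self.bottom = nn.Parameter(torch.zeros(3, pad, image_size))
        self.left = nn.Parameter(torch.zeros(3, inner, pad))
        self.right = nn.Parameter(torch.zeros(3, inner, pad))

    def forward(self, x):  # x: (B, 3, image_size, image_size)
        interior = x.new_zeros(3, self.inner, self.inner)
        mid = torch.cat([self.left, interior, self.right], dim=-1)
        prompt = torch.cat([self.top, mid, self.bottom], dim=-2)
        return x + prompt  # the prompt broadcasts over the batch dimension

# Toy usage: train only the prompt; the backbone stays frozen.
prompt = PadPrompt()
images = torch.randn(2, 3, 224, 224)
prompted = prompt(images)  # feed this to the frozen pre-trained classifier
```

Only the prompt parameters receive gradients; label-mapping methods such as ILM-VP additionally learn how the frozen model's source classes map onto the target classes.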
This list is automatically generated from the titles and abstracts of the papers on this site.