Synthesizing Visual Concepts as Vision-Language Programs
- URL: http://arxiv.org/abs/2511.18964v1
- Date: Mon, 24 Nov 2025 10:30:33 GMT
- Title: Synthesizing Visual Concepts as Vision-Language Programs
- Authors: Antonia Wüst, Wolfgang Stammer, Hikaru Shindo, Lukas Helff, Devendra Singh Dhami, Kristian Kersting
- Abstract summary: Vision-Language models (VLMs) achieve strong performance on multimodal tasks but often fail at systematic visual reasoning tasks. We propose Vision-Language Programs, which combine the perceptual flexibility of VLMs with the systematic reasoning of program synthesis.
- Score: 39.89156658712712
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language models (VLMs) achieve strong performance on multimodal tasks but often fail at systematic visual reasoning tasks, leading to inconsistent or illogical outputs. Neuro-symbolic methods promise to address this by inducing interpretable logical rules, though they exploit rigid, domain-specific perception modules. We propose Vision-Language Programs (VLP), which combine the perceptual flexibility of VLMs with systematic reasoning of program synthesis. Rather than embedding reasoning inside the VLM, VLP leverages the model to produce structured visual descriptions that are compiled into neuro-symbolic programs. The resulting programs execute directly on images, remain consistent with task constraints, and provide human-interpretable explanations that enable easy shortcut mitigation. Experiments on synthetic and real-world datasets demonstrate that VLPs outperform direct and structured prompting, particularly on tasks requiring complex logical reasoning.
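The abstract describes a pipeline in which the VLM emits a structured visual description that is compiled into a neuro-symbolic program, which then executes directly on images. Below is a minimal, hypothetical Python sketch of that idea; the `query_vlm` stub, the predicate DSL, and the enumerative synthesizer are illustrative assumptions, not the paper's actual DSL or implementation.

```python
# Hypothetical sketch of the VLP pipeline: (1) a VLM turns an image into a
# structured scene description, (2) a small logical program is synthesized
# over that description from labeled examples, (3) the program is executed
# on new images. All names below (query_vlm, PREDICATES, synthesize) are
# illustrative placeholders, not the paper's API.
from itertools import product

def query_vlm(image):
    """Stub for the perception step: a real system would prompt a VLM to
    emit objects as attribute dicts; here the 'image' is already symbolic."""
    return image

# A tiny predicate DSL for the synthesizer to compose (assumed, not the paper's DSL).
PREDICATES = {
    "exists_red":    lambda scene: any(o["color"] == "red" for o in scene),
    "exists_circle": lambda scene: any(o["shape"] == "circle" for o in scene),
    "count_ge_3":    lambda scene: len(scene) >= 3,
}
COMBINATORS = {
    "AND": lambda f, g: lambda s: f(s) and g(s),
    "OR":  lambda f, g: lambda s: f(s) or g(s),
}

def synthesize(examples):
    """Enumerate single predicates and pairwise combinations, returning the
    first program consistent with every labeled example."""
    candidates = list(PREDICATES.items())
    for (n1, f1), (n2, f2) in product(PREDICATES.items(), repeat=2):
        for op, comb in COMBINATORS.items():
            candidates.append((f"({n1} {op} {n2})", comb(f1, f2)))
    for name, program in candidates:
        if all(program(query_vlm(img)) == label for img, label in examples):
            return name, program
    return None, None

# Toy concept: a scene is positive iff it contains a red object AND a circle.
train = [
    ([{"color": "red",  "shape": "circle"}], True),
    ([{"color": "blue", "shape": "circle"}], False),
    ([{"color": "red",  "shape": "square"}], False),
    ([{"color": "red",  "shape": "square"},
      {"color": "green", "shape": "circle"}], True),
]
rule, program = synthesize(train)
print("synthesized rule:", rule)   # a human-readable explanation of the concept
print("prediction:", program(query_vlm([{"color": "red", "shape": "circle"}])))
```

Because the synthesized rule is a named composition of predicates, it doubles as a human-readable explanation, which is the property the abstract highlights for shortcut mitigation.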
Related papers
- A Neurosymbolic Agent System for Compositional Visual Reasoning [31.649454833851863]
Existing vision-language models (VLMs) remain challenged by compositional visual reasoning. This paper presents a neuro-symbolic approach to developing a Vision-Language Agent system for efficient compositional visual reasoning.
arXiv Detail & Related papers (2025-06-09T13:55:55Z)
- Integrating Visual Interpretation and Linguistic Reasoning for Math Problem Solving [61.992824291296444]
Current large vision-language models (LVLMs) typically employ a connector module to link visual features with text embeddings of large language models (LLMs). This paper proposes a paradigm shift: instead of training end-to-end vision-language reasoning models, we advocate for developing a decoupled reasoning framework.
arXiv Detail & Related papers (2025-05-23T08:18:00Z)
- VIPER: Visual Perception and Explainable Reasoning for Sequential Decision-Making [17.313485392764353]
VIPER is a novel framework for multimodal instruction-based planning. It integrates VLM-based perception with LLM-based reasoning. We show that VIPER significantly outperforms state-of-the-art visual instruction-based planners.
arXiv Detail & Related papers (2025-03-19T11:05:42Z)
- Proof of Thought: Neurosymbolic Program Synthesis allows Robust and Interpretable Reasoning [1.3003982724617653]
Large Language Models (LLMs) have revolutionized natural language processing, yet they struggle with inconsistent reasoning.
This research introduces Proof of Thought, a framework that enhances the reliability and transparency of LLM outputs.
Key contributions include a robust type system with sort management for enhanced logical integrity, and an explicit representation of rules that clearly distinguishes factual from inferential knowledge.
arXiv Detail & Related papers (2024-09-25T18:35:45Z)
- Large Language Models are Interpretable Learners [53.56735770834617]
In this paper, we show that a combination of Large Language Models (LLMs) and symbolic programs can bridge the gap between expressiveness and interpretability.
The pretrained LLM with natural language prompts provides a massive set of interpretable modules that can transform raw input into natural language concepts.
As the knowledge learned by an LSP (LLM-based symbolic program) is a combination of natural language descriptions and symbolic rules, it is easily transferable to humans (interpretable) and to other LLMs.
arXiv Detail & Related papers (2024-06-25T02:18:15Z)
- Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement [93.73648674743097]
Visual program synthesis is a promising approach to exploit the reasoning abilities of large language models for compositional computer vision tasks.
Previous work has used few-shot prompting with frozen LLMs to synthesize visual programs.
No dataset of visual programs for training exists, and such a dataset cannot easily be acquired through crowdsourcing.
arXiv Detail & Related papers (2024-04-06T13:25:00Z)
- VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding [65.12464615430036]
This paper introduces a Video Understanding and Reasoning Framework (VURF) based on the reasoning power of Large Language Models (LLMs). Ours is a novel approach to extend the utility of LLMs in the context of video tasks, leveraging their capacity to generalize from minimal input and output demonstrations within a contextual framework.
arXiv Detail & Related papers (2024-03-21T18:00:00Z)
- Image Translation as Diffusion Visual Programmers [52.09889190442439]
Diffusion Visual Programmer (DVP) is a neuro-symbolic image translation framework.
Our framework seamlessly embeds a condition-flexible diffusion model within the GPT architecture.
Extensive experiments demonstrate DVP's remarkable performance, surpassing concurrent arts.
arXiv Detail & Related papers (2024-01-18T05:50:09Z)