Visual Program Distillation with Template-Based Augmentation
- URL: http://arxiv.org/abs/2412.08564v3
- Date: Sun, 25 May 2025 06:38:41 GMT
- Title: Visual Program Distillation with Template-Based Augmentation
- Authors: Michal Shlapentokh-Rothman, Yu-Xiong Wang, Derek Hoiem
- Abstract summary: We propose a low-cost visual program distillation method that requires no human-generated program annotations. With a relatively small amount of question/answer data, small language models can generate high-quality specialized visual programs with the added benefit of much faster inference.
- Score: 36.09275994799905
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adapting visual programming or prompting large language models (LLMs) to generate executable code for visual tasks like visual question answering (VQA) for specialized tasks or domains remains challenging due to high annotation and inference costs. We propose a low-cost visual program distillation method that can be used for models with at most 1 billion parameters and requires no human-generated program annotations. We achieve this through synthetic data augmentation based on decoupling programs into higher-level skills, called templates, and their corresponding arguments. Experimental results show that, with a relatively small amount of question/answer data, small language models can generate high-quality specialized visual programs with the added benefit of much faster inference.
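The decoupling described in the abstract, splitting a visual program into a reusable template (a higher-level skill) and its swappable arguments, can be sketched in a few lines. This is a minimal illustration under assumed conventions; the program syntax, the `<ARGi>` slot notation, and the helper names are hypothetical, not taken from the paper:

```python
import re

def to_template(program: str):
    """Extract quoted string arguments and replace them with numbered slots."""
    args = re.findall(r'"([^"]*)"', program)
    template = program
    for i, a in enumerate(args):
        template = template.replace(f'"{a}"', f'<ARG{i}>', 1)
    return template, args

def instantiate(template: str, args):
    """Fill a template's slots with a new argument list."""
    for i, a in enumerate(args):
        template = template.replace(f'<ARG{i}>', f'"{a}"')
    return template

# Mine a template from one seed program, then synthesize new training
# pairs by swapping in arguments drawn from question/answer data.
seed = 'count(find(image, "dog"))'
template, _ = to_template(seed)  # 'count(find(image, <ARG0>))'
augmented = [instantiate(template, [obj]) for obj in ["cat", "car", "person"]]
```

Given a handful of seed programs, crossing mined templates with argument values extracted from question/answer data is what makes the synthetic augmentation cheap: no new program annotations are needed, only new arguments.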
Related papers
- From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis [38.256412418893554]
We explore multi-step reasoning in vision-language models (VLMs)
We first introduce a least-to-most visual reasoning paradigm, which interleaves steps of decomposing a question into sub-questions and invoking external tools to resolve them.
We propose a novel data synthesis approach that can automatically create questions and multi-step reasoning paths for an image.
arXiv Detail & Related papers (2024-06-28T14:04:10Z)
- Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning [59.13366859237086]
Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm.
We consider visual prompts as additional knowledge that facilitates language models in addressing tasks associated with visual information.
We introduce a novel approach, wherein visual prompts are concatenated with the weights of the FFN for visual knowledge injection.
arXiv Detail & Related papers (2024-05-09T08:23:20Z)
- Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement [93.73648674743097]
Visual program synthesis is a promising approach to exploit the reasoning abilities of large language models for compositional computer vision tasks.
Previous work has used few-shot prompting with frozen LLMs to synthesize visual programs.
No dataset of visual programs exists for training, and visual program annotation cannot be easily crowdsourced.
arXiv Detail & Related papers (2024-04-06T13:25:00Z)
- Learning to Prompt with Text Only Supervision for Vision-Language Models [107.282881515667]
One branch of methods adapts CLIP by learning prompts using visual information.
An alternative approach resorts to training-free methods by generating class descriptions from large language models.
We propose to combine the strengths of both streams by learning prompts using only text data.
arXiv Detail & Related papers (2024-01-04T18:59:49Z)
- A Prompt Learning Framework for Source Code Summarization [19.24919436211323]
This paper proposes an effective prompt learning framework for code summarization called PromptCS. PromptCS trains a prompt agent that can generate continuous prompts to unleash the potential of large language models in code summarization.
arXiv Detail & Related papers (2023-12-26T14:37:55Z)
- Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models [17.540937747712082]
We propose Visual Program Distillation (VPD), an instruction tuning framework that produces a vision-language model (VLM)
VPD distills the reasoning ability of large language models by using them to sample multiple candidate programs.
It translates each correct program into a language description of the reasoning steps, which are then distilled into a VLM.
arXiv Detail & Related papers (2023-12-05T18:58:37Z)
- De-fine: Decomposing and Refining Visual Programs with Auto-Feedback [75.62712247421146]
De-fine is a training-free framework that decomposes complex tasks into simpler subtasks and refines programs through auto-feedback.
Our experiments across various visual tasks show that De-fine creates more robust programs.
arXiv Detail & Related papers (2023-11-21T06:24:09Z)
- Learning to Plan with Natural Language [111.76828049344839]
Large Language Models (LLMs) have shown remarkable performance in various basic natural language tasks.
To complete complex tasks, we still need a plan for the task to guide LLMs in generating specific solutions step by step.
We propose the Learning to Plan method, which involves two phases: (1) in the first, learning-task-plan phase, it iteratively updates the task plan with new step-by-step solutions and behavioral instructions, which are obtained by prompting LLMs to derive them from training error feedback.
arXiv Detail & Related papers (2023-04-20T17:09:12Z)
- Low-code LLM: Graphical User Interface over Large Language Models [115.08718239772107]
This paper introduces a novel human-LLM interaction framework, Low-code LLM.
It incorporates six types of simple low-code visual programming interactions to achieve more controllable and stable responses.
We highlight three advantages of the low-code LLM: user-friendly interaction, controllable generation, and wide applicability.
arXiv Detail & Related papers (2023-04-17T09:27:40Z)
- Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning [153.98100182439165]
We introduce a Retrieval-augmented Visual Language Model, Re-ViLM, built upon the Flamingo.
By storing certain knowledge explicitly in an external database, our approach reduces the number of model parameters.
We demonstrate that Re-ViLM significantly boosts performance for image-to-text generation tasks.
arXiv Detail & Related papers (2023-02-09T18:57:56Z)
- Using Large Language Models to Generate Engaging Captions for Data Visualizations [51.98253121636079]
Large language models (LLMs) use sophisticated deep learning technology to produce human-like prose.
A key challenge lies in designing the most effective prompt for the LLM, a task called prompt engineering.
We report on first experiments using the popular LLM GPT-3 and deliver some promising results.
arXiv Detail & Related papers (2022-12-27T23:56:57Z)
- Transformer-based Program Synthesis for Low-Data Environments [0.0]
Large pre-trained transformer models (GPT2/3, T5) have found use in program synthesis to generate programs that satisfy a set of input/output examples.
We investigate an approach that tackles both of these issues, by using attributed context-free grammars of programming languages to generate programs.
We first find that synthesized datasets can be made efficiently and can provide transformer models with enough data.
We also find that giving models access to program attributes is especially effective in low-data environments.
arXiv Detail & Related papers (2022-05-18T23:33:33Z)
- Learning compositional programs with arguments and sampling [12.790055619773565]
We train a machine learning model to discover a program that satisfies specific requirements.
We extend a state-of-the-art model, AlphaNPI, by learning to generate functions that can accept arguments.
arXiv Detail & Related papers (2021-09-01T21:27:41Z)
- How to Design Sample and Computationally Efficient VQA Models [53.65668097847456]
We find that representing the text as probabilistic programs and images as object-level scene graphs best satisfy these desiderata.
We extend existing models to leverage these soft programs and scene graphs to train on question answer pairs in an end-to-end manner.
arXiv Detail & Related papers (2021-03-22T01:48:16Z)
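To make the last entry's representation concrete, images as object-level scene graphs and questions as programs executed over them, here is a minimal sketch. The `SceneObject` class, the `filter_objs` helper, and the example scene are all illustrative assumptions, not the cited model's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    """One node of an object-level scene graph: an object with attributes."""
    name: str
    attributes: dict = field(default_factory=dict)

def filter_objs(objects, **attrs):
    """Keep objects whose attributes match every given key/value pair."""
    return [o for o in objects
            if all(o.attributes.get(k) == v for k, v in attrs.items())]

scene = [
    SceneObject("cube",   {"color": "red"}),
    SceneObject("cube",   {"color": "blue"}),
    SceneObject("sphere", {"color": "red"}),
]

# "How many red cubes are there?" expressed as a tiny program over the graph
cubes = [o for o in scene if o.name == "cube"]
answer = len(filter_objs(cubes, color="red"))  # 1
```

Executing questions as small compositional programs over a symbolic scene representation like this is what allows such models to train efficiently on question/answer pairs alone.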
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.