Related papers: Vision-Driven Prompt Optimization for Large Language Models in Multimodal Generative Tasks

URL: http://arxiv.org/abs/2501.02527v1
Date: Sun, 05 Jan 2025 13:01:47 GMT
Title: Vision-Driven Prompt Optimization for Large Language Models in Multimodal Generative Tasks
Authors: Leo Franklin, Apiradee Boonmee, Kritsada Wongsuwan
Abstract summary: Vision-Driven Prompt Optimization (VDPO) generates textual prompts from visual inputs, guiding high-fidelity image synthesis. VDPO consistently outperforms existing methods, achieving significant improvements in FID, LPIPS, and BLEU/CIDEr scores. Human evaluations further validate the practical superiority of VDPO in generating visually appealing and semantically coherent outputs.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision generation remains a challenging frontier in artificial intelligence, requiring seamless integration of visual understanding and generative capabilities. In this paper, we propose a novel framework, Vision-Driven Prompt Optimization (VDPO), that leverages Large Language Models (LLMs) to dynamically generate textual prompts from visual inputs, guiding high-fidelity image synthesis. VDPO combines a visual embedding prompt tuner, a textual instruction generator, and a vision generation module to achieve state-of-the-art performance in diverse vision generation tasks. Extensive experiments on benchmarks such as COCO and Sketchy demonstrate that VDPO consistently outperforms existing methods, achieving significant improvements in FID, LPIPS, and BLEU/CIDEr scores. Additional analyses reveal the scalability, robustness, and generalization capabilities of VDPO, making it a versatile solution for in-domain and out-of-domain tasks. Human evaluations further validate the practical superiority of VDPO in generating visually appealing and semantically coherent outputs.

Related papers

VizGenie: Toward Self-Refining, Domain-Aware Workflows for Next-Generation Scientific Visualization [12.826592849136215]
VizGenie is a framework that advances scientific visualization through large language model (LLM). A distinctive feature of VizGenie is its intuitive natural language interface, allowing users to issue high-level feature-based queries.
arXiv Detail & Related papers (2025-07-18T23:54:22Z)
Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models [50.98559225639266]
We investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories. Our findings reveal that multilayer features provide complementary strengths with varying task dependencies, and uniform fusion leads to suboptimal performance. We propose the instruction-guided vision aggregator, a module that dynamically integrates multi-layer visual features based on textual instructions.
arXiv Detail & Related papers (2024-12-26T05:41:31Z)
VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use [74.39058448757645]
We present VipAct, an agent framework that enhances vision-language models (VLMs). VipAct consists of an orchestrator agent, which manages task requirement analysis, planning, and coordination, along with specialized agents that handle specific tasks. We evaluate VipAct on benchmarks featuring a diverse set of visual perception tasks, with experimental results demonstrating significant performance improvements.
arXiv Detail & Related papers (2024-10-21T18:10:26Z)
VL-GLUE: A Suite of Fundamental yet Challenging Visuo-Linguistic Reasoning Tasks [48.67062958311173]
VL-GLUE is a multitask benchmark for natural language understanding. We show that this benchmark is quite challenging for existing large-scale vision-language models.
arXiv Detail & Related papers (2024-10-17T15:27:17Z)
Harmonizing Visual Text Comprehension and Generation [31.605599298507293]
We present TextHarmony, a unified and versatile multimodal generative model proficient in comprehending and generating visual text. We propose Slide-LoRA, which aggregates modality-specific and modality-agnostic LoRA experts, partially decoupling the multimodal generation space. Comprehensive experiments across various benchmarks demonstrate the effectiveness of the proposed approach.
arXiv Detail & Related papers (2024-07-23T10:11:56Z)
Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation [87.50120181861362]
VisionPrefer is a high-quality and fine-grained preference dataset that captures multiple preference aspects. We train a reward model, VP-Score, over VisionPrefer to guide the training of text-to-image generative models; the preference prediction accuracy of VP-Score is comparable to that of human annotators.
arXiv Detail & Related papers (2024-04-23T14:53:15Z)
Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models [56.76307866160105]
We propose a contrastive learning framework, termed Document Object COntrastive learning (DoCo). DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and align them to the visual features generated by the vision encoder of Large Visual-Language Models (LVLMs). We demonstrate that the proposed DoCo serves as a plug-and-play pre-training method, which can be employed in the pre-training of various LVLMs without inducing any increase in computational complexity during the inference process.
arXiv Detail & Related papers (2024-02-29T10:17:27Z)
Question Aware Vision Transformer for Multimodal Reasoning [14.188369270753347]
We introduce QA-ViT, a Question Aware Vision Transformer approach for multimodal reasoning. It embeds question awareness directly within the vision encoder. This integration results in dynamic visual features that focus on the image aspects relevant to the posed question.
arXiv Detail & Related papers (2024-02-08T08:03:39Z)
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [83.85856356798531]
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks. It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences. We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
arXiv Detail & Related papers (2023-12-19T18:53:01Z)
Delving into Multimodal Prompting for Fine-grained Visual Classification [57.12570556836394]
Fine-grained visual classification (FGVC) involves categorizing fine subdivisions within a broader category. Recent advancements in pre-trained vision-language models have demonstrated remarkable performance in various high-level vision tasks. We propose a novel multimodal prompting solution, denoted as MP-FGVC, based on the contrastive language-image pre-training (CLIP) model.
arXiv Detail & Related papers (2023-09-16T07:30:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.