UniAPO: Unified Multimodal Automated Prompt Optimization
- URL: http://arxiv.org/abs/2508.17890v1
- Date: Mon, 25 Aug 2025 10:56:39 GMT
- Title: UniAPO: Unified Multimodal Automated Prompt Optimization
- Authors: Qipeng Zhu, Yanzhe Chen, Huasong Zhong, Yan Li, Jie Chen, Zhixin Zhang, Junping Zhang, Zhenheng Yang
- Abstract summary: We present UniAPO: Unified Multimodal Automated Prompt Optimization, the first framework tailored for multimodal APO. UniAPO achieves consistent gains across text, image, and video benchmarks, establishing a unified framework for efficient and transferable prompt optimization.
- Score: 37.74430773789572
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prompting is fundamental to unlocking the full potential of large language models. To automate and enhance this process, automatic prompt optimization (APO) has been developed, demonstrating effectiveness primarily in text-only input scenarios. However, extending existing APO methods to multimodal tasks, such as video-language generation, introduces two core challenges: (i) visual token inflation, where long visual token sequences restrict context capacity and result in insufficient feedback signals; and (ii) a lack of process-level supervision, as existing methods focus on outcome-level supervision and overlook intermediate supervision, limiting prompt optimization. We present UniAPO: Unified Multimodal Automated Prompt Optimization, the first framework tailored for multimodal APO. UniAPO adopts an EM-inspired optimization process that decouples feedback modeling and prompt refinement, making the optimization more stable and goal-driven. To further address the aforementioned challenges, we introduce a short-long term memory mechanism: historical feedback mitigates context limitations, while historical prompts provide directional guidance for effective prompt optimization. UniAPO achieves consistent gains across text, image, and video benchmarks, establishing a unified framework for efficient and transferable prompt optimization.
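Reading the abstract, the EM-inspired process can be pictured as alternating feedback modeling (an E-step over current errors plus remembered feedback) and prompt refinement (an M-step steered by historical prompts). The sketch below is an illustration under those assumptions, not the authors' released code; `call_llm`, `eval_fn`, and the memory window are hypothetical stand-ins.

```python
# Illustrative EM-style APO loop in the spirit of UniAPO (not the authors'
# code). `call_llm` is a hypothetical stand-in for any LLM client.
from typing import Callable, List, Tuple

def call_llm(request: str) -> str:
    """Hypothetical LLM call; replace with a real API client."""
    return f"[llm output for: {request[:40]}...]"

def uniapo_style_loop(
    prompt: str,
    eval_fn: Callable[[str], Tuple[float, str]],  # returns (score, error trace)
    steps: int = 10,
    memory: int = 5,
) -> str:
    feedback_mem: List[str] = []  # short-term memory: recent feedback signals
    prompt_mem: List[str] = []    # long-term memory: historical prompts
    best_prompt, best_score = prompt, float("-inf")
    for _ in range(steps):
        score, errors = eval_fn(prompt)
        if score > best_score:
            best_prompt, best_score = prompt, score
        # E-step: model feedback from current errors plus remembered feedback,
        # so long visual inputs need not be replayed in full every round.
        feedback = call_llm(
            "Summarize what the prompt gets wrong.\n"
            f"Errors: {errors}\nPast feedback: {feedback_mem[-memory:]}"
        )
        feedback_mem.append(feedback)
        # M-step: refine the prompt, steered by historical prompts so the
        # update stays directional rather than oscillating.
        prompt_mem.append(prompt)
        prompt = call_llm(
            "Rewrite the prompt to address the feedback.\n"
            f"Prompt: {prompt}\nFeedback: {feedback}\n"
            f"Earlier prompts: {prompt_mem[-memory:]}"
        )
    return best_prompt
```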
Related papers
- Bi-Level Prompt Optimization for Multimodal LLM-as-a-Judge [21.61898421774144]
Large language models (LLMs) have become widely adopted as automated judges for evaluating AI-generated content. Despite their success, aligning LLM-based evaluations with human judgments remains challenging. We propose BLPO, a bi-level prompt optimization framework that converts images into textual representations while preserving evaluation-relevant visual cues.
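One plausible reading of the bi-level setup, sketched below as an assumption rather than the paper's implementation: a lower level turns each image into an evaluation-oriented textual description, and an upper level optimizes the judge prompt purely in text space. `caption_llm` and `judge_llm` are hypothetical stubs.

```python
# Hypothetical sketch: judge images in text space by first converting them
# to descriptions that keep evaluation-relevant cues. Not the BLPO code.
def caption_llm(image_path: str) -> str:
    """Hypothetical captioner; should retain cues a judge needs
    (artifacts, layout, embedded text), not just a generic caption."""
    return f"<evaluation-oriented description of {image_path}>"

def judge_llm(request: str) -> str:
    """Hypothetical text-only judge model."""
    return f"[verdict for: {request[:40]}...]"

def judge(judge_prompt: str, image_path: str, candidate: str) -> str:
    # Lower level: image -> text, done once per image.
    visual_context = caption_llm(image_path)
    # Upper level: only `judge_prompt` is optimized, entirely in text space.
    return judge_llm(f"{judge_prompt}\nImage: {visual_context}\nAnswer: {candidate}")
```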
arXiv Detail & Related papers (2026-02-11T20:22:13Z)
- Learning from Prompt itself: the Hierarchical Attribution Prompt Optimization [13.8868879878572]
A structured optimization approach requires automated or semi-automated procedures to develop improved prompts. Current prompt optimization methods often induce prompt drift, where new prompts fix prior failures but impair performance on previously successful tasks. This study proposes the Hierarchical Prompt Optimization framework, which introduces three innovations: (1) a dynamic attribution mechanism targeting error patterns in training data and prompting history, (2) semantic-unit optimization for editing functional prompt segments, and (3) multimodal-friendly progression supporting both end-to-end LLM and LLM-MLLM pipelines.
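The drift-avoidance idea of editing only the responsible prompt segment can be sketched as below. This is an assumed reading of the abstract, not the released framework; the pattern-to-segment mapping and `rewrite_llm` are hypothetical.

```python
# Hypothetical sketch of attribution-guided, segment-level prompt editing.
from collections import Counter

def rewrite_llm(segment: str, pattern: str) -> str:
    """Hypothetical LLM call that rewrites one segment to fix one pattern."""
    return f"[{segment} revised to address {pattern}]"

def attribute(failures: list) -> str:
    """Dynamic attribution: pick the dominant error pattern among recent
    failures (here, a simple majority over labeled patterns)."""
    return Counter(f["pattern"] for f in failures).most_common(1)[0][0]

def edit_one_segment(segments: dict, failures: list) -> dict:
    pattern = attribute(failures)
    # Semantic-unit optimization: touch only the segment blamed for the
    # pattern, leaving segments that already work untouched (limits drift).
    blame = {"format_error": "output_format", "missing_step": "instructions"}
    target = blame.get(pattern, "instructions")  # hypothetical mapping
    segments = dict(segments)
    segments[target] = rewrite_llm(segments[target], pattern)
    return segments
```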
arXiv Detail & Related papers (2026-01-06T03:34:17Z)
- Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs [65.46953412737419]
We introduce the new problem of multimodal prompt optimization, which expands the prior definition of prompt optimization to the multimodal space defined by pairs of textual and non-textual prompts. We show that MPO outperforms leading text-only optimization methods, establishing multimodal prompt optimization as a crucial step to realizing the potential of MLLMs.
arXiv Detail & Related papers (2025-10-10T09:41:25Z)
- P3: Prompts Promote Prompting [26.16464064171255]
Large language model (LLM) applications often employ multi-component prompts, comprising both system and user prompts. In this work, we introduce P3, a novel self-improvement framework that concurrently optimizes both system and user prompts. Extensive experiments on general tasks demonstrate that P3 achieves superior performance in the realm of automatic prompt optimization.
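Concurrent optimization of the two prompt components could be approximated by simple alternation, as in the hypothetical sketch below; P3's actual self-improvement procedure may differ, and `score_fn` and `refine` are assumed stand-ins.

```python
# Hypothetical sketch: alternately refine system and user prompts while
# holding the other fixed, keeping the best-scoring pair seen so far.
from typing import Callable, Tuple

def optimize_pair(
    system_p: str,
    user_p: str,
    score_fn: Callable[[str, str], float],
    refine: Callable[[str, str], str],  # (prompt to edit, fixed counterpart)
    rounds: int = 6,
) -> Tuple[str, str, float]:
    best = (system_p, user_p, score_fn(system_p, user_p))
    for r in range(rounds):
        if r % 2 == 0:
            # Hold the user prompt fixed; refine the system prompt against it.
            system_p = refine(system_p, user_p)
        else:
            # Hold the system prompt fixed; refine the user prompt against it.
            user_p = refine(user_p, system_p)
        s = score_fn(system_p, user_p)
        if s > best[2]:
            best = (system_p, user_p, s)
    return best
```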
arXiv Detail & Related papers (2025-07-21T14:37:46Z)
- Rethinking Prompt Optimization: Reinforcement, Diversification, and Migration in Blackbox LLMs [10.434732630519377]
We propose a novel Automatic Prompt Optimization (APO) framework centered on enhancing the feedback mechanism. To mitigate the noise inherent in LLM-generated feedback, we introduce a technique called feedback diversification. Our approach consistently outperforms strong baselines, achieving significant accuracy improvements, faster convergence, and lower computational costs.
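Feedback diversification, as described, amounts to sampling several critiques and keeping only their consensus. A minimal sketch, assuming a hypothetical `call_llm` client with a temperature knob:

```python
# Sketch of feedback diversification: sample several critiques at high
# temperature and merge them, damping noise from any single sample.
def call_llm(request: str, temperature: float = 0.0) -> str:
    """Hypothetical LLM call; replace with a real API client."""
    return f"[llm output at T={temperature}]"

def diversified_feedback(prompt: str, errors: str, k: int = 5) -> str:
    critiques = [
        call_llm(
            f"Critique this prompt given these failures:\n{prompt}\n{errors}",
            temperature=1.0,  # high temperature -> diverse critiques
        )
        for _ in range(k)
    ]
    # Aggregate: keep only points that recur across the sampled critiques.
    return call_llm(
        "Merge these critiques, keeping only points raised by a majority:\n"
        + "\n---\n".join(critiques),
        temperature=0.0,
    )
```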
arXiv Detail & Related papers (2025-07-14T00:20:14Z)
- ORPP: Self-Optimizing Role-playing Prompts to Enhance Language Model Capabilities [64.24517317344959]
High-quality prompts are crucial for eliciting outstanding performance from large language models on complex tasks. We propose ORPP, a framework that enhances model performance by optimizing and generating role-playing prompts. We show that ORPP not only matches but in most cases surpasses existing mainstream prompt optimization methods in terms of performance.
arXiv Detail & Related papers (2025-06-03T05:51:35Z)
- GAPO: Learning Preferential Prompt through Generative Adversarial Policy Optimization [28.85371253733727]
We introduce Generative Adversarial Policy Optimization (GAPO), a novel framework that combines GAN-based training dynamics with an encoder-only reward model. Extensive experiments demonstrate GAPO's superior performance across multiple benchmarks.
arXiv Detail & Related papers (2025-03-26T03:37:52Z)
- M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning [90.75075886543404]
Multimodal Large Language Models (MLLMs) demonstrate remarkable performance across a wide range of domains.
In this work, we introduce a novel Multimodal Prompt Tuning (M$^2$PT) approach for efficient instruction tuning of MLLMs.
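A generic prompt-tuning sketch in PyTorch shows the basic mechanism: trainable soft-prompt vectors are prepended to frozen visual and textual token streams. Layer placement and dimensions in M$^2$PT itself may differ; treat this as an illustration.

```python
# Generic multimodal soft-prompt tuning sketch (illustrative, not M^2PT).
import torch
import torch.nn as nn

class SoftPrompts(nn.Module):
    def __init__(self, n_visual: int = 8, n_text: int = 8, d_model: int = 768):
        super().__init__()
        # Only these parameters are trained; the MLLM backbone stays frozen.
        self.visual = nn.Parameter(torch.randn(n_visual, d_model) * 0.02)
        self.text = nn.Parameter(torch.randn(n_text, d_model) * 0.02)

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor):
        b = visual_tokens.size(0)
        v = torch.cat([self.visual.expand(b, -1, -1), visual_tokens], dim=1)
        t = torch.cat([self.text.expand(b, -1, -1), text_tokens], dim=1)
        return v, t

# Usage: prompts = SoftPrompts(); v, t = prompts(img_feats, txt_embeds)
# Train with torch.optim.AdamW(prompts.parameters(), lr=1e-3).
```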
arXiv Detail & Related papers (2024-09-24T01:40:24Z)
- QPO: Query-dependent Prompt Optimization via Multi-Loop Offline Reinforcement Learning [58.767866109043055]
We introduce Query-dependent Prompt Optimization (QPO), which iteratively fine-tunes a small pretrained language model to generate optimal prompts tailored to the input queries. We derive insights from offline prompting demonstration data, which already exists in large quantities as a by-product of benchmarking diverse prompts on open-sourced tasks. Experiments on various LLM scales and diverse NLP and math tasks demonstrate the efficacy and cost-efficiency of our method in both zero-shot and few-shot scenarios.
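The offline data side can be pictured as mining (query, prompt, score) triples from logged benchmark runs and keeping the best prompt per query; reward-filtered behavior cloning on the result is a simple proxy for the multi-loop offline RL the abstract describes, not the paper's algorithm.

```python
# Sketch: turn logged prompting runs into a query -> best-prompt
# fine-tuning set for a small prompt-generator LM (illustrative proxy).
from typing import Dict, List, Tuple

def build_finetune_set(
    logs: List[Tuple[str, str, float]],  # (query, prompt, score)
) -> List[Dict[str, str]]:
    best: Dict[str, Tuple[str, float]] = {}
    for query, prompt, score in logs:
        if query not in best or score > best[query][1]:
            best[query] = (prompt, score)
    # Fine-tune a small LM on these query -> best prompt pairs.
    return [{"input": q, "target": p} for q, (p, _) in best.items()]
```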
arXiv Detail & Related papers (2024-08-20T03:06:48Z)
- mDPO: Conditional Preference Optimization for Multimodal Large Language Models [52.607764280030196]
Direct preference optimization (DPO) has been shown to be an effective method for large language model (LLM) alignment.
Recent works have attempted to apply DPO to multimodal scenarios but have found it challenging to achieve consistent improvement.
We propose mDPO, a multimodal DPO objective that prevents the over-prioritization of language-only preferences by also optimizing image preference.
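An illustrative DPO-style objective with an added image-preference term follows, in the spirit of mDPO but hedged: this is not the paper's exact formulation, and the corrupted-image pairing and all inputs are assumptions. Inputs are log-probabilities of the response under the policy and reference models.

```python
# Illustrative text + image preference loss (not the mDPO implementation).
import math

def dpo_term(lp_w: float, lp_l: float, ref_w: float, ref_l: float,
             beta: float = 0.1) -> float:
    margin = beta * ((lp_w - ref_w) - (lp_l - ref_l))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

def mdpo_style_loss(lp: dict, beta: float = 0.1) -> float:
    # Text preference: chosen vs. rejected response, image held fixed.
    text = dpo_term(lp["resp_w"], lp["resp_l"],
                    lp["ref_resp_w"], lp["ref_resp_l"], beta)
    # Image preference: chosen response held fixed, original image preferred
    # over a corrupted one, so language-only preferences cannot dominate.
    image = dpo_term(lp["img_orig"], lp["img_corrupt"],
                     lp["ref_img_orig"], lp["ref_img_corrupt"], beta)
    return text + image
```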
arXiv Detail & Related papers (2024-06-17T17:59:58Z)
- SEE: Strategic Exploration and Exploitation for Cohesive In-Context Prompt Optimization [8.975505323004427]
We propose a novel Cohesive In-Context Prompt Optimization framework for Large Language Models (LLMs). We introduce SEE, a scalable and efficient prompt optimization framework that adopts metaheuristic optimization principles and strategic exploration and exploitation. SEE significantly outperforms state-of-the-art baseline methods by a large margin, achieving an average performance gain of 13.94 while reducing computational costs by 58.67.
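A generic explore/exploit search over prompt candidates illustrates the strategy named in the abstract; SEE's actual metaheuristic operators are not shown, and `mutate` and `score_fn` are hypothetical.

```python
# Generic exploration/exploitation prompt search (not the SEE algorithm).
import random
from typing import Callable, List, Tuple

def explore_exploit_search(
    seed: str,
    mutate: Callable[[str], str],    # e.g., an LLM-based prompt rewriter
    score_fn: Callable[[str], float],
    iters: int = 20,
    explore_p: float = 0.4,
) -> Tuple[str, float]:
    pool: List[Tuple[str, float]] = [(seed, score_fn(seed))]
    for _ in range(iters):
        if random.random() < explore_p:
            parent = random.choice(pool)[0]            # explore: any candidate
        else:
            parent = max(pool, key=lambda t: t[1])[0]  # exploit: current best
        child = mutate(parent)
        pool.append((child, score_fn(child)))
    return max(pool, key=lambda t: t[1])
```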
arXiv Detail & Related papers (2024-02-17T17:47:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.