Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs
- URL: http://arxiv.org/abs/2510.09201v1
- Date: Fri, 10 Oct 2025 09:41:25 GMT
- Title: Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs
- Authors: Yumin Choi, Dongki Kim, Jinheon Baek, Sung Ju Hwang,
- Abstract summary: We introduce the new problem of multimodal prompt optimization, which expands the prior definition of prompt optimization to the multimodal space defined by the pairs of non-textual prompts.<n>We show that MPO outperforms leading text-only optimization methods, establishing multimodal prompt optimization as a crucial step to realizing the potential of MLLMs.
- Score: 65.46953412737419
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have shown remarkable success, and their multimodal expansions (MLLMs) further unlock capabilities spanning images, videos, and other modalities beyond text. However, despite this shift, prompt optimization approaches, designed to reduce the burden of manual prompt crafting while maximizing performance, remain confined to text, ultimately limiting the full potential of MLLMs. Motivated by this gap, we introduce the new problem of multimodal prompt optimization, which expands the prior definition of prompt optimization to the multimodal space defined by the pairs of textual and non-textual prompts. To tackle this problem, we then propose the Multimodal Prompt Optimizer (MPO), a unified framework that not only performs the joint optimization of multimodal prompts through alignment-preserving updates but also guides the selection process of candidate prompts by leveraging earlier evaluations as priors in a Bayesian-based selection strategy. Through extensive experiments across diverse modalities that go beyond text, such as images, videos, and even molecules, we demonstrate that MPO outperforms leading text-only optimization methods, establishing multimodal prompt optimization as a crucial step to realizing the potential of MLLMs.
Related papers
- Bi-Level Prompt Optimization for Multimodal LLM-as-a-Judge [21.61898421774144]
Large language models (LLMs) have become widely adopted as automated judges for evaluating AI-generated content.<n>Despite their success, aligning LLM-based evaluations with human judgments remains challenging.<n>We propose BLPO, a bi-level prompt optimization framework that converts images into textual representations while preserving evaluation-relevant visual cues.
arXiv Detail & Related papers (2026-02-11T20:22:13Z) - UniAPO: Unified Multimodal Automated Prompt Optimization [37.74430773789572]
We present UniAPO: Unified Multimodal Automated Prompt Optimization, the first framework tailored for multimodal APO.<n>UniAPO mitigates consistent gains across text, image, and video benchmarks, establishing a unified framework for efficient and transferable prompt optimization.
arXiv Detail & Related papers (2025-08-25T10:56:39Z) - GReaTer: Gradients over Reasoning Makes Smaller Language Models Strong Prompt Optimizers [52.17222304851524]
We introduce GReaTer, a novel prompt optimization technique that directly incorporates gradient information over task-specific reasoning.<n>By utilizing task loss gradients, GReaTer enables self-optimization of prompts for open-source, lightweight language models.<n> GReaTer consistently outperforms previous state-of-the-art prompt optimization methods.
arXiv Detail & Related papers (2024-12-12T20:59:43Z) - M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning [90.75075886543404]
Multimodal Large Language Models (MLLMs) demonstrate remarkable performance across a wide range of domains.
In this work, we introduce a novel Multimodal Prompt Tuning (M$2$PT) approach for efficient instruction tuning of MLLMs.
arXiv Detail & Related papers (2024-09-24T01:40:24Z) - QPO: Query-dependent Prompt Optimization via Multi-Loop Offline Reinforcement Learning [58.767866109043055]
We introduce Query-dependent Prompt Optimization (QPO), which iteratively fine-tune a small pretrained language model to generate optimal prompts tailored to the input queries.<n>We derive insights from offline prompting demonstration data, which already exists in large quantities as a by-product of benchmarking diverse prompts on open-sourced tasks.<n> Experiments on various LLM scales and diverse NLP and math tasks demonstrate the efficacy and cost-efficiency of our method in both zero-shot and few-shot scenarios.
arXiv Detail & Related papers (2024-08-20T03:06:48Z) - Fine-tuning Multimodal Large Language Models for Product Bundling [53.01642741096356]
We introduce Bundle-MLLM, a novel framework that fine-tunes large language models (LLMs) through a hybrid item tokenization approach.<n>Specifically, we integrate textual, media, and relational data into a unified tokenization, introducing a soft separation token to distinguish between textual and non-textual tokens.<n>We propose a progressive optimization strategy that fine-tunes LLMs for disentangled objectives: 1) learning bundle patterns and 2) enhancing multimodal semantic understanding specific to product bundling.
arXiv Detail & Related papers (2024-07-16T13:30:14Z) - MAPO: Boosting Large Language Model Performance with Model-Adaptive Prompt Optimization [73.7779735046424]
We show that different prompts should be adapted to different Large Language Models (LLM) to enhance their capabilities across various downstream tasks in NLP.
We then propose a model-adaptive prompt (MAPO) method that optimize the original prompts for each specific LLM in downstream tasks.
arXiv Detail & Related papers (2024-07-04T18:39:59Z) - How Multimodal Integration Boost the Performance of LLM for
Optimization: Case Study on Capacitated Vehicle Routing Problems [33.33996058215666]
Large language models (LLMs) have positioned themselves as capable tools for addressing complex optimization challenges.
We propose to enhance the optimization performance using multimodal LLM capable of processing both textual and visual prompts.
arXiv Detail & Related papers (2024-03-04T06:24:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.