RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling
- URL: http://arxiv.org/abs/2510.20206v1
- Date: Thu, 23 Oct 2025 04:45:09 GMT
- Title: RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling
- Authors: Bingjie Gao, Qianli Ma, Xiaoxue Wu, Shuai Yang, Guanzhou Lan, Haonan Zhao, Jiaxuan Chen, Qingyang Liu, Yu Qiao, Xinyuan Chen, Yaohui Wang, Li Niu
- Abstract summary: RAPO++ is a cross-stage prompt optimization framework. It unifies training-data-aligned refinement, test-time iterative scaling, and large language model fine-tuning. RAPO++ achieves significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility.
- Score: 59.088798018184235
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prompt design plays a crucial role in text-to-video (T2V) generation, yet user-provided prompts are often short, unstructured, and misaligned with training data, limiting the generative potential of diffusion-based T2V models. We present **RAPO++**, a cross-stage prompt optimization framework that unifies training-data-aligned refinement, test-time iterative scaling, and large language model (LLM) fine-tuning to substantially improve T2V generation without modifying the underlying generative backbone. In **Stage 1**, Retrieval-Augmented Prompt Optimization (RAPO) enriches user prompts with semantically relevant modifiers retrieved from a relation graph and refactors them to match training distributions, enhancing compositionality and multi-object fidelity. **Stage 2** introduces Sample-Specific Prompt Optimization (SSPO), a closed-loop mechanism that iteratively refines prompts using multi-source feedback -- including semantic alignment, spatial fidelity, temporal coherence, and task-specific signals such as optical flow -- yielding progressively improved video generation quality. **Stage 3** leverages optimized prompt pairs from SSPO to fine-tune the rewriter LLM, internalizing task-specific optimization patterns and enabling efficient, high-quality prompt generation even before inference. Extensive experiments across five state-of-the-art T2V models and five benchmarks demonstrate that RAPO++ achieves significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility, outperforming existing methods by large margins. Our results highlight RAPO++ as a model-agnostic, cost-efficient, and scalable solution that sets a new standard for prompt optimization in T2V generation. The code is available at https://github.com/Vchitect/RAPO.
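The three-stage flow described in the abstract can be sketched as a simple pipeline: Stage 1 enriches the prompt from a relation graph, Stage 2 iterates with a feedback score, and Stage 3 collects (raw, optimized) pairs for rewriter fine-tuning. Every function, data structure, and scoring rule below is a hypothetical stand-in for illustration, not the authors' implementation.

```python
def retrieve_modifiers(prompt, relation_graph):
    """Stage 1 helper: collect modifiers whose key concept appears in the prompt."""
    return [m for key, mods in relation_graph.items() if key in prompt for m in mods]

def rapo_refine(prompt, relation_graph):
    """Stage 1 (RAPO): enrich the user prompt with retrieved modifiers."""
    mods = retrieve_modifiers(prompt, relation_graph)
    return prompt + ", " + ", ".join(mods) if mods else prompt

def feedback_score(prompt):
    """Stand-in for multi-source feedback (semantic alignment, optical flow, ...);
    here it simply rewards more descriptive prompts."""
    return len(prompt.split())

def sspo_loop(prompt, rewrite, rounds=3):
    """Stage 2 (SSPO): closed-loop refinement; keep a rewrite only if feedback improves."""
    best, best_score = prompt, feedback_score(prompt)
    for _ in range(rounds):
        candidate = rewrite(best)
        score = feedback_score(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best

# Toy relation graph and rewriter, standing in for the retrieval index and rewriter LLM.
graph = {"cat": ["fluffy fur", "slow motion"]}
rewriter = lambda p: p if "cinematic" in p else p + ", cinematic lighting"

user_prompt = "a cat jumps"
stage1 = rapo_refine(user_prompt, graph)  # training-data-aligned refinement
stage2 = sspo_loop(stage1, rewriter)      # test-time iterative scaling
pair = (user_prompt, stage2)              # Stage 3: pair kept for LLM fine-tuning
print(stage2)
```

In this toy run the prompt grows only while the feedback score improves, mirroring the closed-loop acceptance criterion; the real system would replace `feedback_score` with learned reward models and `rewriter` with the fine-tuned LLM.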
Related papers
- Retrieval, Refinement, and Ranking for Text-to-Video Generation via Prompt Optimization and Test-Time Scaling [1.6671050178877669]
Large-scale datasets have driven significant progress in Text-to-Video (T2V) generative models. Current methods for improving video output often fall short. We introduce 3R, a novel RAG-based prompt optimization framework.
arXiv Detail & Related papers (2026-03-02T06:35:59Z) - RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment [37.59966317174412]
We introduce RAISE, a training-free, requirement-driven evolutionary framework for adaptive T2I generation. RAISE formulates image generation as a requirement-driven adaptive scaling process. On GenEval and DrawBench, RAISE attains state-of-the-art alignment.
arXiv Detail & Related papers (2026-02-28T05:53:01Z) - IAR2: Improving Autoregressive Visual Generation with Semantic-Detail Associated Token Prediction [77.06211178777939]
IAR2 is an advanced autoregressive framework that enables a hierarchical semantic-detail synthesis process. We show that IAR2 sets a new state of the art for autoregressive image generation, achieving an FID of 1.50 on ImageNet.
arXiv Detail & Related papers (2025-10-08T12:08:21Z) - Structured Information for Improving Spatial Relationships in Text-to-Image Generation [23.552628360388823]
This work introduces a lightweight approach that augments prompts with structured information, using a fine-tuned language model for automatic conversion and seamless integration into T2I pipelines. Experimental results demonstrate substantial improvements in spatial accuracy, without compromising image quality as measured by Inception Score. This structured information provides a practical and portable solution for enhancing spatial relationships in T2I generation, addressing a key limitation of current generative systems.
arXiv Detail & Related papers (2025-09-19T13:20:34Z) - The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation [40.73687553764341]
We introduce RAPO, a novel Retrieval-Augmented Prompt Optimization framework. RAPO refines user prompts through dual optimization branches, selecting the superior prompt for T2V generation. Extensive experiments demonstrate that RAPO can effectively enhance both the static and dynamic dimensions of generated videos.
arXiv Detail & Related papers (2025-04-16T03:33:25Z) - Fast Prompt Alignment for Text-to-Image Generation [28.66112701912297]
This paper introduces Fast Prompt Alignment (FPA), a prompt optimization framework that leverages a one-pass approach. FPA uses large language models (LLMs) for single-iteration prompt paraphrasing, followed by fine-tuning or in-context learning with optimized prompts. FPA achieves competitive text-image alignment scores at a fraction of the processing time.
arXiv Detail & Related papers (2024-12-11T18:58:41Z) - PGSO: Prompt-based Generative Sequence Optimization Network for Aspect-based Sentiment Analysis [9.617652261815671]
We introduce two sequence optimization strategies: rule-based static optimization and score-based dynamic optimization. Building on the dynamic optimization structure, we propose a unified Prompt-based Generative Sequence Optimization network (PGSO). Experiments conducted on four ABSA tasks across multiple benchmarks indicate that PGSO outperforms state-of-the-art methods, with an average improvement of 3.52% in F1 score.
arXiv Detail & Related papers (2024-12-01T10:49:55Z) - Minority-Focused Text-to-Image Generation via Prompt Optimization [57.319845580050924]
We investigate the generation of minority samples using pretrained text-to-image (T2I) latent diffusion models. We develop an online prompt optimization framework that encourages the emergence of desired properties during inference. We then tailor this generic prompt distribution into a specialized solver that promotes the generation of minority features.
arXiv Detail & Related papers (2024-10-10T11:56:09Z) - In-context Demonstration Matters: On Prompt Optimization for Pseudo-Supervision Refinement [71.60563181678323]
Large language models (LLMs) have achieved great success across diverse tasks, and fine-tuning is sometimes needed to further enhance generation quality. To handle these challenges, a direct solution is to generate "high-confidence" data from unsupervised downstream tasks. We propose a novel approach, the pseudo-supervised demonstrations aligned prompt optimization (PAPO) algorithm, which jointly refines both the prompt and the overall pseudo-supervision.
arXiv Detail & Related papers (2024-10-04T03:39:28Z) - TS-HTFA: Advancing Time Series Forecasting via Hierarchical Text-Free Alignment with Large Language Models [14.411646409316624]
We introduce Hierarchical Text-Free Alignment (TS-HTFA), a novel method for time-series forecasting. We replace paired text data with adaptive virtual text based on QR-decomposition word embeddings and learnable prompts. Experiments on multiple time-series benchmarks demonstrate that TS-HTFA achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-09-23T12:57:24Z) - UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs [111.05657299071648]
UIO-LLMs is an incremental optimization approach for memory-enhanced transformers under long-context settings. We refine the training process using the Truncated Backpropagation Through Time (TBPTT) algorithm. UIO-LLMs successfully handle long contexts, for example extending the context window of Llama2-7b-chat from 4K to 100K tokens with only 2% additional parameters.
arXiv Detail & Related papers (2024-06-26T08:44:36Z) - Conditional Denoising Diffusion for Sequential Recommendation [62.127862728308045]
Two prominent generative models are Generative Adversarial Networks (GANs) and Variational AutoEncoders (VAEs); GANs suffer from unstable optimization, while VAEs are prone to posterior collapse and over-smoothed generations.
We present a conditional denoising diffusion model, which includes a sequence encoder, a cross-attentive denoising decoder, and a step-wise diffuser.
arXiv Detail & Related papers (2023-04-22T15:32:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.