POS: A Prompts Optimization Suite for Augmenting Text-to-Video Generation
- URL: http://arxiv.org/abs/2311.00949v3
- Date: Mon, 10 Jun 2024 03:16:09 GMT
- Title: POS: A Prompts Optimization Suite for Augmenting Text-to-Video Generation
- Authors: Shijie Ma, Huayi Xu, Mengjian Li, Weidong Geng, Yaxiong Wang, Meng Wang
- Abstract summary: This paper aims to enhance diffusion-based text-to-video generation by improving its two input prompts: the noise and the text.
We propose POS, a training-free Prompt Optimization Suite to boost text-to-video models.
- Score: 11.556147036111222
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper aims to enhance diffusion-based text-to-video generation by improving its two input prompts: the noise and the text. To this end, we propose POS, a training-free Prompt Optimization Suite to boost text-to-video models. POS is motivated by two observations: (1) Video generation is unstable with respect to noise. Given the same text, different noises lead to videos that differ significantly in both frame quality and temporal consistency. This observation implies that there exists an optimal noise matched to each textual input; to capture it, we propose an optimal noise approximator. Specifically, the optimal noise approximator first searches for a video that closely relates to the text prompt and then inverts it into the noise space to serve as an improved noise prompt for the textual input. (2) Improving the text prompt via LLMs often causes semantic deviation. Many existing text-to-vision works have utilized LLMs to improve text prompts and enhance generation. However, existing methods often neglect the semantic alignment between the original text and the rewritten one. In response, we design a semantic-preserving rewriter that imposes constraints in both the rewriting and denoising phases to preserve semantic consistency. Extensive experiments on popular benchmarks show that POS improves text-to-video models by a clear margin. The code will be open-sourced.
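To make the two components concrete, the following is a minimal Python sketch of the pipeline the abstract describes. It is an illustration only, not the authors' implementation: `retrieve_video`, `ddim_invert`, `rewrite`, and `embed` are hypothetical hooks, and the 0.9 similarity threshold is an assumed value standing in for the paper's semantic-alignment constraint.

```python
"""Illustrative sketch of the POS ideas; not the released code."""
from typing import Callable

import numpy as np


def optimal_noise_prompt(text: str,
                         retrieve_video: Callable[[str], np.ndarray],
                         ddim_invert: Callable[[np.ndarray], np.ndarray]) -> np.ndarray:
    """Approximate the optimal noise for a text prompt.

    `retrieve_video` (text -> frames) and `ddim_invert` (frames -> latent noise)
    are hypothetical hooks: a video closely related to the text is searched for
    and then inverted back into the diffusion noise space.
    """
    reference = retrieve_video(text)   # video that closely matches the prompt
    return ddim_invert(reference)      # inverted latents act as the noise prompt


def semantic_preserving_rewrite(text: str,
                                rewrite: Callable[[str], str],
                                embed: Callable[[str], np.ndarray],
                                sim_threshold: float = 0.9) -> str:
    """Rewrite the prompt with an LLM while keeping it semantically close.

    The cosine-similarity check and the 0.9 threshold are illustrative stand-ins
    for the paper's semantic constraint on the rewriting phase.
    """
    candidate = rewrite(text)
    a, b = embed(text), embed(candidate)
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return candidate if cosine >= sim_threshold else text
```

Generation would then proceed as usual, with the inverted latents replacing random Gaussian noise and the checked rewrite replacing the raw prompt.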
Related papers
- Text-Audio-Visual-conditioned Diffusion Model for Video Saliency Prediction [17.85550556489256]
We propose TAVDiff, a Text-Audio-Visual-conditioned Diffusion Model for video saliency prediction.
To effectively utilize text, a large multimodal model is used to generate textual descriptions for video frames.
The auditory modality is used as an additional conditioning signal that directs the model to focus on salient regions indicated by sounds.
arXiv Detail & Related papers (2025-04-19T11:30:54Z) - Mimir: Improving Video Diffusion Models for Precise Text Understanding [53.72393225042688]
Text serves as the key control signal in video generation due to its narrative nature.
The recent success of large language models (LLMs) showcases the power of decoder-only transformers.
This work addresses this challenge with Mimir, an end-to-end training framework featuring a carefully tailored token fuser.
arXiv Detail & Related papers (2024-12-04T07:26:44Z) - Human Speech Perception in Noise: Can Large Language Models Paraphrase to Improve It? [26.835947209927273]
Large Language Models (LLMs) can generate text by transferring style attributes such as formality, resulting in formal or informal text.
We conduct the first study to evaluate LLMs on a novel task of generating acoustically intelligible paraphrases for better human speech perception in noise.
Our approach resulted in a 40% relative improvement in human speech perception by paraphrasing utterances that are highly distorted in a listening condition with babble noise at a signal-to-noise ratio (SNR) of -5 dB.
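The listening condition quoted here (babble noise at an SNR of -5 dB) can be reproduced with a few lines of signal arithmetic. The sketch below is a generic illustration of mixing noise at a target SNR, not the paper's evaluation code; the placeholder signals are random arrays.

```python
import numpy as np


def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech` so the mixture has the requested SNR in dB
    (assumes `noise` is at least as long as `speech`)."""
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10 * log10(p_speech / p_noise_scaled) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise


# Example: at -5 dB SNR the scaled noise power is roughly 3.2x the speech power.
speech = np.random.randn(16000)   # placeholder 1 s of 16 kHz "speech"
babble = np.random.randn(16000)   # placeholder babble noise
mixture = mix_at_snr(speech, babble, snr_db=-5.0)
```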
arXiv Detail & Related papers (2024-08-07T18:24:23Z) - DNTextSpotter: Arbitrary-Shaped Scene Text Spotting via Improved Denoising Training [17.734265617973293]
We propose a novel denoising training method (DNTextSpotter) for arbitrary-shaped text spotting.
DNTextSpotter decomposes the queries of the denoising part into noised positional queries and noised content queries.
It outperforms the state-of-the-art methods on four benchmarks.
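The noised-query idea follows the general recipe of denoising training for DETR-style detectors. The sketch below is a loose illustration of that recipe, not DNTextSpotter's actual query construction; the jitter scale, flip probability, and tensor shapes are assumptions.

```python
import torch


def make_denoising_queries(gt_points: torch.Tensor, gt_labels: torch.Tensor,
                           num_classes: int, pos_sigma: float = 0.02,
                           flip_prob: float = 0.2):
    """Build noised positional and content queries from ground truth.

    gt_points: [N, K, 2] normalized control points per text instance (assumed shape).
    gt_labels: [N, L] character indices of each transcription (assumed shape).
    """
    # Noised positional queries: slightly perturb the normalized coordinates.
    pos_queries = (gt_points + pos_sigma * torch.randn_like(gt_points)).clamp(0.0, 1.0)

    # Noised content queries: randomly replace a fraction of the characters.
    flip_mask = torch.rand(gt_labels.shape) < flip_prob
    random_chars = torch.randint(0, num_classes, gt_labels.shape)
    content_queries = torch.where(flip_mask, random_chars, gt_labels)
    return pos_queries, content_queries
```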
arXiv Detail & Related papers (2024-08-01T07:52:07Z) - Learning Robust Named Entity Recognizers From Noisy Data With Retrieval Augmentation [67.89838237013078]
Named entity recognition (NER) models often struggle with noisy inputs.
We propose a more realistic setting in which only noisy text and its NER labels are available.
We employ a multi-view training framework that improves robust NER without retrieving text during inference.
arXiv Detail & Related papers (2024-07-26T07:30:41Z) - MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., a sentence typically matches only a fraction of the prominent foreground video content and offers limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of the visual information unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z) - Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", with dedicated components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z) - FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling [85.60543452539076]
Existing video generation models are typically trained on a limited number of frames, resulting in the inability to generate high-fidelity long videos during inference.
This study explores the potential of extending the text-driven capability to generate longer videos conditioned on multiple texts.
We propose FreeNoise, a tuning-free and time-efficient paradigm to enhance the generative capabilities of pretrained video diffusion models.
arXiv Detail & Related papers (2023-10-23T17:59:58Z) - Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment [16.304894187743013]
TEFAL is a TExt-conditioned Feature ALignment method that produces both audio and video representations conditioned on the text query.
Our approach employs two independent cross-modal attention blocks that enable the text to attend to the audio and video representations separately.
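The two cross-attention blocks can be pictured with a short PyTorch sketch; this is a simplified illustration of text-conditioned feature alignment, not the released TEFAL model, and the 512-dimensional, 8-head sizes are assumptions.

```python
import torch
from torch import nn


class TextConditionedAlignment(nn.Module):
    """Two independent cross-attention blocks: the text query attends to audio
    features and to video features separately (a simplified sketch of the idea)."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.text_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_to_video = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text: torch.Tensor, audio: torch.Tensor, video: torch.Tensor):
        # text: [B, Lt, D], audio: [B, La, D], video: [B, Lv, D]
        audio_cond, _ = self.text_to_audio(query=text, key=audio, value=audio)
        video_cond, _ = self.text_to_video(query=text, key=video, value=video)
        # Text-conditioned audio and video representations, pooled e.g. for retrieval.
        return audio_cond.mean(dim=1), video_cond.mean(dim=1)
```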
arXiv Detail & Related papers (2023-07-24T17:43:13Z) - Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z) - DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment [30.38594416942543]
We propose a novel and personalized text-to-sound generation approach with visual alignment based on latent diffusion models, namely DiffAVA.
Our DiffAVA leverages a multi-head attention transformer to aggregate temporal information from video features, and a dual multi-modal residual network to fuse temporal visual representations with text embeddings.
Experimental results on the AudioCaps dataset demonstrate that the proposed DiffAVA can achieve competitive performance on visual-aligned text-to-audio generation.
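The two ingredients named above lend themselves to a rough sketch: multi-head self-attention aggregates temporal information from per-frame video features, and a residual step fuses the result with text embeddings. This is an illustrative module under that reading, not the DiffAVA implementation; dimensions are assumed.

```python
import torch
from torch import nn


class TemporalVisualTextFusion(nn.Module):
    """Self-attention over per-frame video features, then residual fusion with text."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.video_proj = nn.Linear(dim, dim)
        self.text_proj = nn.Linear(dim, dim)

    def forward(self, frame_feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # frame_feats: [B, T, D] per-frame features, text_emb: [B, D].
        temporal, _ = self.temporal_attn(frame_feats, frame_feats, frame_feats)
        video_vec = temporal.mean(dim=1)  # aggregate over time
        # Residual fusion of the two modalities (one reading of "dual residual").
        return text_emb + self.text_proj(text_emb) + self.video_proj(video_vec)
```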
arXiv Detail & Related papers (2023-05-22T10:37:27Z) - Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z) - Text2Video: Text-driven Talking-head Video Synthesis with Phonetic Dictionary [10.590649169151055]
We present a novel approach to synthesizing video from text.
The method builds a phoneme-pose dictionary and trains a generative adversarial network (GAN) to generate video.
Compared to audio-driven video generation algorithms, our approach has a number of advantages.
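A phoneme-pose dictionary can be pictured as a lookup table plus interpolation. The sketch below is a toy illustration of that data structure, not the paper's pipeline: the poses are random placeholders, the 68-landmark layout is an assumption, and the GAN renderer that would consume the pose sequence is omitted.

```python
import numpy as np

# Toy phoneme-pose dictionary: each phoneme maps to a key pose vector
# (random placeholders here; the paper builds these entries from recorded data).
PHONEME_POSES = {p: np.random.randn(68 * 2) for p in ["AH", "B", "K", "T", "SIL"]}


def phonemes_to_pose_sequence(phonemes: list[str], frames_per_phoneme: int = 3) -> np.ndarray:
    """Look up a key pose per phoneme and linearly interpolate between them,
    yielding a [T, 136] pose sequence that could drive a pose-to-video generator."""
    keys = [PHONEME_POSES.get(p, PHONEME_POSES["SIL"]) for p in phonemes]
    seq = []
    for a, b in zip(keys[:-1], keys[1:]):
        for i in range(frames_per_phoneme):
            t = i / frames_per_phoneme
            seq.append((1 - t) * a + t * b)  # simple linear interpolation
    seq.append(keys[-1])
    return np.stack(seq)


poses = phonemes_to_pose_sequence(["SIL", "AH", "B", "K", "SIL"])
# `poses` would then condition a GAN generator (not shown) that renders the frames.
```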
arXiv Detail & Related papers (2021-04-29T19:54:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.