Learning Profitable NFT Image Diffusions via Multiple Visual-Policy
Guided Reinforcement Learning
- URL: http://arxiv.org/abs/2306.11731v2
- Date: Thu, 17 Aug 2023 17:57:26 GMT
- Title: Learning Profitable NFT Image Diffusions via Multiple Visual-Policy
Guided Reinforcement Learning
- Authors: Huiguo He, Tianfu Wang, Huan Yang, Jianlong Fu, Nicholas Jing Yuan,
Jian Yin, Hongyang Chao, Qi Zhang
- Abstract summary: We propose a Diffusion-based generation framework with Multiple Visual-Policies as rewards for NFT images.
The proposed framework consists of a large language model (LLM), a diffusion-based image generator, and a series of visual rewards by design.
Our framework can generate NFT images showing more visually engaging elements and higher market value, compared with SOTA approaches.
- Score: 69.60868581184366
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the task of generating profitable Non-Fungible Token (NFT) images
from user-input texts. Recent advances in diffusion models have shown great
potential for image generation. However, existing works can fall short in
generating visually-pleasing and highly-profitable NFT images, mainly due to
the lack of 1) plentiful and fine-grained visual attribute prompts for an NFT
image, and 2) effective optimization metrics for generating high-quality NFT
images. To solve these challenges, we propose a Diffusion-based generation
framework with Multiple Visual-Policies as rewards (i.e., Diffusion-MVP) for
NFT images. The proposed framework consists of a large language model (LLM), a
diffusion-based image generator, and a series of visual rewards by design.
First, the LLM enhances a basic human input (such as "panda") by generating
more comprehensive NFT-style prompts that include specific visual attributes,
such as "panda with Ninja style and green background." Second, the
diffusion-based image generator is fine-tuned using a large-scale NFT dataset
to capture fine-grained image styles and accessory compositions of popular NFT
elements. Third, we further propose to utilize multiple visual-policies as
optimization goals, including visual rarity levels, visual aesthetic scores,
and CLIP-based text-image relevances. This design ensures that our proposed
Diffusion-MVP is capable of minting NFT images with high visual quality and
market value. To facilitate this research, we have collected the largest
publicly available NFT image dataset to date, consisting of 1.5 million
high-quality images with corresponding texts and market values. Extensive
experiments including objective evaluations and user studies demonstrate that
our framework can generate NFT images showing more visually engaging elements
and higher market value, compared with SOTA approaches.
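
The third step, using multiple visual policies as reinforcement-learning rewards, amounts to folding rarity, aesthetics, and CLIP relevance into one scalar signal that steers the fine-tuned generator. The abstract gives no reference code, so the sketch below is only an illustration: the weighted-sum form, the default weights, and the scorer interfaces are assumptions, not the authors' implementation.

```python
from typing import Callable, Sequence


def combined_reward(
    image,
    prompt: str,
    rarity_fn: Callable,           # visual rarity scorer (assumed interface)
    aesthetic_fn: Callable,        # visual aesthetic scorer (assumed interface)
    clip_relevance_fn: Callable,   # CLIP text-image relevance scorer (assumed interface)
    weights: Sequence[float] = (1.0, 1.0, 1.0),
) -> float:
    """Fold the three visual policies into one scalar reward.

    The weighted sum and the default weights are illustrative assumptions;
    the paper only states that rarity, aesthetics, and CLIP relevance are
    jointly used as optimization goals.
    """
    w_rare, w_aes, w_clip = weights
    return (
        w_rare * rarity_fn(image)
        + w_aes * aesthetic_fn(image)
        + w_clip * clip_relevance_fn(image, prompt)
    )


if __name__ == "__main__":
    # Toy usage with dummy scorers; real scorers would wrap trained models.
    dummy_image = object()
    reward = combined_reward(
        dummy_image,
        "panda with Ninja style and green background",
        rarity_fn=lambda img: 0.7,
        aesthetic_fn=lambda img: 0.6,
        clip_relevance_fn=lambda img, txt: 0.8,
    )
    print(f"combined reward: {reward:.2f}")
```

In a reinforcement-learning setup of this kind, such a scalar reward would typically re-weight the generator's update (for example, a reward-weighted likelihood or policy-gradient objective), nudging the diffusion model toward images that score well on all three policies at once.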
Related papers
- Flex3D: Feed-Forward 3D Generation With Flexible Reconstruction Model And Input View Curation [61.040832373015014]
We propose Flex3D, a novel framework for generating high-quality 3D content from text, single images, or sparse view images.
We employ a fine-tuned multi-view image diffusion model and a video diffusion model to generate a pool of candidate views, enabling a rich representation of the target 3D object.
In the second stage, the curated views are fed into a Flexible Reconstruction Model (FlexRM), built upon a transformer architecture that can effectively process an arbitrary number of inputs.
arXiv Detail & Related papers (2024-10-01T17:29:43Z)
- Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining [48.98105914356609]
Lumina-mGPT is a family of multimodal autoregressive models capable of various vision and language tasks.
We introduce Ominiponent Supervised Finetuning, transforming Lumina-mGPT into a foundation model that seamlessly achieves omnipotent task unification.
arXiv Detail & Related papers (2024-08-05T17:46:53Z)
- Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data [80.92268916571712]
A critical bottleneck is the scarcity of high-quality 3D objects with detailed captions.
We propose Bootstrap3D, a novel framework that automatically generates an arbitrary quantity of multi-view images.
We have generated 1 million high-quality synthetic multi-view images with dense descriptive captions.
arXiv Detail & Related papers (2024-05-31T17:59:56Z)
- Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model [80.61157097223058]
A prevalent strategy for bolstering image classification performance is to augment the training set with synthetic images generated by T2I models.
In this study, we scrutinize the shortcomings of both current generative and conventional data augmentation techniques.
We introduce an innovative inter-class data augmentation method known as Diff-Mix, which enriches the dataset by performing image translations between classes.
arXiv Detail & Related papers (2024-03-28T17:23:45Z)
- PROMPT-IML: Image Manipulation Localization with Pre-trained Foundation Models Through Prompt Tuning [35.39822183728463]
We present a novel Prompt-IML framework for detecting tampered images.
Humans tend to discern the authenticity of an image based on semantic and high-frequency information.
Our model can achieve better performance on eight typical fake image datasets.
arXiv Detail & Related papers (2024-01-01T03:45:07Z)
- TF-ICON: Diffusion-Based Training-Free Cross-Domain Image Composition [13.087647740473205]
TF-ICON is a framework that harnesses the power of text-driven diffusion models for cross-domain image-guided composition.
TF-ICON can leverage off-the-shelf diffusion models to perform cross-domain image-guided composition without requiring additional training, finetuning, or optimization.
Our experiments show that Stable Diffusion equipped with the exceptional prompt outperforms state-of-the-art inversion methods on various datasets.
arXiv Detail & Related papers (2023-07-24T02:50:44Z)
- Diverse Image Inpainting with Bidirectional and Autoregressive Transformers [55.21000775547243]
We propose BAT-Fill, an image inpainting framework with a novel bidirectional autoregressive transformer (BAT)
BAT-Fill inherits the merits of transformers and CNNs in a two-stage manner, which allows it to generate high-resolution content without being constrained by the quadratic complexity of attention in transformers.
arXiv Detail & Related papers (2021-04-26T03:52:27Z)
- Deep Attentive Generative Adversarial Network for Photo-Realistic Image De-Quantization [25.805568996596783]
De-quantization can improve the visual quality of a low bit-depth image for display on a high bit-depth screen.
This paper proposes the DAGAN algorithm to perform super-resolution on image intensity resolution.
The DenseResAtt module consists of dense residual blocks equipped with a self-attention mechanism.
arXiv Detail & Related papers (2020-04-07T06:45:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.