Prompt-Propose-Verify: A Reliable Hand-Object-Interaction Data
Generation Framework using Foundational Models
- URL: http://arxiv.org/abs/2312.15247v1
- Date: Sat, 23 Dec 2023 12:59:22 GMT
- Title: Prompt-Propose-Verify: A Reliable Hand-Object-Interaction Data
Generation Framework using Foundational Models
- Authors: Gurusha Juneja and Sukrit Kumar
- Abstract summary: Diffusion models, when conditioned on text prompts, generate realistic-looking images with intricate details.
However, most of these pre-trained models fail to generate accurate images of human features such as hands and teeth.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion models, when conditioned on text prompts, generate realistic-looking
images with intricate details. However, most of these pre-trained models fail to
generate accurate images of human features such as hands and teeth.
We hypothesize that this limitation of diffusion models can be overcome
with well-annotated, good-quality data. In this paper, we look specifically
at improving hand-object-interaction image generation using diffusion
models. We collect a well-annotated synthetic hand-object-interaction dataset
curated using the Prompt-Propose-Verify framework and fine-tune a Stable Diffusion
model on it. We evaluate the image-text dataset on qualitative and quantitative
metrics such as CLIPScore, ImageReward, fidelity, and alignment, and show
considerably better performance than current state-of-the-art benchmarks.
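As a rough illustration of one metric named in the abstract, the sketch below computes CLIPScore as it is commonly defined: a rescaled cosine similarity between an image embedding and a caption embedding, clipped at zero, with rescaling weight w = 2.5. It assumes the embeddings have already been produced by a CLIP model; the function names and the toy vectors are hypothetical and are not taken from the paper, whose exact evaluation setup is not described here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def clip_score(image_emb, text_emb, w=2.5):
    """CLIPScore-style value: w * max(cos(image, text), 0).

    image_emb and text_emb are assumed to be precomputed CLIP embeddings;
    computing them (e.g. with a pretrained CLIP model) is out of scope here.
    """
    return w * max(cosine_similarity(image_emb, text_emb), 0.0)


if __name__ == "__main__":
    # Toy vectors standing in for real CLIP embeddings (hypothetical values).
    image_embedding = [0.2, 0.7, 0.1]
    caption_embedding = [0.25, 0.6, 0.05]
    print(clip_score(image_embedding, caption_embedding))
```

A higher value indicates that the caption and image embeddings point in a more similar direction; negatively correlated pairs are clipped to zero rather than reported as negative scores.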
Related papers
- Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models.
We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space.
These learned relation embeddings then serve as textual prompts to steer diffusion models to generate images that depict specific interactions.
arXiv Detail & Related papers (2024-10-26T12:00:33Z)
- Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models [23.786473791344395]
Cross-attention layers in diffusion models tend to disproportionately focus on certain tokens during the generation process.
We introduce attention regulation, an on-the-fly optimization approach at inference time to align attention maps with the input text prompt.
Experimental results show that our method consistently outperforms other baselines.
arXiv Detail & Related papers (2024-03-11T02:18:27Z)
- Learned representation-guided diffusion models for large-image generation [58.192263311786824]
We introduce a novel approach that trains diffusion models conditioned on embeddings from self-supervised learning (SSL).
Our diffusion models successfully project these features back to high-quality histopathology and remote sensing images.
Augmenting real data by generating variations of real images improves downstream accuracy for patch-level and larger, image-scale classification tasks.
arXiv Detail & Related papers (2023-12-12T14:45:45Z)
- Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis [62.07413805483241]
Steered Diffusion is a framework for zero-shot conditional image generation using a diffusion model trained for unconditional generation.
We present experiments using steered diffusion on several tasks including inpainting, colorization, text-guided semantic editing, and image super-resolution.
arXiv Detail & Related papers (2023-09-30T02:03:22Z)
- Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners [88.07317175639226]
We propose a novel approach, Discriminative Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners.
Our approach mainly uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information.
arXiv Detail & Related papers (2023-05-18T05:41:36Z)
- Generating images of rare concepts using pre-trained diffusion models [32.5337654536764]
Text-to-image diffusion models can synthesize high-quality images, but they have various limitations.
We show that their limitation is partly due to the long-tail nature of their training data.
We show that rare concepts can be correctly generated by carefully selecting suitable generation seeds in the noise space.
arXiv Detail & Related papers (2023-04-27T20:55:38Z)
- Masked Images Are Counterfactual Samples for Robust Fine-tuning [77.82348472169335]
Fine-tuning deep learning models can lead to a trade-off between in-distribution (ID) performance and out-of-distribution (OOD) robustness.
We propose a novel fine-tuning method, which uses masked images as counterfactual samples that help improve the robustness of the fine-tuning model.
arXiv Detail & Related papers (2023-03-06T11:51:28Z)
- Person Image Synthesis via Denoising Diffusion Model [116.34633988927429]
We show how denoising diffusion models can be applied for high-fidelity person image synthesis.
Our results on two large-scale benchmarks and a user study demonstrate the photorealism of our proposed approach under challenging scenarios.
arXiv Detail & Related papers (2022-11-22T18:59:50Z)
- GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models [16.786221846896108]
We explore diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies: CLIP guidance and classifier-free guidance.
We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples.
Our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing.
arXiv Detail & Related papers (2021-12-20T18:42:55Z)