ReCorD: Reasoning and Correcting Diffusion for HOI Generation
- URL: http://arxiv.org/abs/2407.17911v1
- Date: Thu, 25 Jul 2024 10:06:26 GMT
- Title: ReCorD: Reasoning and Correcting Diffusion for HOI Generation
- Authors: Jian-Yu Jiang-Lin, Kang-Yang Huang, Ling Lo, Yi-Ning Huang, Terence Lin, Jhih-Ciang Wu, Hong-Han Shuai, Wen-Huang Cheng
- Abstract summary: We introduce Reasoning and Correcting Diffusion (ReCorD) to address these challenges.
Our model couples Latent Diffusion Models with Visual Language Models to refine the generation process.
We conduct comprehensive experiments on three benchmarks to demonstrate significant progress on text-to-image generation tasks.
- Score: 26.625822483049426
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models revolutionize image generation by leveraging natural language to guide the creation of multimedia content. Despite significant advancements in such generative models, challenges persist in depicting detailed human-object interactions, especially regarding pose and object placement accuracy. We introduce a training-free method named Reasoning and Correcting Diffusion (ReCorD) to address these challenges. Our model couples Latent Diffusion Models with Visual Language Models to refine the generation process, ensuring precise depictions of HOIs. We propose an interaction-aware reasoning module to improve the interpretation of the interaction, along with an interaction correcting module that delicately refines the output image for more precise HOI generation. Through a meticulous process of pose selection and object positioning, ReCorD achieves superior fidelity in generated images while efficiently reducing computational requirements. We conduct comprehensive experiments on three benchmarks to demonstrate significant progress on text-to-image generation tasks, showcasing ReCorD's ability to render complex interactions accurately by outperforming existing methods in HOI classification score, as well as FID and Verb CLIP-Score. Project website is available at https://alberthkyhky.github.io/ReCorD/ .
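The abstract's reason-then-correct loop can be pictured as follows. This is a minimal structural sketch, not the paper's released code: `generate_candidates`, `vlm_reason`, and `correct` are hypothetical stand-ins for the latent-diffusion sampler, the interaction-aware reasoning module, and the interaction correcting module.

```python
# Minimal structural sketch of a ReCorD-style reason-and-correct loop.
# All three callables are hypothetical stand-ins; real use would wire in
# an LDM for sampling and a VLM for interaction reasoning.

def generate_candidates(prompt, n=4):
    # Stand-in for latent-diffusion sampling of n candidate poses/layouts.
    return [{"prompt": prompt, "seed": i, "layout": None} for i in range(n)]

def vlm_reason(prompt, candidate):
    # Stand-in for the interaction-aware reasoning module: a VLM scores how
    # well the depicted pose and object placement match the described HOI.
    return 0.5  # plausibility score in [0, 1]

def correct(candidate, feedback):
    # Stand-in for the interaction correcting module: nudge object position
    # toward the interaction described in the prompt.
    candidate = dict(candidate)
    candidate["layout"] = feedback
    return candidate

def record_style_generate(prompt, rounds=2, threshold=0.8):
    candidates = generate_candidates(prompt)
    best = max(candidates, key=lambda c: vlm_reason(prompt, c))
    for _ in range(rounds):
        if vlm_reason(prompt, best) >= threshold:
            break
        best = correct(best, feedback={"move_object": "toward_hands"})
    return best

print(record_style_generate("a person riding a skateboard"))
```

In the actual method, the VLM's feedback drives pose selection and object positioning rather than the generic layout nudge shown here.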
Related papers
- DILLEMA: Diffusion and Large Language Models for Multi-Modal Augmentation [0.13124513975412253]
We present a novel framework for testing vision neural networks that leverages Large Language Models and control-conditioned Diffusion Models.
Our approach begins by translating images into detailed textual descriptions using a captioning model.
These descriptions are then used to produce new test images through a text-to-image diffusion process.
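A minimal sketch of this caption-vary-regenerate pipeline, assuming three hypothetical stand-in calls (`caption`, `vary`, `text_to_image`) for the captioning model, the language model, and the control-conditioned diffusion model:

```python
# Hedged sketch of a DILLEMA-style augmentation loop: caption an image,
# have a language model propose a semantics-preserving variation, then
# synthesize a new test image from that text. All three model calls are
# hypothetical stand-ins.

def caption(image):
    return "a red sedan on a wet highway at dusk"  # stand-in captioner

def vary(description):
    # Stand-in LLM edit: change the context, not the label-relevant content.
    return description.replace("wet highway", "snowy mountain road")

def text_to_image(description):
    return {"pixels": None, "prompt": description}  # stand-in diffusion call

def make_test_case(image, label):
    new_image = text_to_image(vary(caption(image)))
    return new_image, label  # the label is preserved by construction

img, lbl = make_test_case(image=None, label="car")
print(img["prompt"], lbl)
```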
arXiv Detail & Related papers (2025-02-05T16:35:42Z)
- Visual Autoregressive Modeling for Image Super-Resolution [14.935662351654601]
We propose a novel visual autoregressive modeling framework for ISR in the form of next-scale prediction.
We collect large-scale data and design a training process to obtain robust generative priors.
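A toy illustration of next-scale prediction, where each step conditions on the coarser map and emits the next, finer one; the one-layer ConvNet is an illustrative placeholder, not the paper's architecture:

```python
import torch

# Toy next-scale autoregression: upsample the coarsest map so far,
# then predict the residual detail of the next, finer scale.

class NextScalePredictor(torch.nn.Module):
    def __init__(self, channels=8):
        super().__init__()
        self.net = torch.nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, coarse):
        up = torch.nn.functional.interpolate(coarse, scale_factor=2, mode="nearest")
        return up + self.net(up)  # coarse structure + predicted detail

model = NextScalePredictor()
x = torch.randn(1, 8, 4, 4)   # scales: 4x4 -> 8x8 -> 16x16 -> 32x32
for _ in range(3):
    x = model(x)
print(x.shape)  # torch.Size([1, 8, 32, 32])
```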
arXiv Detail & Related papers (2025-01-31T09:53:47Z)
- EditAR: Unified Conditional Generation with Autoregressive Models [58.093860528672735]
We propose EditAR, a single unified autoregressive framework for a variety of conditional image generation tasks.
The model takes both images and instructions as inputs, and predicts the edited image's tokens in a vanilla next-token paradigm.
We evaluate its effectiveness across diverse tasks on established benchmarks, showing competitive performance to various state-of-the-art task-specific methods.
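A minimal sketch of that vanilla next-token paradigm: source-image tokens and instruction tokens form one context, and the edited image's tokens are decoded greedily one at a time. The vocabulary size and tiny Transformer layer are illustrative, and the causal mask is omitted for brevity:

```python
import torch

# Toy next-token conditional decoding: context = image tokens + instruction
# tokens; the edited image's tokens are appended one greedy step at a time.

VOCAB, DIM = 1024, 64
embed = torch.nn.Embedding(VOCAB, DIM)
layer = torch.nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
head = torch.nn.Linear(DIM, VOCAB)

def decode(image_tokens, instruction_tokens, out_len=8):
    ctx = torch.cat([image_tokens, instruction_tokens], dim=1)
    out = []
    for _ in range(out_len):
        h = layer(embed(ctx))
        nxt = head(h[:, -1]).argmax(-1, keepdim=True)  # greedy next token
        out.append(nxt)
        ctx = torch.cat([ctx, nxt], dim=1)
    return torch.cat(out, dim=1)

print(decode(torch.randint(0, VOCAB, (1, 16)),
             torch.randint(0, VOCAB, (1, 6))).shape)  # torch.Size([1, 8])
```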
arXiv Detail & Related papers (2025-01-08T18:59:35Z)
- Distillation of Diffusion Features for Semantic Correspondence [23.54555663670558]
We propose a novel knowledge distillation technique to overcome the problem of reduced efficiency.
We show how to use two large vision foundation models and distill the capabilities of these complementary models into one smaller model that maintains high accuracy at reduced computational cost.
Our empirical results demonstrate that our distilled model with 3D data augmentation achieves performance superior to current state-of-the-art methods while significantly reducing computational load and enhancing practicality for real-world applications, such as semantic video correspondence.
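A hedged sketch of the two-teacher distillation idea: a small student regresses onto the concatenated features of two frozen teachers. All three networks below are toy placeholders standing in for the vision foundation models:

```python
import torch

# Toy two-teacher feature distillation: the student learns to reproduce the
# concatenated features of two frozen (complementary) teachers.

teacher_a = torch.nn.Conv2d(3, 32, 1)   # stand-in: diffusion-feature extractor
teacher_b = torch.nn.Conv2d(3, 32, 1)   # stand-in: second foundation model
student = torch.nn.Conv2d(3, 64, 1)     # one smaller model, combined targets
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(100):
    x = torch.randn(8, 3, 16, 16)
    with torch.no_grad():  # teachers stay frozen
        target = torch.cat([teacher_a(x), teacher_b(x)], dim=1)
    loss = torch.nn.functional.mse_loss(student(x), target)
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```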
arXiv Detail & Related papers (2024-12-04T17:55:33Z)
- MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling [64.09238330331195]
We propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework.
Unlike discretization-based methods, MMAR takes in continuous-valued image tokens to avoid information loss.
We show that MMAR demonstrates superior performance compared to other joint multi-modal models.
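One way to picture continuous-valued token modeling, assuming a diagonal-Gaussian prediction head (the toy linear predictor below is not MMAR's architecture): the model maximizes the likelihood of the next continuous token instead of classifying over a discrete codebook:

```python
import torch

# Toy continuous-token objective: predict a diagonal Gaussian over the next
# token and minimize its negative log-likelihood, with no quantization step.

DIM = 16
predictor = torch.nn.Linear(DIM, 2 * DIM)  # outputs mean and log-variance

def nll(prev_token, next_token):
    mu, logvar = predictor(prev_token).chunk(2, dim=-1)
    # Gaussian NLL of the observed continuous token (constants dropped)
    return 0.5 * (((next_token - mu) ** 2) / logvar.exp() + logvar).sum(-1).mean()

loss = nll(torch.randn(4, DIM), torch.randn(4, DIM))
loss.backward()
print(float(loss))
```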
arXiv Detail & Related papers (2024-10-14T17:57:18Z)
- Iterative Object Count Optimization for Text-to-image Diffusion Models [59.03672816121209]
Current models, which learn from image-text pairs, inherently struggle with counting.
We propose optimizing the generated image based on a counting loss derived from a counting model that aggregates an object's potential.
We evaluate the generation of various objects and show significant improvements in accuracy.
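A minimal sketch of the iterative loop, assuming a toy differentiable heatmap model in place of a real counting model: the latent is updated by gradient descent until a soft count matches the requested number:

```python
import torch

# Toy count optimization: a differentiable proxy aggregates per-object
# "potential" from a heatmap, and the latent is optimized until the soft
# count hits the target. The linear heatmap model is a stand-in.

latent = torch.randn(1, 8, requires_grad=True)
to_heatmap = torch.nn.Linear(8, 64)  # stand-in: latent -> 8x8 object heatmap
opt = torch.optim.Adam([latent], lr=0.1)
target_count = 3.0

for step in range(200):
    heat = torch.sigmoid(to_heatmap(latent)).view(8, 8)
    # scale-free soft count: total activation relative to the strongest peak
    soft_count = heat.sum() / heat.max().clamp_min(1e-6)
    loss = (soft_count - target_count) ** 2
    opt.zero_grad(); loss.backward(); opt.step()
print(float(soft_count))
```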
arXiv Detail & Related papers (2024-08-21T15:51:46Z)
- Towards Better Text-to-Image Generation Alignment via Attention Modulation [16.020834525343997]
We propose an attribution-focusing mechanism: a training-free, phase-wise mechanism that modulates attention in diffusion models.
An object-focused masking scheme and a phase-wise dynamic weight control mechanism are integrated into the cross-attention modules.
The experimental results in various alignment scenarios demonstrate that our model attains better image-text alignment with minimal additional computational cost.
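A sketch of training-free attention modulation under illustrative shapes: the cross-attention logits of the object's prompt token are boosted inside an object mask, with a weight that decays across denoising phases:

```python
import torch

# Toy phase-wise cross-attention modulation: boost one prompt token's logits
# inside an object mask, more strongly in early denoising steps.

def modulated_attention(q, k, mask, token_idx, step, total_steps):
    logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # (pixels, tokens)
    weight = 2.0 * (1 - step / total_steps)  # strong early, fades late
    logits[:, token_idx] += weight * mask    # focus the object's token
    return logits.softmax(dim=-1)

q = torch.randn(64, 32)                # 8x8 latent pixels, dim 32
k = torch.randn(10, 32)                # 10 prompt tokens
mask = (torch.rand(64) > 0.7).float()  # stand-in object-focused mask
attn = modulated_attention(q, k, mask, token_idx=4, step=5, total_steps=50)
print(attn.shape)  # torch.Size([64, 10])
```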
arXiv Detail & Related papers (2024-04-22T06:18:37Z)
- RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model [93.8067369210696]
Text-to-image generation (TTI) refers to the usage of models that process text input and generate high-fidelity images based on text descriptions.
Diffusion models are one prominent type of generative model, producing images through the systematic introduction and removal of noise over repeated steps.
In the era of large models, scaling up model size and integrating with large language models have further improved the performance of TTI models.
arXiv Detail & Related papers (2023-09-02T03:27:20Z)
- Composing Ensembles of Pre-trained Models via Iterative Consensus [95.10641301155232]
We propose a unified framework for composing ensembles of different pre-trained models.
We use pre-trained models as "generators" or "scorers" and compose them via closed-loop iterative consensus optimization.
We demonstrate that consensus achieved by an ensemble of scorers outperforms the feedback of a single scorer.
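A toy rendering of the closed-loop consensus idea, with quadratic stand-in scorers instead of pre-trained models: the generator's proposal is refined by gradient ascent on the scorers' summed score:

```python
import torch

# Toy iterative consensus: a "generator" proposal is updated until an
# ensemble of "scorers" jointly approves it. Quadratic scorers stand in
# for pre-trained models.

targets = [torch.tensor([1.0, 0.0]), torch.tensor([0.8, 0.2])]
scorers = [lambda x, t=t: -((x - t) ** 2).sum() for t in targets]

x = torch.zeros(2, requires_grad=True)  # generator's current proposal
opt = torch.optim.SGD([x], lr=0.1)
for _ in range(100):
    consensus = sum(s(x) for s in scorers)     # ensemble, not a single scorer
    opt.zero_grad(); (-consensus).backward(); opt.step()
print(x.detach())  # settles between the scorers' preferred points
```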
arXiv Detail & Related papers (2022-10-20T18:46:31Z)
- DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit, dedicated to covering the variation in the optimal number of tokens each position should attend to.
Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
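A sketch of the dynamic-sparsity idea, with a per-query threshold standing in for the learned dynamic-attention unit: each query attends only to exemplar tokens whose similarity clears its own threshold, so the size of the attended set varies by position:

```python
import torch

# Toy dynamic sparse attention: keep, per query, only the exemplar tokens
# above that query's mean similarity; the kept set size differs by position.

def dynamic_sparse_attention(q, k, v):
    sim = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    keep = sim >= sim.mean(dim=-1, keepdim=True)  # dynamic, per-query set
    sim = sim.masked_fill(~keep, float("-inf"))
    return sim.softmax(dim=-1) @ v

q, k, v = torch.randn(16, 8), torch.randn(32, 8), torch.randn(32, 8)
print(dynamic_sparse_attention(q, k, v).shape)  # torch.Size([16, 8])
```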
arXiv Detail & Related papers (2022-07-13T11:12:03Z)
- Incorporating Reinforced Adversarial Learning in Autoregressive Image Generation [39.55651747758391]
We propose to use Reinforced Adversarial Learning (RAL) based on policy gradient optimization for autoregressive models.
RAL also empowers the collaboration between different modules of the VQ-VAE framework.
The proposed method achieves state-of-the-art results on CelebA for $64 \times 64$ image resolution.
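A minimal REINFORCE-style sketch of the idea, with a fixed toy reward standing in for the discriminator that RAL would use to score sampled sequences:

```python
import torch

# Toy reinforced learning for an autoregressive sampler: sample a token
# sequence, score it with a reward (a discriminator in RAL; a fixed toy
# reward here), and apply the REINFORCE policy gradient.

VOCAB, LEN = 32, 6
logits_net = torch.nn.Linear(LEN, VOCAB)  # toy policy over next tokens
opt = torch.optim.Adam(logits_net.parameters(), lr=1e-2)

def reward(tokens):  # stand-in for a discriminator's realism score
    return tokens.float().mean() / VOCAB

for step in range(200):
    state = torch.randn(1, LEN)
    dist = torch.distributions.Categorical(logits=logits_net(state))
    tokens = dist.sample((LEN,))           # sample a token sequence
    log_prob = dist.log_prob(tokens).sum()
    loss = -reward(tokens) * log_prob      # REINFORCE estimator
    opt.zero_grad(); loss.backward(); opt.step()
```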
arXiv Detail & Related papers (2020-07-20T08:10:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.