Image Anything: Towards Reasoning-coherent and Training-free Multi-modal Image Generation
- URL: http://arxiv.org/abs/2401.17664v1
- Date: Wed, 31 Jan 2024 08:35:40 GMT
- Title: Image Anything: Towards Reasoning-coherent and Training-free Multi-modal Image Generation
- Authors: Yuanhuiyi Lyu, Xu Zheng, Lin Wang
- Abstract summary: ImgAny is a novel end-to-end multi-modal generative model that mimics human reasoning and generates high-quality images.
It is the first method able to efficiently and flexibly take any combination of seven modalities as input.
- Score: 9.573188010530217
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The multifaceted nature of human perception and comprehension indicates that,
when we think, our body can naturally take in any combination of senses, a.k.a.
modalities, and form a beautiful picture in our brain. For example, when we see
a cattery and simultaneously perceive the cat's purring sound, our brain can
construct a picture of a cat in the cattery. Intuitively, generative AI models
should hold the versatility of humans and be capable of generating images from
any combination of modalities efficiently and collaboratively. This paper
presents ImgAny, a novel end-to-end multi-modal generative model that can mimic
human reasoning and generate high-quality images. Our method is the first
capable of efficiently and flexibly taking any combination of seven modalities,
ranging from language and audio to vision modalities, including image, point
cloud, thermal, depth, and event data. Our
key idea is inspired by human-level cognitive processes and involves the
integration and harmonization of multiple input modalities at both the entity
and attribute levels without specific tuning across modalities. Accordingly,
our method brings two novel training-free technical branches: 1) Entity Fusion
Branch ensures the coherence between inputs and outputs. It extracts entity
features from the multi-modal representations powered by our specially
constructed entity knowledge graph; 2) Attribute Fusion Branch adeptly
preserves and processes the attributes. It efficiently amalgamates distinct
attributes from diverse input modalities via our proposed attribute knowledge
graph. Lastly, the entity and attribute features are adaptively fused as the
conditional inputs to the pre-trained Stable Diffusion model for image
generation. Extensive experiments under diverse modality combinations
demonstrate its exceptional capability for visual content creation.
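To make the described flow concrete, the following is a minimal sketch in Python/PyTorch of the training-free pipeline the abstract outlines: per-modality features are fused once at the entity level and once at the attribute level, then adaptively combined into a conditioning vector for a frozen Stable Diffusion model. The encoder stub, the mean/max pooling standing in for the two knowledge-graph branches, and the fusion weight alpha are placeholders introduced for illustration, not the paper's actual components.

    import torch

    EMB_DIM = 768  # assumed to match Stable Diffusion's text-conditioning width

    def encode_modality(name: str, data: torch.Tensor) -> torch.Tensor:
        """Placeholder for a frozen multi-modal encoder that maps any
        modality (text, audio, depth, ...) into a shared embedding space."""
        return torch.randn(EMB_DIM)

    def entity_fusion(embs: dict) -> torch.Tensor:
        """Stand-in for the Entity Fusion Branch: pool the per-modality
        embeddings that would be matched against an entity knowledge graph."""
        return torch.stack(list(embs.values())).mean(dim=0)

    def attribute_fusion(embs: dict) -> torch.Tensor:
        """Stand-in for the Attribute Fusion Branch: keep the most salient
        per-dimension responses, mimicking attribute preservation."""
        return torch.stack(list(embs.values())).max(dim=0).values

    inputs = {
        "text":  torch.zeros(1),   # dummy raw inputs; real ones would be text, audio, depth, ...
        "audio": torch.zeros(1),
        "depth": torch.zeros(1),
    }
    embs = {m: encode_modality(m, x) for m, x in inputs.items()}

    # Adaptive fusion of entity and attribute features (the weight is an assumption).
    alpha = 0.5
    condition = alpha * entity_fusion(embs) + (1 - alpha) * attribute_fusion(embs)

    # `condition` (reshaped to [1, 1, EMB_DIM]) would be fed as the conditioning
    # input to a pre-trained Stable Diffusion UNet; no fine-tuning is involved.
    print(condition.unsqueeze(0).unsqueeze(0).shape)

In the actual method the two branches query purpose-built entity and attribute knowledge graphs; the simple pooling above only marks where those branches would sit.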
Related papers
- One Diffusion to Generate Them All [54.82732533013014] (2024-11-25)
OneDiffusion is a versatile, large-scale diffusion model that supports bidirectional image synthesis and understanding.
It enables conditional generation from inputs such as text, depth, pose, layout, and semantic maps.
OneDiffusion allows for multi-view generation, camera pose estimation, and instant personalization using sequential image inputs.
- Infrared and Visible Image Fusion with Hierarchical Human Perception [45.63854455306689] (2024-09-14)
We introduce an image fusion method, Hierarchical Perception Fusion (HPFusion), which incorporates hierarchical human semantic priors.
We pose multiple questions that humans focus on when viewing an image pair, and answers are generated by a Large Vision-Language Model according to the images.
The answer texts are encoded into the fusion network, and the optimization guides the semantic distribution of the fused image to be closer to that of the source images.
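As a rough illustration of that idea, the sketch below encodes VLM answers and pulls the fused image toward the described semantics. The question list, the vlm_answer, encode_text, and image_embed callables, and the cosine-based alignment loss are placeholders I introduce for illustration, not HPFusion's actual implementation.

    import torch
    import torch.nn.functional as F

    QUESTIONS = [  # hypothetical examples of what humans focus on in an image pair
        "What is the most salient object in the scene?",
        "Which regions contain people or vehicles?",
    ]

    def vlm_answer(question: str, ir: torch.Tensor, vis: torch.Tensor) -> str:
        """Placeholder for querying a Large Vision-Language Model on the image pair."""
        return "a person standing near a parked car"

    def encode_text(text: str) -> torch.Tensor:
        """Placeholder for a frozen text encoder returning a 512-d embedding."""
        return torch.randn(512)

    def image_embed(img: torch.Tensor) -> torch.Tensor:
        """Placeholder for the matching image encoder."""
        return torch.randn(1, 512)

    ir, vis = torch.rand(1, 1, 256, 256), torch.rand(1, 3, 256, 256)
    fused = vis.mean(dim=1, keepdim=True)   # stand-in for the fusion network output

    # Encode the answers and measure how far the fused image is from the described
    # semantics; a standard fusion loss (omitted) would keep it close to the sources.
    text_feats = torch.stack([encode_text(vlm_answer(q, ir, vis)) for q in QUESTIONS])
    semantic_loss = 1 - F.cosine_similarity(image_embed(fused), text_feats).mean()
    print(float(semantic_loss))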
- Equivariant Multi-Modality Image Fusion [124.11300001864579] (2023-05-19)
We propose the Equivariant Multi-Modality imAge fusion paradigm for end-to-end self-supervised learning.
Our approach is rooted in the prior knowledge that natural imaging responses are equivariant to certain transformations.
Experiments confirm that EMMA yields high-quality fusion results for infrared-visible and medical images.
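The equivariance prior can be turned into a self-supervised consistency objective: fusing transformed inputs should match transforming the fused output. The sketch below spells that out with a 90-degree rotation as the transformation and an untrained toy network; it illustrates the principle only, not EMMA's architecture or training setup.

    import torch
    import torch.nn as nn

    # Toy fusion network: concatenate infrared and visible channels, predict a fused image.
    fuse = nn.Sequential(nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 1, 3, padding=1))

    def transform(x: torch.Tensor) -> torch.Tensor:
        """A transformation the imaging response is assumed equivariant to (here: rotation)."""
        return torch.rot90(x, 1, dims=(-2, -1))

    ir  = torch.rand(1, 1, 64, 64)   # infrared input
    vis = torch.rand(1, 1, 64, 64)   # visible input

    fused             = fuse(torch.cat([ir, vis], dim=1))
    fused_transformed = fuse(torch.cat([transform(ir), transform(vis)], dim=1))

    # Equivariance consistency: F(T(ir), T(vis)) should equal T(F(ir, vis)).
    equivariance_loss = (fused_transformed - transform(fused)).abs().mean()
    equivariance_loss.backward()   # self-supervised signal, no fused ground truth needed
    print(float(equivariance_loss))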
- Bi-level Dynamic Learning for Jointly Multi-modality Image Fusion and Beyond [50.556961575275345] (2023-05-11)
We build an image fusion module to fuse complementary characteristics and cascade dual task-related modules.
We develop an efficient first-order approximation to compute corresponding gradients and present dynamic weighted aggregation to balance the gradients for fusion learning.
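The gradient balancing step can be pictured as follows: take plain (first-order) gradients of the fusion loss and the task loss with respect to the shared weights, then combine them with weights that adapt to gradient magnitudes. The toy objectives and the inverse-norm weighting rule are assumptions for illustration, not the paper's exact scheme.

    import torch
    import torch.nn as nn

    net = nn.Linear(8, 8)                     # stand-in for the shared fusion module
    x = torch.rand(4, 8)
    out = net(x)

    loss_fusion = out.pow(2).mean()           # stand-in fusion objective
    loss_task   = (out - 1).abs().mean()      # stand-in downstream task objective

    # First-order treatment: plain gradients of each objective w.r.t. shared weights.
    g_fusion = torch.autograd.grad(loss_fusion, net.weight, retain_graph=True)[0]
    g_task   = torch.autograd.grad(loss_task,   net.weight)[0]

    # Dynamic weighted aggregation (assumed rule): down-weight the dominating gradient.
    w_fusion = 1.0 / (g_fusion.norm() + 1e-8)
    w_task   = 1.0 / (g_task.norm() + 1e-8)
    agg = (w_fusion * g_fusion + w_task * g_task) / (w_fusion + w_task)

    with torch.no_grad():
        net.weight -= 0.1 * agg               # one aggregated update step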
- SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation [89.47132156950194] (2022-12-08)
We present a novel framework built to simplify 3D asset generation for amateur users.
Our method supports a variety of input modalities that can be easily provided by a human.
Our model can combine all these tasks into one swiss-army-knife tool.
- TransFuse: A Unified Transformer-based Image Fusion Framework using Self-supervised Learning [5.849513679510834] (2022-01-19)
Image fusion is a technique to integrate information from multiple source images with complementary information to improve the richness of a single image.
Two-stage methods avoid the need for large amounts of task-specific training data by training an encoder-decoder network on large natural image datasets.
We propose a destruction-reconstruction based self-supervised training scheme to encourage the network to learn task-specific features.
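A minimal sketch of one destruction-reconstruction self-supervised step in the spirit of this summary: randomly mask patches of a natural image and train an encoder-decoder to restore them. The patch size, masking ratio, and toy convolutional network are assumptions, not TransFuse's transformer backbone.

    import torch
    import torch.nn as nn

    # Toy encoder-decoder standing in for the transformer-based fusion backbone.
    autoencoder = nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 3, 3, padding=1),
    )
    opt = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)

    img = torch.rand(8, 3, 64, 64)   # batch of natural images, no fusion labels needed

    # "Destruction": zero out random 8x8 patches (a simple stand-in for the paper's scheme).
    mask = (torch.rand(8, 1, 8, 8) > 0.5).float()
    mask = nn.functional.interpolate(mask, size=(64, 64), mode="nearest")
    destroyed = img * mask

    # "Reconstruction": the network must restore the missing content from context.
    recon = autoencoder(destroyed)
    loss = (recon - img).abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(float(loss))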
- TVDIM: Enhancing Image Self-Supervised Pretraining via Noisy Text Data [13.68491474904529] (2021-06-03)
We propose Text-enhanced Visual Deep InfoMax (TVDIM) to learn better visual representations.
Our core idea of self-supervised learning is to maximize the mutual information between features extracted from multiple views.
TVDIM significantly outperforms previous visual self-supervised methods when processing the same set of images.
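Maximizing mutual information between views is commonly done with an InfoNCE-style contrastive bound; the sketch below pairs image features with noisy-text features that way. The stub encoders and the exact pairing of "views" are my simplification of TVDIM, not its published objective.

    import torch
    import torch.nn.functional as F

    def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
        """InfoNCE lower bound on the mutual information between paired views a[i] <-> b[i]."""
        a, b = F.normalize(a, dim=1), F.normalize(b, dim=1)
        logits = a @ b.t() / temperature        # similarity of every image/text pair
        targets = torch.arange(a.size(0))       # the matching index is the positive
        return F.cross_entropy(logits, targets)

    batch = 16
    img_feats  = torch.randn(batch, 256, requires_grad=True)  # stub visual encoder output
    text_feats = torch.randn(batch, 256)                       # stub noisy-text encoder output

    loss = info_nce(img_feats, text_feats)
    loss.backward()   # pushes each image embedding toward its paired (noisy) text embedding
    print(float(loss))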
- IMAGINE: Image Synthesis by Image-Guided Model Inversion [79.4691654458141] (2021-04-13)
We introduce an inversion based method, denoted as IMAge-Guided model INvErsion (IMAGINE), to generate high-quality and diverse images.
We leverage the knowledge of image semantics from a pre-trained classifier to achieve plausible generations.
IMAGINE enables the synthesis procedure to simultaneously 1) enforce semantic specificity constraints during the synthesis, 2) produce realistic images without generator training, and 3) give users intuitive control over the generation process.
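Generator-free synthesis by inversion can be sketched as optimizing raw pixels so a frozen pre-trained classifier assigns them the semantics of a reference image. The logit-matching loss, the total-variation regularizer, and the toy classifier below are illustrative assumptions, not IMAGINE's exact objectives.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # stand-in frozen classifier
    for p in classifier.parameters():
        p.requires_grad_(False)

    reference = torch.rand(1, 3, 32, 32)            # the guiding image
    target_logits = classifier(reference)

    synth = torch.rand(1, 3, 32, 32, requires_grad=True)   # pixels to optimize, no generator
    opt = torch.optim.Adam([synth], lr=0.05)

    for step in range(200):
        opt.zero_grad()
        semantic = (classifier(synth) - target_logits).pow(2).mean()   # match reference semantics
        tv = (synth[..., 1:, :] - synth[..., :-1, :]).abs().mean() \
           + (synth[..., :, 1:] - synth[..., :, :-1]).abs().mean()     # smoothness prior
        loss = semantic + 0.1 * tv
        loss.backward()
        opt.step()

    print(float(loss))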
- Multimodal Face Synthesis from Visual Attributes [85.87796260802223] (2021-04-09)
We propose a novel generative adversarial network that simultaneously synthesizes identity preserving multimodal face images.
Multimodal stretch-in modules are introduced in the discriminator, which discriminates between real and fake images.
- AE-Net: Autonomous Evolution Image Fusion Method Inspired by Human Cognitive Mechanism [34.57055312296812] (2020-07-17)
We propose a robust and general image fusion method with autonomous evolution ability, denoted as AE-Net.
By collaboratively optimizing multiple image fusion methods to simulate the cognitive process of the human brain, the unsupervised image fusion task can be transformed into a semi-supervised or fully supervised one.
The method effectively unifies cross-modal and same-modal image fusion tasks and overcomes differences in data distribution between datasets.
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.