UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics
- URL: http://arxiv.org/abs/2412.07774v2
- Date: Wed, 11 Dec 2024 22:51:08 GMT
- Title: UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics
- Authors: Xi Chen, Zhifei Zhang, He Zhang, Yuqian Zhou, Soo Ye Kim, Qing Liu, Yijun Li, Jianming Zhang, Nanxuan Zhao, Yilin Wang, Hui Ding, Zhe Lin, Hengshuang Zhao
- Abstract summary: UniReal is a unified framework designed to address various image generation and editing tasks. Inspired by recent video generation models, we propose a unifying approach that treats image-level tasks as discontinuous video generation. Although designed for image-level tasks, we leverage videos as a scalable source for universal supervision.
- Score: 74.10447111842504
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce UniReal, a unified framework designed to address various image generation and editing tasks. Existing solutions often vary by tasks, yet share fundamental principles: preserving consistency between inputs and outputs while capturing visual variations. Inspired by recent video generation models that effectively balance consistency and variation across frames, we propose a unifying approach that treats image-level tasks as discontinuous video generation. Specifically, we treat varying numbers of input and output images as frames, enabling seamless support for tasks such as image generation, editing, customization, composition, etc. Although designed for image-level tasks, we leverage videos as a scalable source for universal supervision. UniReal learns world dynamics from large-scale videos, demonstrating advanced capability in handling shadows, reflections, pose variation, and object interaction, while also exhibiting emergent capability for novel applications.
Related papers
- VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning [68.98988753763666]
We propose VisualCloze, a universal image generation framework.
VisualCloze supports a wide range of in-domain tasks, generalization to unseen ones, unseen unification of multiple tasks, and reverse generation.
We introduce Graph200K, a graph-structured dataset that establishes various interrelated tasks, enhancing task density and transferable knowledge.
arXiv Detail & Related papers (2025-04-10T17:59:42Z) - RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models [22.042487298092883]
RealGeneral is a novel framework that reformulates image generation as a conditional frame prediction task.
It achieves a 14.5% improvement in subject similarity for customized generation and a 10% enhancement in image quality for the canny-to-image task.
arXiv Detail & Related papers (2025-03-13T14:31:52Z) - Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks [25.39030226963548]
We introduce the first application of a pretrained transformer-based video generative model for portrait animation. Our method is validated through experiments on benchmark and newly proposed wild datasets.
arXiv Detail & Related papers (2024-12-01T08:54:30Z) - Synergizing Motion and Appearance: Multi-Scale Compensatory Codebooks for Talking Head Video Generation [15.233839480474206]
Talking head video generation aims to generate a realistic talking head video that preserves the person's identity from a source image and the motion from a driving video. Despite the promising progress made in the field, it remains a challenging and critical problem to generate videos with accurate poses and fine-grained facial details simultaneously. We propose to jointly learn motion and appearance codebooks and perform multi-scale codebook compensation to effectively refine both the facial motion conditions and appearance features.
arXiv Detail & Related papers (2024-12-01T07:54:07Z) - Transforming Static Images Using Generative Models for Video Salient Object Detection [15.701293552584863]
We show that image-to-video diffusion models can generate realistic transformations of static images while understanding the contextual relationships between image components.
This ability allows the model to generate plausible optical flows, preserving semantic integrity while reflecting the independent motion of scene elements.
Our approach achieves state-of-the-art performance across all public benchmark datasets, outperforming existing approaches.
arXiv Detail & Related papers (2024-11-21T09:41:33Z) - Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing [150.0380447353081]
We present VITRON, a universal pixel-level vision LLM designed for comprehensive understanding, generating, segmenting, and editing of both static images and dynamic videos.
Building on top of an LLM, VITRON incorporates encoders for images, videos, and pixel-level regional visuals within its modules, while employing state-of-the-art visual specialists as its backend.
arXiv Detail & Related papers (2024-10-08T08:39:04Z) - Multi-object Video Generation from Single Frame Layouts [84.55806837855846]
We propose a video generative framework capable of synthesizing global scenes with local objects.
Our framework is a non-trivial adaptation from image generation methods, and is new to this field.
Our model has been evaluated on two widely-used video recognition benchmarks.
arXiv Detail & Related papers (2023-05-06T09:07:01Z) - Learning Universal Policies via Text-Guided Video Generation [179.6347119101618]
A goal of artificial intelligence is to construct an agent that can solve a wide variety of tasks.
Recent progress in text-guided image synthesis has yielded models with an impressive ability to generate complex novel images.
We investigate whether such tools can be used to construct more general-purpose agents.
arXiv Detail & Related papers (2023-01-31T21:28:13Z) - Plug-and-Play Diffusion Features for Text-Driven Image-to-Image
Translation [10.39028769374367]
We present a new framework that takes text-to-image synthesis to the realm of image-to-image translation.
Our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text.
arXiv Detail & Related papers (2022-11-22T20:39:18Z) - Self-Supervised Representation Learning from Flow Equivariance [97.13056332559526]
We present a new self-supervised learning representation framework that can be directly deployed on a video stream of complex scenes.
Our representations, learned from high-resolution raw video, can be readily used for downstream tasks on static images.
arXiv Detail & Related papers (2021-01-16T23:44:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.