Teleportraits: Training-Free People Insertion into Any Scene
- URL: http://arxiv.org/abs/2510.05660v1
- Date: Tue, 07 Oct 2025 08:12:57 GMT
- Title: Teleportraits: Training-Free People Insertion into Any Scene
- Authors: Jialu Gao, K J Joseph, Fernando De La Torre
- Abstract summary: We introduce a unified training-free pipeline that leverages pre-trained text-to-image diffusion models. We show that diffusion models inherently possess the knowledge to place people in complex scenes without requiring task-specific training. Our method achieves affordance-aware global editing, seamlessly inserting people into scenes.
- Score: 59.76038137014233
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of realistically inserting a human from a reference image into a background scene is highly challenging, requiring the model to (1) determine the correct location and poses of the person and (2) perform high-quality personalization conditioned on the background. Previous approaches often treat them as separate problems, overlooking their interconnections, and typically rely on training to achieve high performance. In this work, we introduce a unified training-free pipeline that leverages pre-trained text-to-image diffusion models. We show that diffusion models inherently possess the knowledge to place people in complex scenes without requiring task-specific training. By combining inversion techniques with classifier-free guidance, our method achieves affordance-aware global editing, seamlessly inserting people into scenes. Furthermore, our proposed mask-guided self-attention mechanism ensures high-quality personalization, preserving the subject's identity, clothing, and body features from just a single reference image. To the best of our knowledge, we are the first to perform realistic human insertions into scenes in a training-free manner and achieve state-of-the-art results in diverse composite scene images with excellent identity preservation in backgrounds and subjects.
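The abstract names a mask-guided self-attention mechanism but does not spell it out. One plausible reading, sketched below in PyTorch, is that the target image's tokens attend both to themselves and to reference-image tokens restricted to the subject mask, so identity details flow from the single reference image. The function name, tensor shapes, and masking scheme are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: the target image's tokens attend to themselves and
# to the reference image's tokens, but reference tokens outside the subject
# mask are blocked, so only the person's appearance is transferred.
import torch

def mask_guided_self_attention(q, k_tgt, v_tgt, k_ref, v_ref, ref_mask):
    """q, k_tgt, v_tgt: (B, N, d) target projections; k_ref, v_ref: (B, M, d)
    reference projections; ref_mask: (B, M) bool, True on subject tokens."""
    scale = q.shape[-1] ** -0.5
    k = torch.cat([k_tgt, k_ref], dim=1)            # (B, N + M, d)
    v = torch.cat([v_tgt, v_ref], dim=1)
    scores = (q @ k.transpose(-2, -1)) * scale      # (B, N, N + M)
    # Target tokens stay visible; reference tokens only inside the mask.
    keep = torch.cat(
        [torch.ones_like(ref_mask[:, :1]).expand(-1, k_tgt.shape[1]), ref_mask],
        dim=1)                                      # (B, N + M)
    scores = scores.masked_fill(~keep[:, None, :], float("-inf"))
    return scores.softmax(dim=-1) @ v               # (B, N, d)
```

Blocking attention to off-subject reference tokens is what would keep the reference image's background from leaking into the composite.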
Related papers
- From Wardrobe to Canvas: Wardrobe Polyptych LoRA for Part-level Controllable Human Image Generation [44.46447676191666]
We present Wardrobe Polyptych LoRA, a part-level controllable model for personalized human image generation. By training only LoRA layers, our method removes the computational burden at inference while ensuring high-fidelity synthesis of unseen subjects. Our approach significantly outperforms existing techniques in fidelity and consistency, enabling realistic and identity-preserving full-body synthesis.
arXiv Detail & Related papers (2025-07-14T12:34:25Z)
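The entry above rests on training only LoRA layers. As background, a minimal LoRA wrapper around a frozen linear layer looks roughly like this; it is a generic sketch, not the Wardrobe Polyptych code, and the rank/alpha defaults are arbitrary.

```python
# Generic LoRA sketch: the pretrained weight is frozen and only the
# low-rank A/B factors are trained; after training, B @ A can be merged
# into the base weight, so inference costs nothing extra.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)        # freeze the pretrained layer
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # y = W x + (alpha / r) * B A x; B starts at zero, so the wrapped
        # layer initially reproduces the pretrained model exactly.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```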
- Person-In-Situ: Scene-Consistent Human Image Insertion with Occlusion-Aware Pose Control [1.529342790344802]
Existing methods cannot handle occlusion of the inserted person by foreground objects and unnaturally place the person in the frontmost layer. We propose two methods to address these challenges. Both allow explicit pose control via a 3D body model and leverage latent diffusion models to synthesize the person at a contextually appropriate depth.
arXiv Detail & Related papers (2025-05-07T01:47:15Z)
- Learning Complex Non-Rigid Image Edits from Multimodal Conditioning [18.500715348636582]
We focus on inserting a given human (specifically, a single image of a person) into a novel scene. Our method, which builds on top of Stable Diffusion, yields natural-looking images while being highly controllable with text and pose. We demonstrate that identity preservation is a more challenging task in scenes "in-the-wild", and especially scenes where there is an interaction between persons and objects.
arXiv Detail & Related papers (2024-12-13T15:41:08Z)
- Text2Place: Affordance-aware Text Guided Human Placement [26.041917073228483]
This work tackles the problem of realistic human insertion in a given background scene, termed Semantic Human Placement.
For learning semantic masks, we leverage rich object-scene priors learned from the text-to-image generative models.
The proposed method can generate highly realistic scene compositions while preserving the background and subject identity.
arXiv Detail & Related papers (2024-07-22T08:00:06Z)
- Training-Free Consistent Text-to-Image Generation [80.4814768762066]
Text-to-image models struggle to consistently portray the same subject across diverse prompts.
Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects.
We present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model.
arXiv Detail & Related papers (2024-02-05T18:42:34Z)
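ConsiStory's "sharing the internal activations" is commonly realized by letting every image in a batch attend to the self-attention keys and values of all batch members. The sketch below shows just that shared-attention step in PyTorch; the full method additionally gates sharing with subject masks and attention dropout, which are omitted here.

```python
# Rough sketch of subject-consistent generation via shared self-attention:
# each image in the batch attends to the keys/values of every batch member,
# so the pretrained model reuses one subject appearance across prompts.
import torch
import torch.nn.functional as F

def shared_self_attention(q, k, v):
    """q, k, v: (B, N, d) self-attention projections, one row per image."""
    b, n, d = k.shape
    # Flatten the batch's keys/values into one pool shared by all images.
    k_shared = k.reshape(1, b * n, d).expand(b, -1, -1)   # (B, B*N, d)
    v_shared = v.reshape(1, b * n, d).expand(b, -1, -1)
    return F.scaled_dot_product_attention(q, k_shared, v_shared)
```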
- StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On [35.227896906556026]
Given a clothing image and a person image, an image-based virtual try-on aims to generate a customized image that appears natural and accurately reflects the characteristics of the clothing image.
In this work, we aim to expand the applicability of the pre-trained diffusion model so that it can be utilized independently for the virtual try-on task.
Our proposed zero cross-attention blocks not only preserve the clothing details by learning the semantic correspondence but also generate high-fidelity images by utilizing the inherent knowledge of the pre-trained model in the warping process.
arXiv Detail & Related papers (2023-12-04T08:27:59Z)
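The StableVITON summary names zero cross-attention blocks without defining them. The sketch below assumes the common pattern of a cross-attention layer whose output projection is zero-initialized, so the block is a no-op when fine-tuning begins; the class and argument names are hypothetical, not the paper's code.

```python
# Assumed sketch of a zero-initialized cross-attention block: person
# features attend to clothing features, and a zero-initialized output
# projection makes the block an identity before any fine-tuning.
import torch
import torch.nn as nn

class ZeroCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj_out = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj_out.weight)   # block starts as a no-op
        nn.init.zeros_(self.proj_out.bias)

    def forward(self, person_tokens, clothing_tokens):
        attended, _ = self.attn(person_tokens, clothing_tokens, clothing_tokens)
        return person_tokens + self.proj_out(attended)  # residual connection
```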
- Photoswap: Personalized Subject Swapping in Images [56.2650908740358]
Photoswap learns the visual concept of the subject from reference images and swaps it into the target image using pre-trained diffusion models.
Photoswap significantly outperforms baseline methods in human ratings across subject swapping, background preservation, and overall quality.
arXiv Detail & Related papers (2023-05-29T17:56:13Z)
- Putting People in Their Place: Affordance-Aware Human Insertion into Scenes [61.63825003487104]
We study the problem of inferring scene affordances by presenting a method for realistically inserting people into scenes.
Given a scene image with a marked region and an image of a person, we insert the person into the scene while respecting the scene affordances.
Our model can infer the set of realistic poses given the scene context, re-pose the reference person, and harmonize the composition.
arXiv Detail & Related papers (2023-04-27T17:59:58Z)
- Pose-Guided Human Animation from a Single Image in the Wild [83.86903892201656]
We present a new pose transfer method for synthesizing a human animation from a single image of a person controlled by a sequence of body poses.
Existing pose transfer methods exhibit significant visual artifacts when applied to a novel scene.
We design a compositional neural network that predicts the silhouette, garment labels, and textures.
We are able to synthesize human animations that can preserve the identity and appearance of the person in a temporally coherent way without any fine-tuning of the network on the testing scene.
arXiv Detail & Related papers (2020-12-07T15:38:29Z)
- Scene Text Synthesis for Efficient and Effective Deep Network Training [62.631176120557136]
We develop an innovative image synthesis technique that composes annotated training images by embedding foreground objects of interest into background images.
The proposed technique consists of two key components that in principle boost the usefulness of the synthesized images in deep network training.
Experiments over a number of public datasets demonstrate the effectiveness of our proposed image synthesis technique.
arXiv Detail & Related papers (2019-01-26T10:15:24Z)
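The synthesis technique in the last entry amounts to alpha-pasting foreground objects into backgrounds while recording their placement as annotations. A minimal illustration with Pillow follows; the file names are hypothetical, and the uniform-random placement is a simplification of the paper's two components, which choose placement and appearance more carefully.

```python
# Minimal cut-and-paste synthesis: embed a foreground object into a
# background image and record its bounding box as a training annotation.
import random
from PIL import Image

def compose(background_path, foreground_path):
    bg = Image.open(background_path).convert("RGB")
    fg = Image.open(foreground_path).convert("RGBA")  # alpha = object mask
    # Assumes the foreground fits entirely inside the background.
    x = random.randint(0, bg.width - fg.width)
    y = random.randint(0, bg.height - fg.height)
    bg.paste(fg, (x, y), mask=fg)          # alpha channel masks the paste
    bbox = (x, y, x + fg.width, y + fg.height)
    return bg, bbox                        # synthesized image + annotation

image, bbox = compose("street.jpg", "word_crop.png")  # hypothetical files
image.save("synth_train_sample.jpg")
```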