SAEdit: Token-level control for continuous image editing via Sparse AutoEncoder
- URL: http://arxiv.org/abs/2510.05081v1
- Date: Mon, 06 Oct 2025 17:51:04 GMT
- Title: SAEdit: Token-level control for continuous image editing via Sparse AutoEncoder
- Authors: Ronen Kamenetsky, Sara Dorfman, Daniel Garibi, Roni Paiss, Or Patashnik, Daniel Cohen-Or
- Abstract summary: We introduce a method for disentangled and continuous editing through token-level manipulation of text embeddings. The edits are applied by manipulating the embeddings along carefully chosen directions, which control the strength of the target attribute. Our method operates directly on text embeddings without modifying the diffusion process, making it model agnostic and broadly applicable to various image backbones.
- Score: 52.754326452329956
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale text-to-image diffusion models have become the backbone of modern image editing, yet text prompts alone do not offer adequate control over the editing process. Two properties are especially desirable: disentanglement, where changing one attribute does not unintentionally alter others, and continuous control, where the strength of an edit can be smoothly adjusted. We introduce a method for disentangled and continuous editing through token-level manipulation of text embeddings. The edits are applied by manipulating the embeddings along carefully chosen directions, which control the strength of the target attribute. To identify such directions, we employ a Sparse Autoencoder (SAE), whose sparse latent space exposes semantically isolated dimensions. Our method operates directly on text embeddings without modifying the diffusion process, making it model agnostic and broadly applicable to various image synthesis backbones. Experiments show that it enables intuitive and efficient manipulations with continuous control across diverse attributes and domains.
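The abstract describes the core mechanism: encode a token's text embedding with a sparse autoencoder, push one semantically isolated latent dimension by a scalar strength, and decode back. The toy sketch below illustrates that pipeline; the architecture, sizes, and the `edit_token` helper are illustrative assumptions, not the authors' actual code or API.

```python
import numpy as np

# Toy sketch of SAE-based token editing (illustrative, not the paper's code).
rng = np.random.default_rng(0)
d_embed, d_latent = 8, 32  # tiny sizes for illustration

# Assumed pretrained encoder/decoder weights of the sparse autoencoder.
W_enc = rng.standard_normal((d_embed, d_latent)) * 0.1
W_dec = rng.standard_normal((d_latent, d_embed)) * 0.1

def encode(x: np.ndarray) -> np.ndarray:
    # ReLU keeps the latent code sparse and non-negative.
    return np.maximum(x @ W_enc, 0.0)

def decode(z: np.ndarray) -> np.ndarray:
    return z @ W_dec

def edit_token(token_embedding: np.ndarray, attribute_dim: int,
               strength: float) -> np.ndarray:
    """Push one latent dimension by `strength`, then decode back to
    embedding space; `strength` gives continuous control of the edit."""
    z = encode(token_embedding)
    z[attribute_dim] += strength
    return decode(z)

emb = rng.standard_normal(d_embed)          # stand-in for a token embedding
edited = emb.copy()
edited = edit_token(emb, attribute_dim=3, strength=2.0)
```

The edited embedding replaces the original token embedding in the prompt, so the diffusion process itself is untouched, which is what makes the method model agnostic.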
Related papers
- SliderEdit: Continuous Image Editing with Fine-Grained Instruction Control [50.76070785417023]
We introduce SliderEdit, a framework for continuous image editing with fine-grained, interpretable instruction control. Given a multi-part edit instruction, SliderEdit disentangles the individual instructions and exposes each as a globally trained slider. Our results pave the way for interactive, instruction-driven image manipulation with continuous and compositional control.
arXiv Detail & Related papers (2025-11-12T20:21:37Z)
- Group Relative Attention Guidance for Image Editing [38.299491082179905]
Group Relative Attention Guidance (GRAG) is a simple yet effective method that modulates the focus of the model on the input image relative to the editing instruction. Our code will be released at https://www.littlemisfit.com/little-misfit/GRAG-Image-Editing.
arXiv Detail & Related papers (2025-10-28T17:22:44Z)
- Kontinuous Kontext: Continuous Strength Control for Instruction-based Image Editing [76.44219733285898]
Kontinuous Kontext is an instruction-driven editing model that provides a new dimension of control over edit strength. A lightweight projector network maps the input scalar and the edit instruction to coefficients in the model's modulation space. For training our model, we synthesize a diverse dataset of image-edit-instruction-strength quadruplets using existing generative models.
arXiv Detail & Related papers (2025-10-09T17:51:03Z)
- TDEdit: A Unified Diffusion Framework for Text-Drag Guided Image Manipulation [51.72432192816058]
We propose a unified diffusion-based framework for joint drag-text image editing. Our framework introduces two key innovations: (1) Point-Cloud Deterministic Drag, which enhances latent-space layout control through 3D feature mapping, and (2) Drag-Text Guided Denoising, dynamically balancing the influence of drag and text conditions during denoising.
arXiv Detail & Related papers (2025-09-26T05:39:03Z)
- FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model [54.693572837423226]
FireEdit is an innovative Fine-grained Instruction-based image editing framework that exploits a REgion-aware VLM. FireEdit is designed to accurately comprehend user instructions and ensure effective control over the editing process. Our approach surpasses the state-of-the-art instruction-based image editing methods.
arXiv Detail & Related papers (2025-03-25T16:59:42Z)
- Latent Space Disentanglement in Diffusion Transformers Enables Precise Zero-shot Semantic Editing [4.948910649137149]
Diffusion Transformers (DiTs) have recently achieved remarkable success in text-guided image generation.
We show how multimodal information collectively forms this joint space and how they guide the semantics of the synthesized images.
We propose a simple yet effective Encode-Identify-Manipulate (EIM) framework for zero-shot fine-grained image editing.
arXiv Detail & Related papers (2024-11-12T21:34:30Z)
- DragText: Rethinking Text Embedding in Point-based Image Editing [3.4248731707266264]
Point-based image editing enables accurate and flexible control through content dragging. The role of text embedding during the editing process has not been thoroughly investigated. We propose DragText, which optimizes text embeddings in conjunction with the dragging process so that they pair with the modified image embedding.
arXiv Detail & Related papers (2024-07-25T07:57:55Z)
- MagicStick: Controllable Video Editing via Control Handle Transformations [49.29608051543133]
MagicStick is a controllable video editing method that edits the video properties by utilizing the transformation on the extracted internal control signals.
We present experiments on numerous examples within our unified framework.
We also compare with shape-aware text-based editing and handcrafted motion video generation, demonstrating superior temporal consistency and editing capability compared with previous works.
arXiv Detail & Related papers (2023-12-05T17:58:06Z)
- LayerDiffusion: Layered Controlled Image Editing with Diffusion Models [5.58892860792971]
LayerDiffusion is a semantic-based layered controlled image editing method.
We leverage a large-scale text-to-image model and employ a layered controlled optimization strategy.
Experimental results demonstrate the effectiveness of our method in generating highly coherent images.
arXiv Detail & Related papers (2023-05-30T01:26:41Z)
- Zero-shot Image-to-Image Translation [57.46189236379433]
We propose pix2pix-zero, an image-to-image translation method that can preserve the original image without manual prompting.
We propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process.
Our method does not need additional training for these edits and can directly use the existing text-to-image diffusion model.
arXiv Detail & Related papers (2023-02-06T18:59:51Z)
- $S^2$-Flow: Joint Semantic and Style Editing of Facial Images [16.47093005910139]
Generative adversarial networks (GANs) have motivated investigations into their application for image editing.
GANs are often limited in the control they provide for performing specific edits.
We propose a method to disentangle a GAN's latent space into semantic and style spaces.
arXiv Detail & Related papers (2022-11-22T12:00:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.