ScribbleSense: Generative Scribble-Based Texture Editing with Intent Prediction
- URL: http://arxiv.org/abs/2601.22455v1
- Date: Fri, 30 Jan 2026 01:55:44 GMT
- Title: ScribbleSense: Generative Scribble-Based Texture Editing with Intent Prediction
- Authors: Yudi Zhang, Yeming Geng, Lei Zhang
- Abstract summary: ScribbleSense is an editing method that combines multimodal large language models (MLLMs) and image generation models. We leverage the visual capabilities of MLLMs to predict the editing intent behind the scribbles. Globally generated images are employed to extract local texture details.
- Score: 5.109590115201006
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Interactive 3D model texture editing presents enhanced opportunities for creating 3D assets, with freehand drawing style offering the most intuitive experience. However, existing methods primarily support sketch-based interactions for outlining, while the utilization of coarse-grained scribble-based interaction remains limited. Furthermore, current methodologies often encounter challenges due to the abstract nature of scribble instructions, which can result in ambiguous editing intentions and unclear target semantic locations. To address these issues, we propose ScribbleSense, an editing method that combines multimodal large language models (MLLMs) and image generation models to effectively resolve these challenges. We leverage the visual capabilities of MLLMs to predict the editing intent behind the scribbles. Once the semantic intent of the scribble is discerned, we employ globally generated images to extract local texture details, thereby anchoring local semantics and alleviating ambiguities concerning the target semantic locations. Experimental results indicate that our method effectively leverages the strengths of MLLMs, achieving state-of-the-art interactive editing performance for scribble-based texture editing.
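The abstract outlines a three-stage pipeline: an MLLM reads the scribbled render to predict the editing intent, a generator produces a globally coherent image from that intent, and the local texture is extracted from the scribbled region to anchor where the edit belongs. The sketch below shows how such a loop could be wired together; every name (`predict_intent`, `edit_texture`, the `mllm.chat` and `generator(...)` interfaces) is a hypothetical stand-in, not the authors' code.
```python
# A minimal, hypothetical sketch of a ScribbleSense-style editing loop.
# Function names, the MLLM chat interface, and the generator call are all
# assumptions; the paper does not publish this code.
import numpy as np

def crop_with_mask(image, mask):
    """Crop a PIL image to the bounding box of a binary scribble mask."""
    ys, xs = np.nonzero(mask)
    return image.crop((int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1))

def predict_intent(mllm, rendered_view, scribbled_view):
    """Ask a multimodal LLM what texture edit the coarse scribble is meant to express."""
    prompt = (
        "The second image adds a rough scribble to the first. "
        "In one short phrase, what texture edit does the user intend, and where?"
    )
    # Assumed chat-style MLLM interface; adapt to the model actually used.
    return mllm.chat(images=[rendered_view, scribbled_view], prompt=prompt)

def edit_texture(mllm, generator, rendered_view, scribbled_view, scribble_mask):
    """Predict the intent, generate a global reference image, then extract local texture."""
    intent = predict_intent(mllm, rendered_view, scribbled_view)
    # A globally generated image anchors where the new texture belongs semantically.
    global_image = generator(prompt=intent, image=rendered_view).images[0]
    # Keep only the scribbled region as the local texture patch to paste back.
    local_patch = crop_with_mask(global_image, scribble_mask)
    return intent, local_patch
```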
Related papers
- DreamOmni3: Scribble-based Editing and Generation [72.52583595391944]
We introduce DreamOmni3, tackling two challenges: data creation and framework design. For scribble-based editing, we define four tasks: scribble and instruction-based editing, scribble and multimodal instruction-based generation, and doodle generation. For the framework, instead of using binary masks, we propose a joint input scheme that feeds both the original and scribbled source images into the model.
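The joint input scheme can be pictured as conditioning the editing model on both images at once rather than on a binary mask. The snippet below is an illustrative assumption about what that could look like; the tensor shapes and the channel-wise concatenation are not taken from the paper.
```python
# Hypothetical illustration of a joint-input conditioning scheme: both the clean
# and the scribbled source image are fed to the editing model instead of a
# binary mask. Shapes are assumptions, not DreamOmni3's actual implementation.
import torch

original = torch.randn(1, 3, 512, 512)   # clean source image (batch, C, H, W)
scribbled = torch.randn(1, 3, 512, 512)  # same image with user scribbles drawn on it

# Concatenating along the channel axis gives a 6-channel conditioning input.
joint_condition = torch.cat([original, scribbled], dim=1)
print(joint_condition.shape)  # torch.Size([1, 6, 512, 512])
```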
arXiv Detail & Related papers (2025-12-27T09:07:12Z) - SmartFreeEdit: Mask-Free Spatial-Aware Image Editing with Complex Instruction Understanding [46.767486063775266]
SmartFreeEdit is an end-to-end framework that integrates a multimodal large language model (MLLM) with a hypergraph-enhanced inpainting architecture. Key innovations of SmartFreeEdit include region-aware tokens and a mask embedding paradigm. Experiments on the Reason-Edit benchmark demonstrate that SmartFreeEdit surpasses current state-of-the-art methods.
arXiv Detail & Related papers (2025-04-17T07:17:49Z) - BrushEdit: All-In-One Image Inpainting and Editing [76.93556996538398]
BrushEdit is a novel inpainting-based instruction-guided image editing paradigm. We devise a system enabling free-form instruction editing by integrating MLLMs and a dual-branch image inpainting model. Our framework effectively combines MLLMs and inpainting models, achieving superior performance across seven metrics.
arXiv Detail & Related papers (2024-12-13T17:58:06Z) - DiffUHaul: A Training-Free Method for Object Dragging in Images [78.93531472479202]
We propose a training-free method, dubbed DiffUHaul, for the object dragging task.
We first apply attention masking in each denoising step to make the generation more disentangled across different objects.
In the early denoising steps, we interpolate the attention features between source and target images to smoothly fuse new layouts with the original appearance.
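Read literally, the two ideas above are (1) restricting attention so tokens of different objects do not mix and (2) interpolating attention features between the source and target during early denoising steps. A hedged sketch of both, under assumed tensor layouts and with illustrative names, follows; it is not DiffUHaul's released code.
```python
# Hedged sketch: attention masking for disentanglement, plus source/target
# attention-feature interpolation during early denoising steps.
import torch

def masked_attention(scores, same_object_mask):
    """Suppress attention between tokens that belong to different objects.

    `scores` and `same_object_mask` are assumed broadcastable; True marks
    query/key pairs from the same object that are allowed to attend.
    """
    return scores.masked_fill(~same_object_mask, float("-inf")).softmax(dim=-1)

def blend_attention_features(source_feats, target_feats, step, warmup_steps=10):
    """Interpolate from source toward target attention features so the new
    layout fuses smoothly with the original appearance."""
    alpha = min(step / warmup_steps, 1.0)  # 0 keeps the source, 1 is fully target
    return (1.0 - alpha) * source_feats + alpha * target_feats
```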
arXiv Detail & Related papers (2024-06-03T17:59:53Z) - TexSliders: Diffusion-Based Texture Editing in CLIP Space [17.449209402077276]
We analyze existing editing methods and show that they are not directly applicable to textures.
We propose a novel approach that instead manipulates CLIP image embeddings to condition the diffusion generation.
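One way to picture "manipulating CLIP image embeddings to condition the diffusion generation" is to shift an image embedding along a CLIP-space direction before passing it to an image-conditioned diffusion model. The sketch below derives such a direction from two texture descriptions with the public openai/clip-vit-base-patch32 checkpoint; the slider construction shown here is an illustrative assumption, not TexSliders' exact procedure.
```python
# Illustrative sketch of shifting a CLIP image embedding along a texture
# direction before using it as a diffusion condition (not the paper's code).
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def texture_direction(src_prompt, dst_prompt):
    """Unit CLIP-space direction between two texture descriptions."""
    tokens = processor(text=[src_prompt, dst_prompt], return_tensors="pt", padding=True)
    with torch.no_grad():
        embeds = model.get_text_features(**tokens)
    direction = embeds[1] - embeds[0]
    return direction / direction.norm()

def edited_condition(image_embed, direction, strength=0.5):
    """Shifted image embedding; `strength` plays the role of a user-facing slider."""
    return image_embed + strength * direction

# Example: nudge an embedding from "smooth plastic" toward "rough rusted metal".
# cond = edited_condition(image_embed, texture_direction("smooth plastic", "rough rusted metal"), 0.7)
```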
arXiv Detail & Related papers (2024-05-01T17:57:21Z) - Text-Driven Image Editing via Learnable Regions [74.45313434129005]
We introduce a method for region-based image editing driven by textual prompts, without the need for user-provided masks or sketches.
We show that this simple approach enables flexible editing that is compatible with current image generation models.
Experiments demonstrate the competitive performance of our method in manipulating images with high fidelity and realism that correspond to the provided language descriptions.
arXiv Detail & Related papers (2023-11-28T02:27:31Z) - Directional Texture Editing for 3D Models [51.31499400557996]
ITEM3D is designed for automatic 3D object editing according to the text instructions.
Leveraging the diffusion models and the differentiable rendering, ITEM3D takes the rendered images as the bridge of text and 3D representation.
arXiv Detail & Related papers (2023-09-26T12:01:13Z) - Towards Counterfactual Image Manipulation via CLIP [106.94502632502194]
Existing methods can achieve realistic editing of different visual attributes such as age and gender of facial images.
We investigate this problem in a text-driven manner with Contrastive Language-Image Pre-training (CLIP).
We design a novel contrastive loss that exploits predefined CLIP-space directions to guide the editing toward desired directions from different perspectives.
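A directional, contrastive-style loss of this kind can be sketched as pulling the CLIP-space direction of the edit toward one predefined target direction while pushing it away from other directions used as negatives. The snippet below is a hedged illustration under assumed shapes, not the paper's exact formulation.
```python
# Hedged sketch of a directional, contrastive-style loss in CLIP space.
import torch
import torch.nn.functional as F

def directional_contrastive_loss(src_embed, edit_embed, target_dir, negative_dirs, tau=0.07):
    """src_embed/edit_embed: [B, D] CLIP image embeddings before/after editing.
    target_dir: [D] predefined CLIP-space direction; negative_dirs: [N, D]."""
    delta = F.normalize(edit_embed - src_embed, dim=-1)        # direction of the edit
    pos = (delta * F.normalize(target_dir, dim=-1)).sum(-1, keepdim=True)
    neg = delta @ F.normalize(negative_dirs, dim=-1).T         # similarity to negatives
    logits = torch.cat([pos, neg], dim=-1) / tau
    labels = torch.zeros(logits.shape[0], dtype=torch.long)    # the positive is index 0
    return F.cross_entropy(logits, labels)
```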
arXiv Detail & Related papers (2022-07-06T17:02:25Z) - Blended Diffusion for Text-driven Editing of Natural Images [18.664733153082146]
We introduce the first solution for performing local (region-based) edits in generic natural images.
We achieve our goal by leveraging and combining a pretrained language-image model (CLIP)
To seamlessly fuse the edited region with the unchanged parts of the image, we spatially blend noised versions of the input image with the local text-guided diffusion latent.
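The blending step described above can be sketched as follows: at each denoising step, the latent outside the edit region is replaced with a correspondingly noised copy of the original image, while the text-guided latent is kept inside the mask. The code assumes a diffusers-style scheduler exposing an add_noise() method and is an illustration, not the authors' released implementation.
```python
# Hedged sketch of spatially blending a noised original image with the
# text-guided diffusion latent at one denoising step.
import torch

def blend_step(edited_latent, original_latent, mask, scheduler, t):
    """Keep the text-guided latent inside `mask`; restore the noised original elsewhere."""
    noise = torch.randn_like(original_latent)
    noised_original = scheduler.add_noise(original_latent, noise, t)
    return mask * edited_latent + (1.0 - mask) * noised_original
```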
arXiv Detail & Related papers (2021-11-29T18:58:49Z)