Beyond Realism: Learning the Art of Expressive Composition with StickerNet
- URL: http://arxiv.org/abs/2511.20957v1
- Date: Wed, 26 Nov 2025 01:24:30 GMT
- Title: Beyond Realism: Learning the Art of Expressive Composition with StickerNet
- Authors: Haoming Lu, David Kocharian, Humphrey Shi
- Abstract summary: We present StickerNet, a two-stage framework that first determines the composition type, then predicts placement parameters such as opacity, mask, location, and scale accordingly. Unlike prior work that constructs datasets by simulating object placements on real images, we build our dataset directly from 1.8 million editing actions collected on an anonymous online visual creation and editing platform. User studies and quantitative evaluations show that StickerNet outperforms common baselines and closely matches human placement behavior.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As a widely used operation in image editing workflows, image composition has traditionally been studied with a focus on achieving visual realism and semantic plausibility. However, in the modern content creation landscape, many compositions are not intended to preserve realism. Instead, users of online platforms, often motivated by community recognition, aim to create content that is more artistic, playful, or socially engaging. Taking inspiration from this observation, we define the expressive composition task, a new formulation of image composition that embraces stylistic diversity and looser placement logic, reflecting how users edit images on real-world creative platforms. To address this underexplored problem, we present StickerNet, a two-stage framework that first determines the composition type, then predicts placement parameters such as opacity, mask, location, and scale accordingly. Unlike prior work that constructs datasets by simulating object placements on real images, we build our dataset directly from 1.8 million editing actions collected on an anonymous online visual creation and editing platform, each reflecting a community-validated placement decision. This grounding in authentic editing behavior ensures strong alignment between the task definition and the training supervision. User studies and quantitative evaluations show that StickerNet outperforms common baselines and closely matches human placement behavior, demonstrating the effectiveness of learning from real-world editing patterns despite the inherent ambiguity of the task. This work introduces a new direction in visual understanding that emphasizes expressiveness and user intent over realism.
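To make the two-stage formulation concrete, below is a minimal PyTorch sketch of the idea described in the abstract: one head classifies the composition type, and a second head predicts normalized placement parameters conditioned on that type. The composition-type set, feature dimensions, and module layout are illustrative assumptions rather than the paper's actual architecture, and the mask prediction mentioned in the abstract is omitted for brevity.

```python
import torch
import torch.nn as nn

NUM_COMPOSITION_TYPES = 4  # hypothetical set, e.g. overlay / blend / cutout / frame


class TwoStagePlacementNet(nn.Module):
    """Stage 1 classifies the composition type; stage 2 predicts
    placement parameters conditioned on the predicted type."""

    def __init__(self, feat_dim: int = 512, type_emb_dim: int = 64):
        super().__init__()
        joint_dim = feat_dim * 2  # concatenated canvas + sticker features
        self.type_head = nn.Linear(joint_dim, NUM_COMPOSITION_TYPES)
        self.type_embed = nn.Embedding(NUM_COMPOSITION_TYPES, type_emb_dim)
        # Four placement parameters, each squashed to [0, 1]: x, y, scale, opacity.
        self.placement_head = nn.Sequential(
            nn.Linear(joint_dim + type_emb_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 4),
            nn.Sigmoid(),
        )

    def forward(self, canvas_feat: torch.Tensor, sticker_feat: torch.Tensor):
        joint = torch.cat([canvas_feat, sticker_feat], dim=-1)
        type_logits = self.type_head(joint)        # stage 1: composition type
        comp_type = type_logits.argmax(dim=-1)
        cond = torch.cat([joint, self.type_embed(comp_type)], dim=-1)
        placement = self.placement_head(cond)      # stage 2: x, y, scale, opacity
        return type_logits, placement


model = TwoStagePlacementNet()
canvas, sticker = torch.randn(2, 512), torch.randn(2, 512)  # stand-ins for encoded images
type_logits, placement = model(canvas, sticker)
print(type_logits.shape, placement.shape)  # torch.Size([2, 4]) torch.Size([2, 4])
```

In practice the canvas and sticker features would come from an image encoder, and both heads would be supervised with the composition types and placement parameters logged from the platform's editing actions.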
Related papers
- ScribbleSense: Generative Scribble-Based Texture Editing with Intent Prediction
ScribbleSense is an editing method that combines multimodal large language models (MLLMs) and image generation models. We leverage the visual capabilities of MLLMs to predict the editing intent behind the scribbles. Globally generated images are employed to extract local texture details.
arXiv Detail & Related papers (2026-01-30T01:55:44Z)
- Neural Scene Designer: Self-Styled Semantic Image Manipulation
We introduce the Neural Scene Designer (NSD), a novel framework that enables photo-realistic manipulation of user-specified scene regions. NSD ensures both semantic alignment with user intent and stylistic consistency with the surrounding environment. To capture fine-grained style representations, we propose the Progressive Self-style Representational Learning (PSRL) module.
arXiv Detail & Related papers (2025-09-01T11:59:03Z)
- SPIE: Semantic and Structural Post-Training of Image Editing Diffusion Models with AI feedback
SPIE is a novel approach for semantic and structural post-training of instruction-based image editing diffusion models. We introduce an online reinforcement learning framework that aligns the diffusion model with human preferences without relying on extensive human annotations. Experimental results demonstrate that SPIE can perform intricate edits in complex scenes after just 10 training steps.
arXiv Detail & Related papers (2025-04-17T10:46:39Z)
- MagicTailor: Component-Controllable Personalization in Text-to-Image Diffusion Models
We introduce component-controllable personalization, a new task that enables users to customize and reconfigure individual components within concepts. This task faces two challenges: semantic pollution, where undesired elements disrupt the target concept, and semantic imbalance, which causes disproportionate learning of the target concept and component. We design MagicTailor, a framework that uses Dynamic Masked Degradation to adaptively perturb unwanted visual semantics and Dual-Stream Balancing for more balanced learning of desired visual semantics.
arXiv Detail & Related papers (2024-10-17T09:22:53Z)
- Empowering Visual Creativity: A Vision-Language Assistant to Image Editing Recommendations
We introduce the task of Image Editing Recommendation (IER).
IER aims to automatically generate diverse creative editing instructions from an input image and a simple prompt representing the users' under-specified editing purpose.
We present Creativity-Vision Language Assistant (Creativity-VLA), a multimodal framework designed specifically for edit-instruction generation.
arXiv Detail & Related papers (2024-05-31T18:22:29Z)
- Highly Personalized Text Embedding for Image Manipulation by Stable Diffusion
We present a simple yet highly effective approach to personalization using highly personalized (HiPer) text embedding.
Our method does not require model fine-tuning or identifiers, yet still enables manipulation of background, texture, and motion with just a single image and target text.
arXiv Detail & Related papers (2023-03-15T17:07:45Z)
- Structure-Guided Image Completion with Image-level and Object-level Semantic Discriminators
We propose a learning paradigm that consists of semantic discriminators and object-level discriminators for improving the generation of complex semantics and objects.
Specifically, the semantic discriminators leverage pretrained visual features to improve the realism of the generated visual concepts.
Our proposed scheme significantly improves the generation quality and achieves state-of-the-art results on various tasks.
arXiv Detail & Related papers (2022-12-13T01:36:56Z)
- Towards Counterfactual Image Manipulation via CLIP
Existing methods can achieve realistic editing of different visual attributes such as age and gender of facial images.
We investigate this problem in a text-driven manner with Contrastive Language-Image Pre-training (CLIP).
We design a novel contrastive loss that exploits predefined CLIP-space directions to guide the editing toward desired directions from different perspectives (see the sketch after this entry).
arXiv Detail & Related papers (2022-07-06T17:02:25Z)
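As an illustration of the CLIP-space-direction idea in the entry above, here is a minimal directional CLIP loss of the kind popularized by StyleGAN-NADA; this is a generic sketch, not that paper's actual loss or code. The model variant, text pair, and helper names are assumptions, and it relies on OpenAI's clip package.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # CLIP stays frozen


def text_direction(source: str, target: str) -> torch.Tensor:
    """Predefined CLIP-space direction derived from a source/target text pair."""
    tokens = clip.tokenize([source, target]).to(device)
    with torch.no_grad():
        feats = F.normalize(model.encode_text(tokens).float(), dim=-1)
    return F.normalize(feats[1] - feats[0], dim=-1)


def directional_loss(img_before: torch.Tensor, img_after: torch.Tensor,
                     direction: torch.Tensor) -> torch.Tensor:
    """1 - cosine similarity between the image edit direction and the text
    direction; minimal when the edit moves along the predefined direction.
    img_before / img_after: preprocessed batches of shape (B, 3, 224, 224)."""
    f_before = F.normalize(model.encode_image(img_before).float(), dim=-1)
    f_after = F.normalize(model.encode_image(img_after).float(), dim=-1)
    edit = F.normalize(f_after - f_before, dim=-1)
    return (1.0 - (edit * direction).sum(dim=-1)).mean()
```

A training loop would score edited images against, say, `text_direction("a face", "a smiling face")` and backpropagate through the editing model while CLIP's weights remain fixed.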