Real-Time Intuitive AI Drawing System for Collaboration: Enhancing Human Creativity through Formal and Contextual Intent Integration
- URL: http://arxiv.org/abs/2508.19254v1
- Date: Tue, 12 Aug 2025 01:34:23 GMT
- Title: Real-Time Intuitive AI Drawing System for Collaboration: Enhancing Human Creativity through Formal and Contextual Intent Integration
- Authors: Jookyung Song, Mookyoung Kang, Nojun Kwak
- Abstract summary: This paper presents a real-time generative drawing system that interprets and integrates both formal intent and contextual intent. The system achieves low-latency, two-stage transformation while supporting multi-user collaboration on shared canvases.
- Score: 26.920087528015205
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a real-time generative drawing system that interprets and integrates both formal intent - the structural, compositional, and stylistic attributes of a sketch - and contextual intent - the semantic and thematic meaning inferred from its visual content - into a unified transformation process. Unlike conventional text-prompt-based generative systems, which primarily capture high-level contextual descriptions, our approach simultaneously analyzes ground-level intuitive geometric features such as line trajectories, proportions, and spatial arrangement, and high-level semantic cues extracted via vision-language models. These dual intent signals are jointly conditioned in a multi-stage generation pipeline that combines contour-preserving structural control with style- and content-aware image synthesis. Implemented with a touchscreen-based interface and distributed inference architecture, the system achieves low-latency, two-stage transformation while supporting multi-user collaboration on shared canvases. The resulting platform enables participants, regardless of artistic expertise, to engage in synchronous, co-authored visual creation, redefining human-AI interaction as a process of co-creation and mutual enhancement.
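The two-stage, dual-intent pipeline the abstract describes can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: all class and function names are hypothetical, the vision-language stage is stubbed out, and the generation step merely assembles the conditioning signals a real structure-controlled diffusion model would consume.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FormalIntent:
    # ground-level structural attributes extracted from the sketch
    line_trajectories: List[Tuple[float, float]]
    aspect_ratio: float

@dataclass
class ContextualIntent:
    # high-level semantic cues, in the paper inferred via a vision-language model
    caption: str
    themes: List[str]

def extract_formal_intent(strokes: List[Tuple[float, float]]) -> FormalIntent:
    """Stage 1a: derive geometric features (here, just the bounding-box
    aspect ratio) from raw stroke points."""
    xs = [p[0] for p in strokes]
    ys = [p[1] for p in strokes]
    width = (max(xs) - min(xs)) or 1.0
    height = (max(ys) - min(ys)) or 1.0
    return FormalIntent(line_trajectories=strokes, aspect_ratio=width / height)

def extract_contextual_intent(strokes: List[Tuple[float, float]]) -> ContextualIntent:
    """Stage 1b: stub standing in for a VLM that infers semantic meaning."""
    return ContextualIntent(caption="a sketched figure", themes=["figure"])

def generate(formal: FormalIntent, contextual: ContextualIntent) -> dict:
    """Stage 2: jointly condition synthesis on both intent signals.
    A real pipeline would feed the contours to a contour-preserving
    control module and the caption to a text-conditioned image model."""
    return {
        "structure_condition": formal.line_trajectories,
        "prompt": f"{contextual.caption}, themes: {', '.join(contextual.themes)}",
        "preserve_aspect": formal.aspect_ratio,
    }

strokes = [(0.0, 0.0), (2.0, 1.0), (4.0, 2.0)]
result = generate(extract_formal_intent(strokes), extract_contextual_intent(strokes))
```

The point of the sketch is the joint conditioning: neither intent signal replaces the other; both are carried into the generation step together.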
Related papers
- Communication-Inspired Tokenization for Structured Image Representations [74.17163003465537]
COMmunication inspired Tokenization (COMiT) is a framework for learning structured discrete visual token sequences. Our experiments demonstrate that while semantic alignment provides grounding, attentive sequential tokenization is critical for inducing interpretable, object-centric token structure.
arXiv Detail & Related papers (2026-02-24T09:53:50Z) - Two-Stream Interactive Joint Learning of Scene Parsing and Geometric Vision Tasks [24.19752468668527]
Two Interactive Streams (TwInS) is a novel bio-inspired joint learning framework capable of simultaneously performing scene parsing and geometric vision tasks. To eliminate the dependence on costly human-annotated correspondence ground truth, TwInS is equipped with a tailored semi-supervised training strategy.
arXiv Detail & Related papers (2026-02-14T04:11:19Z) - Sissi: Zero-shot Style-guided Image Synthesis via Semantic-style Integration [57.02757226679549]
We introduce a training-free framework that reformulates style-guided synthesis as an in-context learning task. We propose a Dynamic Semantic-Style Integration (DSSI) mechanism that reweights attention between semantic and style visual tokens. Experiments show that our approach achieves high-fidelity stylization with superior semantic-style balance and visual quality.
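The attention reweighting idea in the blurb above can be illustrated with a toy scalar-gain version. This is a simplified sketch under stated assumptions, not the DSSI mechanism itself: `reweighted_attention` and `style_gain` are hypothetical names, and real systems operate on learned key/query projections inside a diffusion backbone.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def reweighted_attention(query, semantic_keys, style_keys, style_gain=1.5):
    """Scale the attention scores of style tokens by style_gain before the
    softmax, shifting attention mass toward style without any retraining.
    Returns one weight per token: semantic tokens first, then style tokens."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scores = [dot(query, k) for k in semantic_keys] + \
             [style_gain * dot(query, k) for k in style_keys]
    return softmax(scores)
```

With `style_gain=1.0` the mechanism reduces to ordinary attention; values above 1.0 bias the output toward style tokens, which is the kind of balance such a reweighting scheme controls.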
arXiv Detail & Related papers (2026-01-10T16:01:14Z) - EVLP: Learning Unified Embodied Vision-Language Planner with Reinforced Supervised Fine-Tuning [44.254412516852874]
Current methods fail to adopt a unified generation framework for multimodal planning, leading to inconsistencies in multimodal planning. Our approach achieves multimodal planning for long-horizon tasks through a novel training pipeline incorporating dynamic pretraining and reinforced alignment.
arXiv Detail & Related papers (2025-11-03T10:24:49Z) - Mind-the-Glitch: Visual Correspondence for Detecting Inconsistencies in Subject-Driven Generation [120.23172120151821]
We propose a novel approach for disentangling visual and semantic features from the backbones of pre-trained diffusion models. We introduce an automated pipeline that constructs image pairs with annotated semantic and visual correspondences. We propose a new metric, Visual Semantic Matching, that quantifies visual inconsistencies in subject-driven image generation.
arXiv Detail & Related papers (2025-09-26T07:11:55Z) - PosterForest: Hierarchical Multi-Agent Collaboration for Scientific Poster Generation [28.02969134846803]
We introduce the Poster Tree, a hierarchical intermediate representation that jointly encodes document structure and visual-textual relationships. Our framework employs a multi-agent collaboration strategy, where agents specializing in content summarization and layout planning iteratively coordinate and provide mutual feedback.
arXiv Detail & Related papers (2025-08-29T15:36:06Z) - Cross-Modal Prototype Augmentation and Dual-Grained Prompt Learning for Social Media Popularity Prediction [16.452218354378452]
Social Media Popularity Prediction is a complex task that requires effective integration of images, text, and structured information. We introduce hierarchical prototypes for structural enhancement and contrastive learning for improved vision-text alignment. We propose a feature-enhanced framework integrating dual-grained prompt learning and cross-modal attention mechanisms.
arXiv Detail & Related papers (2025-08-22T07:16:47Z) - High Fidelity Text to Image Generation with Contrastive Alignment and Structural Guidance [0.0]
This paper addresses the performance of existing text-driven image generation methods in terms of semantic alignment accuracy and structural consistency. A high-fidelity image generation method is proposed by integrating text-image contrastive constraints with structural guidance mechanisms. The results show that the method effectively bridges the gap between semantic alignment and structural fidelity without increasing computational complexity.
arXiv Detail & Related papers (2025-08-14T02:15:11Z) - Piece it Together: Part-Based Concepting with IP-Priors [52.01640707131325]
We introduce a generative framework that seamlessly integrates a partial set of user-provided visual components into a coherent composition. Our approach builds on a strong and underexplored representation space, extracted from IP-Adapter+. We also present a LoRA-based fine-tuning strategy that significantly improves prompt adherence in IP-Adapter+ for a given task.
arXiv Detail & Related papers (2025-03-13T13:46:10Z) - Duplex: Dual Prototype Learning for Compositional Zero-Shot Learning [17.013498508426398]
Compositional Zero-Shot Learning (CZSL) aims to enable models to recognize novel compositions of visual states and objects that were absent during training. We propose Duplex, a novel dual-prototype learning method that integrates semantic and visual prototypes through a carefully designed dual-branch architecture.
arXiv Detail & Related papers (2025-01-13T08:04:32Z) - MetaDesigner: Advancing Artistic Typography Through AI-Driven, User-Centric, and Multilingual WordArt Synthesis [65.78359025027457]
MetaDesigner introduces a transformative framework for artistic typography, powered by Large Language Models (LLMs). Its foundation is a multi-agent system comprising the Pipeline, Glyph, and Texture agents, which collectively orchestrate the creation of customizable WordArt.
arXiv Detail & Related papers (2024-06-28T11:58:26Z) - Person-in-Context Synthesis with Compositional Structural Space [59.129960774988284]
We propose a new problem, Persons in Context Synthesis, which aims to synthesize diverse person instance(s) in consistent contexts.
The context is specified by the bounding box object layout, which lacks shape information, while the pose of the person(s) is specified by sparsely annotated keypoints.
To handle the stark difference in input structures, we propose two separate neural branches to attentively composite the respective (context/person) inputs into a shared compositional structural space.
This structural space is then decoded to the image space using a multi-level feature modulation strategy, and learned in a self-supervised manner.
arXiv Detail & Related papers (2020-08-28T14:33:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.