ObjMST: An Object-Focused Multimodal Style Transfer Framework
- URL: http://arxiv.org/abs/2503.04353v1
- Date: Thu, 06 Mar 2025 11:55:44 GMT
- Title: ObjMST: An Object-Focused Multimodal Style Transfer Framework
- Authors: Chanda Grover Kamra, Indra Deep Mastan, Debayan Gupta
- Abstract summary: We propose an object-focused multimodal style transfer framework that provides separate style supervision for salient objects and surrounding elements. Existing image-text multimodal style transfer methods face the following challenges: (1) generating non-aligned and inconsistent multimodal style representations; and (2) content mismatch, where identical style patterns are applied to both salient objects and their surroundings. Our approach mitigates these issues by: (1) introducing a Style-Specific Masked Directional CLIP Loss, which ensures consistent and aligned style representations for both salient objects and their surroundings; and (2) incorporating a salient-to-key mapping mechanism for stylizing salient objects, followed by image harmonization to seamlessly blend the stylized objects with their environment.
- Score: 2.732041684677653
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose ObjMST, an object-focused multimodal style transfer framework that provides separate style supervision for salient objects and surrounding elements while addressing alignment issues in multimodal representation learning. Existing image-text multimodal style transfer methods face the following challenges: (1) generating non-aligned and inconsistent multimodal style representations; and (2) content mismatch, where identical style patterns are applied to both salient objects and their surrounding elements. Our approach mitigates these issues by: (1) introducing a Style-Specific Masked Directional CLIP Loss, which ensures consistent and aligned style representations for both salient objects and their surroundings; and (2) incorporating a salient-to-key mapping mechanism for stylizing salient objects, followed by image harmonization to seamlessly blend the stylized objects with their environment. We validate the effectiveness of ObjMST through experiments, using both quantitative metrics and qualitative visual evaluations of the stylized outputs. Our code is available at: https://github.com/chandagrover/ObjMST.
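To make the supervision concrete, the sketch below is a minimal, illustrative rendering of a masked directional CLIP-style loss: the image-space direction (stylized minus content embedding) of each masked region is aligned with its own text-space style direction. The function names, the 0/1 mask convention, and the loss form are assumptions for exposition, not the authors' released code; see the repository above for the actual implementation.

```python
# Illustrative sketch of a masked directional CLIP-style loss (PyTorch).
# encode_image and the mask convention are assumptions, not ObjMST's code.
import torch
import torch.nn.functional as F

def directional_loss(delta_img, delta_txt):
    """1 - cosine similarity between image-space and text-space directions."""
    return 1.0 - F.cosine_similarity(delta_img, delta_txt, dim=-1).mean()

def masked_directional_clip_loss(content, stylized, mask,
                                 src_text_emb, obj_style_emb, bg_style_emb,
                                 encode_image):
    """Separate style supervision for the salient object (mask == 1)
    and its surroundings (mask == 0)."""
    total = 0.0
    for region, style_emb in ((mask, obj_style_emb),
                              (1.0 - mask, bg_style_emb)):
        e_src = encode_image(content * region)    # region of content image
        e_out = encode_image(stylized * region)   # same region, stylized
        delta_img = e_out - e_src                 # how the image moved
        delta_txt = style_emb - src_text_emb      # how the text says to move
        total = total + directional_loss(delta_img, delta_txt)
    return total

# Toy check with a stand-in encoder (simple flattening).
enc = lambda x: x.flatten(1)
c, s = torch.rand(2, 3, 8, 8), torch.rand(2, 3, 8, 8)
m = (torch.rand(2, 1, 8, 8) > 0.5).float()
t = torch.rand(192)
print(float(masked_directional_clip_loss(c, s, m, t, t + 0.1, t - 0.1, enc)))
```

In practice, `encode_image` would be a frozen CLIP image encoder and `mask` would come from a salient-object detector, so the object and its surroundings each receive their own style direction.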
Related papers
- Local Prompt Adaptation for Style-Consistent Multi-Object Generation in Diffusion Models [0.0]
We propose a simple, training-free architectural method called Local Prompt Adaptation (LPA).
Our method decomposes the prompt into content and style tokens and injects them selectively into the U-Net's attention layers at different stages (a schematic sketch follows this entry).
We evaluate our method on a custom benchmark of 50 style-rich prompts across five categories and compare against strong baselines including Composer, MultiDiffusion, Attend-and-Excite, LoRA, and SDXL.
arXiv Detail & Related papers (2025-07-27T01:32:13Z)
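As a rough illustration of the routing idea in the LPA summary above, the sketch below splits a prompt into content and style tokens and gates which tokens condition a denoising step. The split heuristic, the halfway stage boundary, and all names are assumptions for exposition, not the paper's method.

```python
# Schematic sketch of stage-dependent prompt routing in the spirit of LPA.
# The split heuristic and the halfway stage boundary are assumptions.

def split_prompt(prompt: str, markers=("in the style of", "style of")):
    """Naively split a prompt into content text and style text."""
    for marker in markers:
        if marker in prompt:
            head, _, tail = prompt.partition(marker)
            return head.strip(" ,."), (marker + " " + tail.strip())
    return prompt.strip(), ""

def conditioning_for_step(step: int, total_steps: int,
                          content: str, style: str) -> str:
    """Early steps condition on content tokens (layout); later steps add
    style tokens (appearance), mimicking selective injection by stage."""
    if step < total_steps // 2 or not style:
        return content
    return f"{content}, {style}"

content, style = split_prompt("a red car and a blue bird in the style of van Gogh")
for step in (0, 25, 49):
    print(step, "->", conditioning_for_step(step, 50, content, style))
```

In a real diffusion pipeline the same gating would be applied per attention layer rather than per prompt string.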
- ObjectRelator: Enabling Cross-View Object Relation Understanding in Ego-Centric and Exo-Centric Videos [105.40690994956667]
The Ego-Exo Object Correspondence task aims to map objects across ego-centric and exo-centric views.
We introduce ObjectRelator, a novel method designed to tackle this task.
arXiv Detail & Related papers (2024-11-28T12:01:03Z)
- MOSAIC: Multi-Object Segmented Arbitrary Stylization Using CLIP [0.0]
Style transfer driven by text prompts has opened a new path for creatively stylizing images without collecting an actual style image.
We propose a new method, Multi-Object Segmented Arbitrary Stylization Using CLIP (MOSAIC), that can apply styles to different objects in the image based on the context extracted from the input prompt.
Our method extends to arbitrary objects and styles and produces high-quality images compared to current state-of-the-art methods (a composition sketch follows this entry).
arXiv Detail & Related papers (2023-09-24T18:24:55Z)
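The per-object composition step implied by the MOSAIC summary can be pictured as mask-weighted blending. In the sketch below, `masks` and `stylized` stand in for the outputs of a segmenter and a CLIP-guided stylizer; both are placeholders rather than MOSAIC's actual pipeline.

```python
# Minimal sketch of blending per-object stylized renderings into one image.
# The segmenter and stylizer that would produce `masks` and `stylized`
# are placeholders; only the composition step is shown.
import numpy as np

def compose_stylized(image, masks, stylized):
    """image: (H, W, 3) uint8; masks: list of (H, W) floats in [0, 1];
    stylized: list of (H, W, 3) arrays, one per object/style pair."""
    out = image.astype(np.float32)
    for m, s in zip(masks, stylized):
        m3 = m[..., None]                              # broadcast over RGB
        out = m3 * s.astype(np.float32) + (1.0 - m3) * out
    return out.clip(0, 255).astype(np.uint8)

# Toy usage: one object mask covering the top half of a black image.
img = np.zeros((4, 4, 3), dtype=np.uint8)
mask = np.zeros((4, 4), dtype=np.float32)
mask[:2] = 1.0
result = compose_stylized(img, [mask], [np.full((4, 4, 3), 255.0)])
print(result[:, :, 0])  # top rows 255, bottom rows 0
```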
- Beyond One-to-One: Rethinking the Referring Image Segmentation [117.53010476628029]
Referring image segmentation aims to segment the target object referred to by a natural language expression.
We propose a Dual Multi-Modal Interaction (DMMI) Network, which contains two decoder branches.
In the text-to-image decoder, the text embedding is used to query the visual features and localize the referred target (see the cross-attention sketch after this entry).
Meanwhile, the image-to-text decoder reconstructs an erased entity phrase conditioned on the visual features.
arXiv Detail & Related papers (2023-08-26T11:39:22Z)
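The text-queries-image direction described above is, in essence, cross-attention with text tokens as queries over visual features. The single-head, projection-free form below is a simplifying assumption, not DMMI's exact decoder.

```python
# Sketch of text-as-query cross-attention over visual features (PyTorch).
# Single head, no learned projections: a simplification of what a
# text-to-image decoder branch would actually use.
import torch

def text_queries_image(text_emb, vis_feat):
    """text_emb: (B, T, D) token queries; vis_feat: (B, HW, D) keys/values.
    Returns per-token visual context of shape (B, T, D)."""
    d = text_emb.shape[-1]
    attn = torch.softmax(
        text_emb @ vis_feat.transpose(-2, -1) / d ** 0.5, dim=-1)  # (B, T, HW)
    return attn @ vis_feat                                         # (B, T, D)

out = text_queries_image(torch.randn(2, 5, 64), torch.randn(2, 16, 64))
print(out.shape)  # torch.Size([2, 5, 64])
```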
- Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) a Position-Aware Module (PAM), which provides position information for all objects related to the natural language description, and 2) a Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment (an InfoNCE-style sketch follows this entry).
arXiv Detail & Related papers (2022-12-27T09:13:19Z)
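Contrastive alignment of this kind is commonly written as an InfoNCE loss that pulls the referred object's embedding toward the sentence embedding and pushes other candidates away. The sketch below shows that generic form under assumed shapes; PCAN's actual losses are richer.

```python
# Generic InfoNCE-style alignment between candidate object embeddings and
# a sentence embedding; a stand-in for the richer losses in PCAN.
import torch
import torch.nn.functional as F

def contrastive_alignment(obj_emb, text_emb, target_idx, tau=0.07):
    """obj_emb: (N, D) candidate embeddings; text_emb: (D,) sentence
    embedding; target_idx: index of the referred object."""
    obj = F.normalize(obj_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    logits = (obj @ txt) / tau                 # (N,) similarities
    return F.cross_entropy(logits.unsqueeze(0),
                           torch.tensor([target_idx]))

loss = contrastive_alignment(torch.randn(8, 256), torch.randn(256), target_idx=3)
print(float(loss))
```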
- Multi-modal Transformers Excel at Class-agnostic Object Detection [105.10403103027306]
We argue that existing methods lack a top-down supervision signal governed by human-understandable semantics.
We develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention.
We show the significance of MViT proposals in a diverse range of applications.
arXiv Detail & Related papers (2021-11-22T18:59:29Z)
- Context-Aware Layout to Image Generation with Enhanced Object Appearance [123.62597976732948]
A layout-to-image (L2I) generation model aims to generate a complicated image containing multiple objects (things) against a natural background (stuff).
Existing L2I models have made great progress, but object-to-object and object-to-stuff relations are often broken.
We argue that these failures are caused by the lack of context-aware object and stuff feature encoding in their generators, and of location-sensitive appearance representation in their discriminators.
arXiv Detail & Related papers (2021-03-22T14:43:25Z)
- DeepObjStyle: Deep Object-based Photo Style Transfer [31.75300124593133]
One of the major challenges of style transfer is providing appropriate image-feature supervision between the output image and the input (style and content) images (a generic loss sketch follows this entry).
We propose an object-based style transfer approach, called DeepObjStyle, for style supervision in a training-data-independent framework.
arXiv Detail & Related papers (2020-12-11T17:02:01Z)
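For readers unfamiliar with what "image-feature supervision" means concretely, the sketch below shows the classic formulation: content features matched directly and style features matched via Gram matrices. This is the generic Gatys-style loss, shown for context only; it is not claimed to be DeepObjStyle's objective.

```python
# Generic content/style feature supervision (Gatys-style), for context.
# Not DeepObjStyle's specific objective.
import torch
import torch.nn.functional as F

def gram(feat):
    """Gram matrix of a (B, C, H, W) feature map, normalized by size."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return (f @ f.transpose(1, 2)) / (c * h * w)

def supervision_losses(out_feats, content_feats, style_feats):
    """Each argument is a list of feature maps from the same encoder layers."""
    l_content = sum(F.mse_loss(o, c) for o, c in zip(out_feats, content_feats))
    l_style = sum(F.mse_loss(gram(o), gram(s))
                  for o, s in zip(out_feats, style_feats))
    return l_content, l_style

feats = lambda: [torch.randn(1, 8, 16, 16), torch.randn(1, 16, 8, 8)]
print(supervision_losses(feats(), feats(), feats()))
```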
- Empty Cities: a Dynamic-Object-Invariant Space for Visual SLAM [6.693607456009373]
We present a data-driven approach to obtain the static image of a scene, eliminating dynamic objects that might have been present at the time of traversing the scene with a camera.
We introduce an end-to-end deep learning framework to turn images of an urban environment into realistic static frames suitable for localization and mapping.
arXiv Detail & Related papers (2020-10-15T10:31:12Z)
- Object-Centric Image Generation from Layouts [93.10217725729468]
We develop a layout-to-image generation method to generate complex scenes with multiple objects.
Our method learns representations of the spatial relationships between objects in the scene, which lead to our model's improved layout fidelity.
We introduce SceneFID, an object-centric adaptation of the popular Fréchet Inception Distance metric that is better suited for multi-object images (a sketch follows this entry).
arXiv Detail & Related papers (2020-03-16T21:40:09Z)
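The object-centric idea behind SceneFID can be pictured as: crop each object by its layout box, embed the crops, and compute the Fréchet distance between real and generated crop statistics. The sketch below shows that recipe with a placeholder feature extractor (standard FID uses InceptionV3 features); it is an illustration of the idea, not the paper's implementation.

```python
# Sketch of an object-centric FID in the spirit of SceneFID.
# The feature extractor producing (N, D) arrays is a placeholder.
import numpy as np
from scipy import linalg

def crop_objects(images, boxes):
    """images: list of (H, W, 3) arrays; boxes: per-image lists of
    (x0, y0, x1, y1). Returns a flat list of object crops to embed."""
    return [img[y0:y1, x0:x1] for img, bs in zip(images, boxes)
            for (x0, y0, x1, y1) in bs]

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fit to (N, D) feature arrays."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real            # drop numerical imaginary parts
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

# Toy check on random "crop features" for real vs. generated images.
feats_real = np.random.randn(256, 64)
feats_fake = np.random.randn(256, 64) + 0.5
print(frechet_distance(feats_real, feats_fake))
```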