DiffCloth: Diffusion Based Garment Synthesis and Manipulation via
Structural Cross-modal Semantic Alignment
- URL: http://arxiv.org/abs/2308.11206v1
- Date: Tue, 22 Aug 2023 05:43:33 GMT
- Title: DiffCloth: Diffusion Based Garment Synthesis and Manipulation via
Structural Cross-modal Semantic Alignment
- Authors: Xujie Zhang, Binbin Yang, Michael C. Kampffmeyer, Wenqing Zhang,
Shiyue Zhang, Guansong Lu, Liang Lin, Hang Xu, Xiaodan Liang
- Abstract summary: Cross-modal garment synthesis and manipulation will significantly benefit the way fashion designers generate garments.
We introduce DiffCloth, a diffusion-based pipeline for cross-modal garment synthesis and manipulation.
Experiments on the CM-Fashion benchmark demonstrate that DiffCloth both yields state-of-the-art garment synthesis results and supports flexible manipulation.
- Score: 124.57488600605822
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Cross-modal garment synthesis and manipulation will significantly benefit the
way fashion designers generate garments and modify their designs via flexible
linguistic interfaces. Current approaches follow the general text-to-image
paradigm and mine cross-modal relations via simple cross-attention modules,
neglecting the structural correspondence between visual and textual
representations in the fashion design domain. In this work, we instead
introduce DiffCloth, a diffusion-based pipeline for cross-modal garment
synthesis and manipulation, which empowers diffusion models with flexible
compositionality in the fashion domain by structurally aligning the cross-modal
semantics. Specifically, we formulate the part-level cross-modal alignment as a
bipartite matching problem between the linguistic Attribute-Phrases (AP) and
the visual garment parts which are obtained via constituency parsing and
semantic segmentation, respectively. To mitigate the issue of attribute
confusion, we further propose a semantic-bundled cross-attention to preserve
the spatial structure similarities between the attention maps of attribute
adjectives and part nouns in each AP. Moreover, DiffCloth allows for
manipulation of the generated results by simply replacing APs in the text
prompts. The manipulation-irrelevant regions are recognized by blended masks
obtained from the bundled attention maps of the APs and kept unchanged.
Extensive experiments on the CM-Fashion benchmark demonstrate that DiffCloth
both yields state-of-the-art garment synthesis results by leveraging the
inherent structural information and supports flexible manipulation with region
consistency.
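
The part-level alignment described above reduces to a classic assignment problem. Below is a minimal sketch of that formulation, assuming the parsed APs and segmented garment parts have each been encoded into fixed-size embeddings; the function and variable names are illustrative, not from the paper's released code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_aps_to_parts(ap_embeddings, part_embeddings):
    """Match each Attribute-Phrase (AP) to one visual garment part.

    ap_embeddings:   (num_aps, dim)   text features of the parsed APs
    part_embeddings: (num_parts, dim) visual features of the segmented parts
    Returns (ap_index, part_index) pairs minimizing total cosine distance.
    """
    # Normalize rows so the dot product below is cosine similarity.
    ap = ap_embeddings / np.linalg.norm(ap_embeddings, axis=1, keepdims=True)
    parts = part_embeddings / np.linalg.norm(part_embeddings, axis=1, keepdims=True)
    cost = 1.0 - ap @ parts.T                 # cosine distance, (num_aps, num_parts)
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    return list(zip(rows.tolist(), cols.tolist()))
```

Note that linear_sum_assignment accepts rectangular cost matrices, so an unequal number of APs and parts simply leaves the surplus side unmatched.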
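The semantic-bundled cross-attention is described as keeping the attention map of each attribute adjective spatially similar to that of its part noun. Here is a hedged PyTorch sketch of one way to express this as a loss over per-token attention maps; the cosine objective is an illustrative stand-in, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def bundling_loss(attn_maps, ap_pairs):
    """attn_maps: (num_tokens, H, W) cross-attention maps of the prompt tokens.
    ap_pairs: (adjective_token_idx, noun_token_idx) per Attribute-Phrase.
    Low when each adjective's map shares its noun's spatial structure.
    """
    loss = attn_maps.new_zeros(())
    for adj_idx, noun_idx in ap_pairs:
        adj = attn_maps[adj_idx].flatten()    # spatial distribution of the adjective
        noun = attn_maps[noun_idx].flatten()  # spatial distribution of the part noun
        loss = loss + (1.0 - F.cosine_similarity(adj, noun, dim=0))
    return loss / max(len(ap_pairs), 1)
```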
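Finally, manipulation keeps manipulation-irrelevant regions fixed via blended masks derived from the bundled attention maps. Below is an illustrative latent-blending step, assuming a simple threshold on the normalized attention map; the threshold value and blending granularity are assumptions, not taken from the paper.

```python
import torch

def blend_step(z_edit, z_orig, ap_attn, threshold=0.5):
    """z_edit, z_orig: (C, H, W) latents at the current denoising timestep.
    ap_attn: (H, W) bundled attention map of the replaced Attribute-Phrase.
    """
    # Min-max normalize the attention map, then threshold it into a binary mask.
    attn = (ap_attn - ap_attn.min()) / (ap_attn.max() - ap_attn.min() + 1e-8)
    mask = (attn > threshold).to(z_edit.dtype)     # 1 inside the edited region
    # Copy everything outside the mask from the original generation.
    return mask * z_edit + (1.0 - mask) * z_orig
```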
Related papers
- Learning Spectral-Decomposed Tokens for Domain Generalized Semantic Segmentation [38.0401463751139]
We present a novel Spectral-dEcomposed Token (SET) learning framework to advance the frontier of domain generalized semantic segmentation.
Particularly, the frozen VFM features are first decomposed into the phase and amplitude components in the frequency space.
We develop an attention optimization method to bridge the gap between style-affected representation and static tokens during inference.
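For reference, the phase/amplitude decomposition mentioned here is a few lines in practice; a minimal sketch assuming per-channel 2D feature maps (the token-learning and attention-optimization parts of SET are not reproduced):

```python
import torch

def spectral_decompose(feat):
    """feat: (C, H, W) frozen VFM features.
    Returns (amplitude, phase) of the per-channel 2D Fourier spectrum.
    """
    spec = torch.fft.fft2(feat)      # complex spectrum, shape (C, H, W)
    return spec.abs(), spec.angle()  # amplitude and phase components
```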
arXiv Detail & Related papers (2024-07-26T07:50:48Z)
- Diffusion Features to Bridge Domain Gap for Semantic Segmentation [2.8616666231199424]
This paper investigates an approach that leverages sampling and fusion techniques to harness the features of diffusion models efficiently.
By leveraging the strength of the text-to-image generation capability, we introduce a new training framework designed to implicitly learn posterior knowledge from it.
arXiv Detail & Related papers (2024-06-02T15:33:46Z)
- Contextualized Diffusion Models for Text-Guided Image and Video Generation [67.69171154637172]
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing.
We propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample.
We generalize our model to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing.
arXiv Detail & Related papers (2024-02-26T15:01:16Z)
- Cross2StrA: Unpaired Cross-lingual Image Captioning with Cross-lingual Cross-modal Structure-pivoted Alignment [81.00183950655924]
Unpaired cross-lingual image captioning has long suffered from irrelevancy and disfluency issues.
In this work, we propose to address the above problems by incorporating the scene graph (SG) structures and the syntactic constituency (SC) trees.
Our captioner contains the semantic structure-guided image-to-pivot captioning and the syntactic structure-guided pivot-to-target translation.
arXiv Detail & Related papers (2023-05-20T18:30:03Z)
- Graph Adaptive Semantic Transfer for Cross-domain Sentiment Classification [68.06496970320595]
Cross-domain sentiment classification (CDSC) aims to use the transferable semantics learned from the source domain to predict the sentiment of reviews in the unlabeled target domain.
We present Graph Adaptive Semantic Transfer (GAST) model, an adaptive syntactic graph embedding method that is able to learn domain-invariant semantics from both word sequences and syntactic graphs.
arXiv Detail & Related papers (2022-05-18T07:47:01Z)
- Image Synthesis via Semantic Composition [74.68191130898805]
We present a novel approach to synthesize realistic images based on their semantic layouts.
It hypothesizes that objects with similar appearance share similar representations.
Our method establishes dependencies between regions according to their appearance correlation, yielding both spatially variant and associated representations.
arXiv Detail & Related papers (2021-09-15T02:26:07Z)
- Structured Multi-modal Feature Embedding and Alignment for Image-Sentence Retrieval [12.050958976545914]
The current state-of-the-art image-sentence retrieval methods implicitly align the visual-textual fragments.
We propose a novel Structured Multi-modal Feature Embedding and Alignment model for image-sentence retrieval.
In particular, the relations of the visual and textual fragments are modeled by constructing Visual Context-aware Structured Tree encoder (VCS-Tree) and Textual Context-aware Structured Tree encoder (TCS-Tree) with shared labels.
arXiv Detail & Related papers (2021-08-05T07:24:54Z)
- Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that can well match the description given in the natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address the challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.