ShoeModel: Learning to Wear on the User-specified Shoes via Diffusion Model
- URL: http://arxiv.org/abs/2404.04833v2
- Date: Fri, 19 Jul 2024 07:39:37 GMT
- Title: ShoeModel: Learning to Wear on the User-specified Shoes via Diffusion Model
- Authors: Binghui Chen, Wenyu Li, Yifeng Geng, Xuansong Xie, Wangmeng Zuo
- Abstract summary: We propose a shoe-wearing system, called ShoeModel, to generate plausible images of human legs interacting with the given shoes.
Compared to baselines, our ShoeModel is shown to generalize better to different types of shoes and to better preserve the identity (ID) of the given shoes.
- Score: 60.60623356092564
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the development of large-scale diffusion models, Artificial Intelligence Generated Content (AIGC) techniques have recently become popular. However, how to truly make them serve our daily lives remains an open question. To this end, in this paper, we focus on employing AIGC techniques in one field of E-commerce marketing, i.e., generating hyper-realistic advertising images that display user-specified shoes worn by humans. Specifically, we propose a shoe-wearing system, called ShoeModel, to generate plausible images of human legs interacting with the given shoes. It consists of three modules: (1) a shoe wearable-area detection module (WD), (2) a leg-pose synthesis module (LpS), and (3) the final shoe-wearing image generation module (SW). The three modules are executed in ordered stages. Compared to baselines, our ShoeModel is shown to generalize better to different types of shoes, to preserve the identity of the given shoes, and to automatically produce reasonable interactions with the human body. Extensive experiments show the effectiveness of our proposed shoe-wearing system. Figure 1 shows input and output examples of our ShoeModel.
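The abstract does not come with code, but the ordered WD, LpS, SW pipeline it describes maps onto a small sketch. The Python below is a minimal, hypothetical stand-in: the function names, signatures, and stubbed bodies are assumptions chosen to illustrate the staged data flow, not the authors' implementation.

```python
from typing import Any

# Hypothetical stand-ins for the three stages; the paper publishes no API,
# so each stub only marks where the real module would run.

def wearable_area_detection(shoe_image: Any) -> Any:
    """WD: detect the region of the shoe image where a leg can plausibly appear."""
    raise NotImplementedError("stand-in for the paper's WD module")

def leg_pose_synthesis(shoe_image: Any, wearable_mask: Any) -> Any:
    """LpS: synthesize a leg pose compatible with the detected wearable area."""
    raise NotImplementedError("stand-in for the paper's LpS module")

def shoe_wearing_generation(shoe_image: Any, leg_pose: Any) -> Any:
    """SW: diffusion-based generation of the final image, conditioned on the
    original shoe so its identity (ID) is preserved."""
    raise NotImplementedError("stand-in for the paper's SW module")

def shoe_model(shoe_image: Any) -> Any:
    # The three modules run in ordered stages: WD -> LpS -> SW.
    mask = wearable_area_detection(shoe_image)
    pose = leg_pose_synthesis(shoe_image, mask)
    return shoe_wearing_generation(shoe_image, pose)
```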
Related papers
- Instruct Me More! Random Prompting for Visual In-Context Learning [30.31759752239964]
Instruct Me More (InMeMo) is a method that augments in-context pairs with a learnable perturbation (prompt) to explore its potential.
Our experiments on mainstream tasks reveal that InMeMo surpasses the current state-of-the-art performance.
Our findings suggest that InMeMo offers a versatile and efficient way to enhance the performance of visual ICL with lightweight training.
arXiv Detail & Related papers (2023-11-07T01:39:00Z)
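As the InMeMo summary above notes, the method augments the in-context (prompt) pair with a learnable perturbation while the large backbone stays frozen. Here is a minimal PyTorch sketch of that idea; the tensor shapes, the frozen_backbone call, and the training loop are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    """Pixel-space perturbation added to the in-context example only."""
    def __init__(self, channels: int = 3, size: int = 224):
        super().__init__()
        # The only trainable parameters in the whole setup.
        self.delta = nn.Parameter(torch.zeros(1, channels, size, size))

    def forward(self, prompt_pair: torch.Tensor) -> torch.Tensor:
        # Query images are left untouched; only the prompt pair is perturbed.
        return prompt_pair + self.delta

prompt = LearnablePrompt()
optimizer = torch.optim.Adam(prompt.parameters(), lr=1e-3)
# Per batch (backbone frozen, names hypothetical): feed the perturbed prompt
# pair plus the query through the visual ICL model and update only `delta`:
#   pred = frozen_backbone(prompt(prompt_pair), query)
#   loss = task_loss(pred, target); loss.backward(); optimizer.step()
```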
- Multimodal Detection of Bots on X (Twitter) using Transformers [6.390468088226495]
We propose a novel method for detecting bots in social media.
We use only the user description field and images of three channels.
Experiments conducted on the Cresci'17 and TwiBot-20 datasets demonstrate valuable advantages of our introduced approaches.
arXiv Detail & Related papers (2023-08-28T10:51:11Z)
- SUPR: A Sparse Unified Part-Based Human Representation [61.693373050670644]
We show that existing models of the head and hands fail to capture the full range of motion for these parts.
Previous body part models are trained using 3D scans that are isolated to the individual parts.
We propose a new learning scheme that jointly trains a full-body model and specific part models.
arXiv Detail & Related papers (2022-10-25T09:32:34Z)
- UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes [91.24112204588353]
We introduce UViM, a unified approach capable of modeling a wide range of computer vision tasks.
In contrast to previous models, UViM has the same functional form for all tasks.
We demonstrate the effectiveness of UViM on three diverse and challenging vision tasks.
arXiv Detail & Related papers (2022-05-20T17:47:59Z)
- ShoeRinsics: Shoeprint Prediction for Forensics with Intrinsic Decomposition [29.408442567550004]
We propose to leverage shoe tread photographs collected by online retailers.
We develop a model that performs intrinsic image decomposition from a single tread photo.
Our approach, which we term ShoeRinsics, combines domain adaptation and re-rendering losses in order to leverage a mix of fully supervised synthetic data and unsupervised retail image data.
arXiv Detail & Related papers (2022-05-04T23:42:55Z)
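The ShoeRinsics summary above names two training signals: full supervision on synthetic tread data and a re-rendering loss on unlabeled retail photos. The sketch below shows how such a combined objective could look in PyTorch; the decompose and render callables, the L1 losses, and the weighting are assumptions, and the paper's domain-adaptation term is omitted for brevity.

```python
import torch.nn.functional as F

def shoerinsics_style_loss(decompose, render,
                           synth_img, synth_albedo, synth_depth,
                           retail_img, w_rerender=1.0):
    # Supervised branch: synthetic tread photos come with ground-truth intrinsics.
    pred_albedo, pred_depth = decompose(synth_img)
    supervised = (F.l1_loss(pred_albedo, synth_albedo)
                  + F.l1_loss(pred_depth, synth_depth))

    # Unsupervised branch: decompose a retail photo, re-render the predicted
    # intrinsics, and require the result to reconstruct the input image.
    r_albedo, r_depth = decompose(retail_img)
    rerender = F.l1_loss(render(r_albedo, r_depth), retail_img)

    return supervised + w_rerender * rerender
```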
- Fashionformer: A simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition [80.74495836502919]
In this work, we focus on joint human fashion segmentation and attribute recognition.
We introduce the object query for segmentation and the attribute query for attribute prediction.
For the attribute stream, we design a novel Multi-Layer Rendering module to explore more fine-grained features.
arXiv Detail & Related papers (2022-04-10T11:11:10Z)
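The Fashionformer summary above describes two learned query sets decoded against shared image features: object queries for masks, attribute queries for attribute prediction. The following is a hedged approximation of that pattern; every dimension and module choice is an assumption, and the paper's Multi-Layer Rendering module is not reproduced.

```python
import torch
import torch.nn as nn

class TwoQueryDecoder(nn.Module):
    def __init__(self, dim=256, n_obj=100, n_attr=50, n_attr_classes=30):
        super().__init__()
        self.obj_queries = nn.Embedding(n_obj, dim)    # segmentation stream
        self.attr_queries = nn.Embedding(n_attr, dim)  # attribute stream
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.attr_head = nn.Linear(dim, n_attr_classes)

    def forward(self, feats):
        # feats: (B, HW, dim) flattened image features from a backbone.
        b = feats.size(0)
        queries = torch.cat([self.obj_queries.weight,
                             self.attr_queries.weight], dim=0)
        queries = queries.unsqueeze(0).expand(b, -1, -1)
        out = self.decoder(queries, feats)  # both streams share the decoder
        obj, attr = out.split([self.obj_queries.num_embeddings,
                               self.attr_queries.num_embeddings], dim=1)
        masks = torch.einsum("bqd,bnd->bqn", obj, feats)  # per-query mask logits
        return masks, self.attr_head(attr)

dec = TwoQueryDecoder()
masks, attrs = dec(torch.randn(2, 64 * 64, 256))  # masks: (2,100,4096), attrs: (2,50,30)
```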
- ARShoe: Real-Time Augmented Reality Shoe Try-on System on Smartphones [14.494454213703111]
This work proposes a real-time augmented reality virtual shoe try-on system for smartphones, namely ARShoe.
ARShoe adopts a novel multi-branch network to realize pose estimation and segmentation simultaneously.
For training and evaluation, we construct the very first large-scale foot benchmark with multiple virtual shoe try-on task-related labels.
arXiv Detail & Related papers (2021-08-24T03:54:45Z)
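ARShoe's simultaneous pose estimation and segmentation, as summarized above, is the classic shared-encoder, multi-head pattern. The toy PyTorch version below illustrates it; the layer sizes, keypoint count, and output conventions are assumptions, not ARShoe's published architecture.

```python
import torch
import torch.nn as nn

class MultiBranchNet(nn.Module):
    def __init__(self, n_keypoints=6):
        super().__init__()
        # Shared encoder feeding both task branches.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Branch 1: per-keypoint heatmaps for foot pose estimation.
        self.pose_head = nn.Conv2d(64, n_keypoints, 1)
        # Branch 2: binary foot/background segmentation mask.
        self.seg_head = nn.Conv2d(64, 1, 1)

    def forward(self, x):
        f = self.encoder(x)
        return self.pose_head(f), torch.sigmoid(self.seg_head(f))

net = MultiBranchNet()
heatmaps, mask = net(torch.randn(1, 3, 256, 256))  # one forward pass, two outputs
```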
- AGKD-BML: Defense Against Adversarial Attack by Attention Guided Knowledge Distillation and Bi-directional Metric Learning [61.8003954296545]
We propose a novel adversarial-training-based model built on Attention Guided Knowledge Distillation and Bi-directional Metric Learning (AGKD-BML).
Our proposed AGKD-BML model consistently outperforms the state-of-the-art approaches.
arXiv Detail & Related papers (2021-08-13T01:25:04Z)
- MERLOT: Multimodal Neural Script Knowledge Models [74.05631672657452]
We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech.
MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets.
On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%.
arXiv Detail & Related papers (2021-06-04T17:57:39Z)
- LGVTON: A Landmark Guided Approach to Virtual Try-On [4.617329011921226]
Given images of a person and a model, it generates a rendition of the person wearing the clothes of the model.
This is useful because most e-commerce websites do not usually provide images of the clothes alone.
arXiv Detail & Related papers (2020-04-01T16:49:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.