Fashionformer: A simple, Effective and Unified Baseline for Human
Fashion Segmentation and Recognition
- URL: http://arxiv.org/abs/2204.04654v1
- Date: Sun, 10 Apr 2022 11:11:10 GMT
- Title: Fashionformer: A simple, Effective and Unified Baseline for Human
Fashion Segmentation and Recognition
- Authors: Shilin Xu, Xiangtai Li, Jingbo Wang, Guangliang Cheng, Yunhai Tong,
Dacheng Tao
- Abstract summary: In this work, we focus on joint human fashion segmentation and attribute recognition.
We introduce the object query for segmentation and the attribute query for attribute prediction.
For the attribute stream, we design a novel Multi-Layer Rendering module to explore more fine-grained features.
- Score: 80.74495836502919
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human fashion understanding is an important computer vision task,
since it provides comprehensive information that can be used in real-world
applications. In this work, we focus on joint human fashion segmentation and
attribute recognition. In contrast to previous works that separately model each
task as a multi-head prediction problem, our insight is to bridge these two
tasks with one unified model via vision transformer modeling, so that each task
benefits the other. In
particular, we introduce the object query for segmentation and the attribute
query for attribute prediction. Both queries and their corresponding features
can be linked via mask prediction. Then we adopt a two-stream query learning
framework to learn decoupled query representations. For the attribute stream,
we design a novel Multi-Layer Rendering module to explore more fine-grained
features. The decoder design shares the same spirit as DETR, so we name the
proposed method Fashionformer. Extensive experiments on three human fashion
datasets including Fashionpedia, ModaNet and Deepfashion illustrate the
effectiveness of our approach. In particular, with the same backbone, our
method achieves a relative 10% improvement over previous works on a joint
metric (AP$^{\text{mask}}_{\text{IoU+F}_1}$) for both segmentation and
attribute recognition. To the best of our knowledge, ours is the first unified
end-to-end vision transformer framework for human fashion analysis. We hope
this simple yet effective method can serve as a new flexible baseline for
fashion analysis. Code will be available at
https://github.com/xushilin1/FashionFormer.
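To make the two-stream query design described in the abstract concrete, below is a minimal PyTorch-style sketch written under stated assumptions, not taken from the released code: the module and parameter names (TwoStreamQueryDecoder, mask_embed, the default query, class, and attribute counts) are illustrative, and the actual FashionFormer implementation in the repository above will differ in detail.

import torch
import torch.nn as nn

class TwoStreamQueryDecoder(nn.Module):
    """Illustrative sketch (assumption, not the authors' code) of an
    object-query stream and an attribute-query stream linked via masks."""
    def __init__(self, num_queries=100, dim=256, num_classes=46, num_attrs=294):
        super().__init__()
        # One object query and one attribute query per instance slot.
        self.object_queries = nn.Embedding(num_queries, dim)     # segmentation stream
        self.attribute_queries = nn.Embedding(num_queries, dim)  # attribute stream
        self.obj_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True), num_layers=3)
        self.attr_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True), num_layers=3)
        self.cls_head = nn.Linear(dim, num_classes)   # per-query class logits
        self.mask_embed = nn.Linear(dim, dim)         # query -> mask kernel
        self.attr_head = nn.Linear(dim, num_attrs)    # multi-label attribute logits

    def forward(self, feats, mask_feats):
        # feats: (B, HW, dim) flattened encoder features used as decoder memory
        # mask_feats: (B, dim, H, W) high-resolution features for mask prediction
        B = feats.size(0)
        obj_q = self.obj_decoder(
            self.object_queries.weight.unsqueeze(0).expand(B, -1, -1), feats)
        attr_q = self.attr_decoder(
            self.attribute_queries.weight.unsqueeze(0).expand(B, -1, -1), feats)
        # The two streams are linked through masks predicted from object queries.
        kernels = self.mask_embed(obj_q)                           # (B, Q, dim)
        masks = torch.einsum("bqc,bchw->bqhw", kernels, mask_feats)
        return self.cls_head(obj_q), masks, self.attr_head(attr_q)

Keeping the two query sets decoupled while tying them together through the predicted masks is the design point the abstract emphasizes: segmentation and attribute recognition can share one DETR-style decoder without interfering with each other.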
Related papers
- Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian attribute recognition (PAR) algorithms are mainly developed based on static images.
We propose to understand human attributes using video frames that can fully use temporal information.
arXiv Detail & Related papers (2024-04-27T14:43:32Z) - ONE-PEACE: Exploring One General Representation Model Toward Unlimited
Modalities [71.15303690248021]
We release ONE-PEACE, a highly extensible model with 4B parameters that can seamlessly align and integrate representations across vision, audio, and language modalities.
The architecture of ONE-PEACE comprises modality adapters, shared self-attention layers, and modality FFNs.
With the scaling-friendly architecture and pretraining tasks, ONE-PEACE has the potential to expand to unlimited modalities.
arXiv Detail & Related papers (2023-05-18T17:59:06Z) - Chain-of-Skills: A Configurable Model for Open-domain Question Answering [79.8644260578301]
Retrieval models are an indispensable component of real-world knowledge-intensive tasks.
Recent work focuses on customized methods, limiting model transferability and scalability.
We propose a modular retriever where individual modules correspond to key skills that can be reused across datasets.
arXiv Detail & Related papers (2023-05-04T20:19:39Z) - Exploring Effective Factors for Improving Visual In-Context Learning [56.14208975380607]
In-Context Learning (ICL) aims to understand a new task via a few demonstrations (a.k.a. a prompt) and predict on new inputs without tuning the model.
This paper shows that prompt selection and prompt fusion are two major factors that directly impact the inference performance of visual in-context learning.
We propose a simple framework prompt-SelF for visual in-context learning.
arXiv Detail & Related papers (2023-04-10T17:59:04Z) - Ered: Enhanced Text Representations with Entities and Descriptions [5.977668609935748]
External knowledge, e.g., entities and entity descriptions, can help humans understand texts.
This paper aims to explicitly include both entities and entity descriptions in the fine-tuning stage.
We conducted experiments on four knowledge-oriented tasks and two common tasks, and the results achieved new state-of-the-art on several datasets.
arXiv Detail & Related papers (2022-08-18T16:51:16Z) - Mapping the Internet: Modelling Entity Interactions in Complex
Heterogeneous Networks [0.0]
We propose a versatile, unified framework called HMill for sample representation, model definition, and training.
We show an extension of the universal approximation theorem to the set of all functions realized by models implemented in the framework.
We solve three different problems from the cybersecurity domain using the framework.
arXiv Detail & Related papers (2021-04-19T21:32:44Z) - Reviving Iterative Training with Mask Guidance for Interactive
Segmentation [8.271859911016719]
Recent works on click-based interactive segmentation have demonstrated state-of-the-art results by using various inference-time optimization schemes.
We propose a simple feedforward model for click-based interactive segmentation that employs the segmentation masks from previous steps.
We find that the models trained on a combination of COCO and LVIS with diverse and high-quality annotations show performance superior to all existing models.
arXiv Detail & Related papers (2021-02-12T15:44:31Z) - Fashionpedia: Ontology, Segmentation, and an Attribute Localization
Dataset [62.77342894987297]
We propose a novel Attribute-Mask RCNN model to jointly perform instance segmentation and localized attribute recognition.
We also demonstrate that instance segmentation models pre-trained on Fashionpedia achieve better transfer learning performance on other fashion datasets than ImageNet pre-training.
arXiv Detail & Related papers (2020-04-26T02:38:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.