Main Product Detection with Graph Networks for Fashion
- URL: http://arxiv.org/abs/2201.10431v1
- Date: Tue, 25 Jan 2022 16:26:04 GMT
- Title: Main Product Detection with Graph Networks for Fashion
- Authors: Vacit Oguz Yazici, Longlong Yu, Arnau Ramisa, Luis Herranz, Joost van de Weijer
- Abstract summary: Main product detection is a crucial step of vision-based fashion product feed parsing pipelines.
We propose a model that incorporates Graph Convolutional Networks (GCN) that jointly represent all detected bounding boxes in the gallery as nodes.
- Score: 44.09686303429833
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Computer vision has established a foothold in the online fashion
retail industry. Main product detection is a crucial step of vision-based
fashion product feed parsing pipelines, focused on identifying the bounding
boxes that contain the product being sold in the gallery of images of the
product page. The current state-of-the-art approach does not leverage the
relations between regions in the image and treats images of the same product
independently, therefore not fully exploiting visual and product contextual
information. In this paper we propose a model that incorporates Graph
Convolutional Networks (GCN) that jointly represent all detected bounding
boxes in the gallery as nodes. We show that the proposed method outperforms
the state-of-the-art; in particular, when the title input is missing at
inference time and in cross-dataset evaluation, our method surpasses previous
approaches by a large margin.
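As a concrete illustration of the core idea, here is a minimal sketch (our illustration, not the authors' released code) of a GCN that treats every bounding box detected across a product gallery as a node in a fully connected graph and scores each one as the main product; the ROI feature dimension, two-layer depth, and dense adjacency are assumptions.

```python
# Minimal sketch: score gallery bounding boxes with a two-layer GCN.
# Hypothetical choices: 2048-d ROI features, fully connected adjacency.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (num_boxes, in_dim) node features; adj: (num_boxes, num_boxes).
        # Symmetric normalization D^{-1/2} A D^{-1/2} of the adjacency.
        deg = adj.sum(dim=1)
        d_inv_sqrt = deg.clamp(min=1e-6).pow(-0.5)
        adj_norm = d_inv_sqrt.unsqueeze(1) * adj * d_inv_sqrt.unsqueeze(0)
        return F.relu(self.linear(adj_norm @ x))

class MainProductGCN(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.gcn1 = GCNLayer(feat_dim, hidden_dim)
        self.gcn2 = GCNLayer(hidden_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, 1)  # main-product logit per box

    def forward(self, box_feats):
        # box_feats: ROI features of all boxes detected across the gallery.
        n = box_feats.size(0)
        adj = torch.ones(n, n)  # fully connected graph, self-loops included
        h = self.gcn2(self.gcn1(box_feats, adj), adj)
        return self.classifier(h).squeeze(-1)

# Example: 7 boxes detected across the gallery images of one product page.
logits = MainProductGCN()(torch.randn(7, 2048))
main_box = logits.argmax().item()  # index of the predicted main-product box
```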
Related papers
- Preserving Product Fidelity in Large Scale Image Recontextualization with Diffusion Models [1.8606057023042066]
We present a framework for high-fidelity product image recontextualization using text-to-image diffusion models and a novel data augmentation pipeline.
Our method improves the quality and diversity of generated images by disentangling product representations and enhancing the model's understanding of product characteristics.
arXiv Detail & Related papers (2025-03-11T01:24:39Z)
- Consistent Human Image and Video Generation with Spatially Conditioned Diffusion [82.4097906779699]
Consistent human-centric image and video synthesis aims to generate images with new poses while preserving appearance consistency with a given reference image.
We frame the task as a spatially-conditioned inpainting problem, where the target image is in-painted to maintain appearance consistency with the reference.
This approach enables the reference features to guide the generation of pose-compliant targets within a unified denoising network.
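A rough sketch of that inpainting framing follows; placing the reference and the noised target on one canvas and masking only the target half is our assumption about the setup, not code from the paper.

```python
# Sketch: frame reference-guided generation as inpainting on a joint canvas.
import torch

def build_inpainting_input(reference, target_noise):
    # reference, target_noise: (B, C, H, W); concatenate along the width axis
    # so visible reference pixels sit next to the region to be in-painted.
    canvas = torch.cat([reference, target_noise], dim=3)
    mask = torch.zeros_like(canvas[:, :1])   # 0 = keep, 1 = fill
    mask[..., reference.shape[3]:] = 1.0     # only the target half is denoised
    return canvas, mask

canvas, mask = build_inpainting_input(torch.rand(1, 3, 64, 64),
                                      torch.randn(1, 3, 64, 64))
```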
arXiv Detail & Related papers (2024-12-19T05:02:30Z)
- See then Tell: Enhancing Key Information Extraction with Vision Grounding [54.061203106565706]
We introduce STNet (See then Tell Net), a novel end-to-end model designed to deliver precise answers with relevant vision grounding.
To enhance the model's seeing capabilities, we collect extensive structured table recognition datasets.
arXiv Detail & Related papers (2024-09-29T06:21:05Z)
- Training-Free Style Consistent Image Synthesis with Condition and Mask Guidance in E-Commerce [13.67619785783182]
We introduce the concept of the QKV level, referring to modifications in the attention maps (self-attention and cross-attention) when integrating UNet with image conditions.
We use shared KV to enhance similarity in cross-attention and generate mask guidance from the attention map to cleverly direct the generation of style-consistent images.
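The shared-KV idea can be sketched as a stand-alone attention step (a simplification; the projection weights and flattened feature shapes below are placeholders, not the paper's implementation): queries come from the image being generated while keys and values come from the style reference.

```python
# Sketch: cross-attention where K/V are shared from a style reference.
import torch

def shared_kv_attention(target_feats, reference_feats, w_q, w_k, w_v):
    # target_feats: (n_t, d) features of the image being generated;
    # reference_feats: (n_r, d) features of the style reference.
    q = target_feats @ w_q
    k = reference_feats @ w_k   # keys shared from the reference
    v = reference_feats @ w_v   # values shared from the reference
    attn = torch.softmax(q @ k.T / (q.size(-1) ** 0.5), dim=-1)
    return attn @ v             # (n_t, d) reference-style-infused features

# Hypothetical usage with flattened 16x16 feature maps of dimension 320.
d = 320
w_q, w_k, w_v = (torch.randn(d, d) * d ** -0.5 for _ in range(3))
out = shared_kv_attention(torch.randn(256, d), torch.randn(256, d), w_q, w_k, w_v)
```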
arXiv Detail & Related papers (2024-09-07T07:50:13Z)
- A Multimodal Approach for Cross-Domain Image Retrieval [5.5547914920738]
Cross-Domain Image Retrieval (CDIR) is a challenging task in computer vision.
This paper introduces a novel unsupervised approach to CDIR that incorporates textual context by leveraging pre-trained vision-language models.
Our method, dubbed as Caption-Matching (CM), uses generated image captions as a domain-agnostic intermediate representation.
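A minimal sketch of caption-matching retrieval, under stated assumptions: `generate_caption` stands in for any off-the-shelf image captioner and `embed_text` for any sentence encoder; neither name comes from the paper.

```python
# Sketch: rank a gallery by similarity between generated image captions.
import numpy as np

def caption_matching_retrieval(query_image, gallery_images,
                               generate_caption, embed_text):
    # Captions act as a domain-agnostic intermediate representation: a sketch
    # and a photo of the same object should produce similar descriptions.
    q = embed_text(generate_caption(query_image))
    g = np.stack([embed_text(generate_caption(im)) for im in gallery_images])
    q = q / np.linalg.norm(q)                          # cosine similarity
    g = g / np.linalg.norm(g, axis=1, keepdims=True)
    return np.argsort(-(g @ q))                        # best matches first
```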
arXiv Detail & Related papers (2024-03-22T12:08:16Z)
- Explore In-Context Segmentation via Latent Diffusion Models [132.26274147026854]
In-context segmentation aims to segment objects using given reference images.
Most existing approaches adopt metric learning or masked image modeling to build the correlation between visual prompts and input image queries.
This work approaches the problem from a fresh perspective - unlocking the capability of the latent diffusion model for in-context segmentation.
arXiv Detail & Related papers (2024-03-14T17:52:31Z)
- Diffuse to Choose: Enriching Image Conditioned Inpainting in Latent Diffusion Models for Virtual Try-All [4.191273360964305]
"Diffuse to Choose" is a novel diffusion-based inpainting model that efficiently balances fast inference with the retention of high-fidelity details.
We conduct extensive testing on both in-house and publicly available datasets, and show that Diffuse to Choose is superior to existing zero-shot diffusion inpainting methods.
arXiv Detail & Related papers (2024-01-24T20:25:48Z)
- Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z)
- UpFusion: Novel View Diffusion from Unposed Sparse View Observations [66.36092764694502]
UpFusion can perform novel view synthesis and infer 3D representations for an object given a sparse set of reference images.
We show that this mechanism allows generating high-fidelity novel views while improving the synthesis quality given additional (unposed) images.
arXiv Detail & Related papers (2023-12-11T18:59:55Z)
- Mutual Query Network for Multi-Modal Product Image Segmentation [13.192334066413837]
We propose a mutual query network to segment products based on both visual and linguistic modalities.
To promote research in this field, we also construct a Multi-Modal Product dataset (MMPS).
The proposed method significantly outperforms the state-of-the-art methods on MMPS.
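The mutual-query idea can be sketched as bidirectional cross-attention (our simplification; the dimensions and residual fusion are assumptions, not the authors' architecture).

```python
# Sketch: vision queries language and language queries vision, symmetrically.
import torch
import torch.nn as nn

class MutualQueryBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.v2l = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.l2v = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, lang):
        # vis: (B, HW, dim) flattened image features; lang: (B, T, dim) tokens.
        vis_out, _ = self.v2l(query=vis, key=lang, value=lang)
        lang_out, _ = self.l2v(query=lang, key=vis, value=vis)
        return vis + vis_out, lang + lang_out  # residual fusion of both streams

vis, lang = MutualQueryBlock()(torch.randn(2, 196, 256), torch.randn(2, 12, 256))
```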
arXiv Detail & Related papers (2023-06-26T03:18:38Z)
- Towards Unsupervised Sketch-based Image Retrieval [126.77787336692802]
We introduce a novel framework that simultaneously performs unsupervised representation learning and sketch-photo domain alignment.
Our framework achieves excellent performance in the new unsupervised setting, and performs comparably or better than state-of-the-art in the zero-shot setting.
arXiv Detail & Related papers (2021-05-18T02:38:22Z)
- Self-supervised Human Detection and Segmentation via Multi-view Consensus [116.92405645348185]
We propose a multi-camera framework in which geometric constraints are embedded in the form of multi-view consistency during training.
We show that our approach outperforms state-of-the-art self-supervised person detection and segmentation techniques on images that visually depart from those of standard benchmarks.
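One way to embed such a constraint is a consistency loss between views, sketched below; the precomputed sampling grid (from camera calibration) and the MSE penalty are our assumptions, not the paper's exact formulation.

```python
# Sketch: penalize disagreement between per-view predictions of the same scene.
import torch
import torch.nn.functional as F

def multiview_consistency_loss(mask_v1, mask_v2, grid):
    # mask_v1, mask_v2: (B, 1, H, W) predicted masks from two cameras.
    # grid: (B, H, W, 2) in [-1, 1]; for each view-1 pixel, its corresponding
    # location in view 2, derived offline from the calibrated camera geometry.
    mask_v2_in_v1 = F.grid_sample(mask_v2, grid, align_corners=False)
    return F.mse_loss(mask_v1, mask_v2_in_v1)

loss = multiview_consistency_loss(torch.rand(1, 1, 64, 64),
                                  torch.rand(1, 1, 64, 64),
                                  torch.rand(1, 64, 64, 2) * 2 - 1)
```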
arXiv Detail & Related papers (2020-12-09T15:47:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.