Unitail: Detecting, Reading, and Matching in Retail Scene
- URL: http://arxiv.org/abs/2204.00298v1
- Date: Fri, 1 Apr 2022 09:06:48 GMT
- Title: Unitail: Detecting, Reading, and Matching in Retail Scene
- Authors: Fangyi Chen, Han Zhang, Zaiwang Li, Jiachen Dou, Shentong Mo, Hao
Chen, Yongxin Zhang, Uzair Ahmed, Chenchen Zhu, Marios Savvides
- Abstract summary: We introduce the United Retail dataset, a benchmark of basic visual tasks on products.
With 1.8M quadrilateral-shaped instances, Unitail offers a detection dataset whose annotations better align with product appearance.
It also provides a gallery-style OCR dataset containing 1454 product categories, 30k text regions, and 21k transcriptions.
- Score: 37.1516435926562
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To make full use of computer vision technology in stores, it is required to
consider the actual needs that fit the characteristics of the retail scene.
Pursuing this goal, we introduce the United Retail Datasets (Unitail), a
large-scale benchmark of basic visual tasks on products that challenges
algorithms for detecting, reading, and matching. With 1.8M quadrilateral-shaped
instances annotated, the Unitail offers a detection dataset to align product
appearance better. Furthermore, it provides a gallery-style OCR dataset
containing 1454 product categories, 30k text regions, and 21k transcriptions to
enable robust reading on products and motivate enhanced product matching.
Besides benchmarking the datasets with various state-of-the-art methods, we customize
a new detector for product detection and provide a simple OCR-based matching
solution that verifies its effectiveness.
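The abstract does not spell out how the OCR-based matching works; a minimal sketch of the general idea, matching a query product to a gallery category by comparing OCR'd text, could look like the following (the gallery structure, names, and the character-level similarity measure are all assumptions for illustration, not the paper's method):

```python
from difflib import SequenceMatcher

def text_similarity(a: str, b: str) -> float:
    """Character-level similarity between two OCR transcriptions (0..1)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_product(query_text: str, gallery: dict) -> str:
    """Return the gallery category whose transcriptions best match the query.

    gallery maps category name -> list of OCR transcriptions seen on
    that product (hypothetical structure, for illustration only).
    """
    best_category, best_score = None, -1.0
    for category, transcriptions in gallery.items():
        # Score a category by its best-matching transcription.
        score = max(text_similarity(query_text, t) for t in transcriptions)
        if score > best_score:
            best_category, best_score = category, score
    return best_category

gallery = {
    "cola-330ml": ["Cola Classic 330 ml", "COLA CLASSIC"],
    "oat-cereal": ["Crunchy Oat Cereal", "OAT CEREAL 500g"],
}
print(match_product("cola clasic 330ml", gallery))  # -> cola-330ml
```

Because the comparison is tolerant of character-level OCR noise ("clasic" still matches "Classic"), text alone can disambiguate visually similar packages.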
Related papers
- Exploring Fine-grained Retail Product Discrimination with Zero-shot Object Classification Using Vision-Language Models [50.370043676415875]
In smart retail applications, the large number of products and their frequent turnover necessitate reliable zero-shot object classification methods.
We introduce the MIMEX dataset, comprising 28 distinct product categories.
We benchmark the zero-shot object classification performance of state-of-the-art vision-language models (VLMs) on the proposed MIMEX dataset.
arXiv Detail & Related papers (2024-09-23T12:28:40Z) - Text-Based Product Matching -- Semi-Supervised Clustering Approach [9.748519919202986]
This paper aims to present a new philosophy to product matching utilizing a semi-supervised clustering approach.
We study the properties of this method by experimenting with the IDEC algorithm on the real-world dataset.
arXiv Detail & Related papers (2024-02-01T18:52:26Z) - Overview of the TREC 2023 Product Search Track [70.56592126043546]
This is the first year of the TREC Product search track.
The focus was the creation of a reusable collection.
We leverage the new product search corpus, which includes contextual metadata.
arXiv Detail & Related papers (2023-11-14T02:25:18Z) - Retail-786k: a Large-Scale Dataset for Visual Entity Matching [0.0]
This paper introduces the first publicly available large-scale dataset for "visual entity matching"
We provide a total of 786k manually annotated, high resolution product images containing 18k different individual retail products which are grouped into 3k entities.
The proposed "visual entity matching" constitutes a novel learning problem that cannot be sufficiently solved by standard image-based classification and retrieval algorithms.
arXiv Detail & Related papers (2023-09-29T11:58:26Z) - Turning a CLIP Model into a Scene Text Spotter [73.63953542526917]
We exploit the potential of the large-scale Contrastive Language-Image Pretraining (CLIP) model to enhance scene text detection and spotting tasks.
This backbone utilizes visual prompt learning and cross-attention in CLIP to extract image and text-based prior knowledge.
FastTCM-CR50 introduces an instance-language matching process to enhance the synergy between image and text embeddings.
arXiv Detail & Related papers (2023-08-21T01:25:48Z) - TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z) - Visual Information Extraction in the Wild: Practical Dataset and
End-to-end Solution [48.693941280097974]
We propose a large-scale dataset consisting of camera images for visual information extraction (VIE).
We propose a novel framework for end-to-end VIE that combines the stages of OCR and information extraction in an end-to-end learning fashion.
We evaluate existing end-to-end VIE methods on the proposed dataset and observe that their performance drops noticeably from SROIE to our dataset, owing to the larger variance in layouts and entities.
arXiv Detail & Related papers (2023-05-12T14:11:47Z) - An Improved Deep Learning Approach For Product Recognition on Racks in
Retail Stores [2.470815298095903]
Automated product recognition in retail stores is an important real-world application in the domain of Computer Vision and Pattern Recognition.
We develop a two-stage object detection and recognition pipeline comprising a Faster-RCNN-based object localizer and a ResNet-18-based image encoder.
Each model is fine-tuned on appropriate datasets for better prediction, and data augmentation is applied to each query image to build an extensive gallery set for fine-tuning the ResNet-18-based product recognition model.
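The final step of such a two-stage pipeline, matching each encoded detection crop against a gallery of reference embeddings, can be sketched with plain NumPy; the encoder itself is abstracted away, and the toy 3-D embeddings and labels below are purely illustrative stand-ins for ResNet-18 features:

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Normalize rows to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def match_crops(query_emb: np.ndarray, gallery_emb: np.ndarray,
                gallery_labels: list) -> list:
    """Assign each query embedding the label of its nearest gallery embedding.

    query_emb:   (Q, D) embeddings of detected product crops
    gallery_emb: (G, D) embeddings of reference product images
    """
    sims = l2_normalize(query_emb) @ l2_normalize(gallery_emb).T  # (Q, G)
    nearest = sims.argmax(axis=1)
    return [gallery_labels[i] for i in nearest]

# Toy embeddings standing in for ResNet-18 features (illustrative only).
gallery = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
labels = ["soda-can", "cereal-box"]
queries = np.array([[0.9, 0.1, 0.0], [0.1, 0.8, 0.2]])
print(match_crops(queries, gallery, labels))  # ['soda-can', 'cereal-box']
```

Cosine similarity over L2-normalized embeddings is a common choice here because it ignores feature magnitude and ranks gallery entries purely by direction in embedding space.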
arXiv Detail & Related papers (2022-02-26T06:51:36Z) - Tiny Object Tracking: A Large-scale Dataset and A Baseline [40.93697515531104]
We create a large-scale video dataset, which contains 434 sequences with a total of more than 217K frames.
In data creation, we take 12 challenge attributes into account to cover a broad range of viewpoints and scene complexities.
We propose a novel Multilevel Knowledge Distillation Network (MKDNet), which pursues three-level knowledge distillations in a unified framework.
arXiv Detail & Related papers (2022-02-11T15:00:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.