VISTA: Vision Transformer enhanced by U-Net and Image Colorfulness Frame
Filtration for Automatic Retail Checkout
- URL: http://arxiv.org/abs/2204.11024v1
- Date: Sat, 23 Apr 2022 08:54:28 GMT
- Title: VISTA: Vision Transformer enhanced by U-Net and Image Colorfulness Frame
Filtration for Automatic Retail Checkout
- Authors: Md. Istiak Hossain Shihab, Nazia Tasnim, Hasib Zunair, Labiba Kanij
Rupty and Nabeel Mohammed
- Abstract summary: We propose to segment and classify individual frames from a video sequence.
The segmentation method consists of a unified single product item- and hand-segmentation followed by entropy masking.
Our best system achieves 3rd place in the AI City Challenge 2022 Track 4 with an F1 score of 0.4545.
- Score: 0.7250756081498245
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-class product counting and recognition identifies product items from
images or videos for automated retail checkout. The task is challenging due to
the real-world scenario of occlusions where product items overlap, fast
movement in the conveyor belt, large similarity in overall appearance of the
items being scanned, novel products, and the negative impact of misidentifying
items. Further, there is a domain bias between training and test sets,
specifically, the provided training dataset consists of synthetic images and
the test set videos contain foreign objects such as hands and trays. To
address these aforementioned issues, we propose to segment and classify
individual frames from a video sequence. The segmentation method consists of a
unified single product item- and hand-segmentation followed by entropy masking
to address the domain bias problem. The multi-class classification method is
based on Vision Transformers (ViT). To identify the frames with target objects,
we utilize several image processing methods and propose a custom metric to
discard frames not having any product items. Combining all these mechanisms,
our best system achieves 3rd place in the AI City Challenge 2022 Track 4 with
an F1 score of 0.4545. Code will be available at
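The abstract names two image-processing steps, entropy masking and a colorfulness-based frame filter, without spelling them out. Below is a minimal sketch of one plausible reading: local Shannon entropy to mask flat background, and the Hasler-Susstrunk (2003) colorfulness measure as a stand-in for the paper's unspecified custom metric. The function names and both thresholds are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of entropy masking and colorfulness-based frame
# filtering; thresholds and helper names are assumptions, not the
# paper's exact method.
import numpy as np
from skimage.color import rgb2gray
from skimage.filters.rank import entropy
from skimage.morphology import disk
from skimage.util import img_as_ubyte

def entropy_mask(frame_rgb: np.ndarray, radius: int = 5,
                 thresh: float = 4.0) -> np.ndarray:
    """Keep textured (likely product) pixels, drop flat background."""
    gray = img_as_ubyte(rgb2gray(frame_rgb))
    local_ent = entropy(gray, disk(radius))  # local entropy in bits
    return local_ent > thresh                # boolean foreground mask

def colorfulness(frame_rgb: np.ndarray) -> float:
    """Hasler & Susstrunk (2003) colorfulness of an RGB uint8 frame."""
    r, g, b = (frame_rgb[..., i].astype(np.float64) for i in range(3))
    rg = r - g                      # red-green opponent channel
    yb = 0.5 * (r + g) - b          # yellow-blue opponent channel
    return (np.hypot(rg.std(), yb.std())
            + 0.3 * np.hypot(rg.mean(), yb.mean()))

def keep_frame(frame_rgb: np.ndarray,
               min_colorfulness: float = 20.0) -> bool:
    """Discard near-achromatic frames (e.g. empty belt or tray).

    The threshold is hypothetical; in practice it would be tuned
    against frames known to contain no product items.
    """
    return colorfulness(frame_rgb) >= min_colorfulness

if __name__ == "__main__":
    empty = np.full((360, 640, 3), 128, dtype=np.uint8)   # gray belt
    busy = np.random.randint(0, 256, (360, 640, 3), dtype=np.uint8)
    print(keep_frame(empty), keep_frame(busy))            # False True
    print(entropy_mask(busy).mean())  # fraction kept as foreground
```

In the pipeline described above, frames that pass such a filter would then go to the unified product-and-hand segmenter and the ViT classifier.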
Related papers
- Exploring Fine-grained Retail Product Discrimination with Zero-shot Object Classification Using Vision-Language Models [50.370043676415875]
In smart retail applications, the large number of products and their frequent turnover necessitate reliable zero-shot object classification methods.
We introduce the MIMEX dataset, comprising 28 distinct product categories.
We benchmark the zero-shot object classification performance of state-of-the-art vision-language models (VLMs) on the proposed MIMEX dataset.
arXiv Detail & Related papers (2024-09-23T12:28:40Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- STOW: Discrete-Frame Segmentation and Tracking of Unseen Objects for Warehouse Picking Robots [41.017649190833076]
We propose a novel paradigm for joint segmentation and tracking in discrete frames along with a transformer module.
The experiments we conduct show that our approach significantly outperforms recent methods.
arXiv Detail & Related papers (2023-11-04T06:52:38Z)
- ZJU ReLER Submission for EPIC-KITCHEN Challenge 2023: TREK-150 Single Object Tracking [62.98078087018469]
We introduce MSDeAOT, a variant of the AOT framework that incorporates transformers at multiple feature scales.
MSDeAOT efficiently propagates object masks from previous frames to the current frame using two feature scales of 16 and 8.
As a testament to the effectiveness of our design, we achieved 1st place in the EPIC-KITCHENS TREK-150 Object Tracking Challenge.
arXiv Detail & Related papers (2023-07-05T03:50:58Z)
- Automatic Generation of Product-Image Sequence in E-commerce [46.06263129000091]
The Multi-modality Unified Image-sequence Classifier (MUIsC) is able to simultaneously detect all categories of rule violations through learning.
By Dec 2021, our AGPIS framework has generated high-standard images for about 1.5 million products and achieves a reject rate of 13.6%.
arXiv Detail & Related papers (2022-06-26T23:38:42Z)
- Tag-Based Attention Guided Bottom-Up Approach for Video Instance Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple, end-to-end trainable bottom-up approach that produces instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach.
Our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets, and has the lowest run-time among contemporary state-of-the-art methods.
arXiv Detail & Related papers (2022-04-22T15:32:46Z)
- Semi-supervised and Deep learning Frameworks for Video Classification and Key-frame Identification [1.2335698325757494]
We present two semi-supervised approaches that automatically classify scenes for content and filter frames for scene understanding tasks.
The proposed framework can be scaled to additional video data streams for automated training of perception-driven systems.
arXiv Detail & Related papers (2022-03-25T05:45:18Z)
- A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection [59.21990697929617]
Humans tend to mine objects by learning from a group of images or several frames of video since we live in a dynamic world.
Previous approaches design different networks for these similar tasks separately, and the resulting models are difficult to apply to one another.
We introduce a unified framework, termed UFO (Unified Framework for Co-Object Segmentation), to tackle these issues.
arXiv Detail & Related papers (2022-03-09T13:35:19Z)
- Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)