Related papers: Efficient Retail Video Annotation: A Robust Key Frame Generation Approach for Product and Customer Interaction Analysis

URL: http://arxiv.org/abs/2506.14854v2
Date: Thu, 19 Jun 2025 04:04:23 GMT
Title: Efficient Retail Video Annotation: A Robust Key Frame Generation Approach for Product and Customer Interaction Analysis
Authors: Varun Mannam, Zhenyu Shi
Abstract summary: We propose a deep learning-based approach that automates key-frame identification in retail videos. Our approach leads to an average of 2 times cost savings in video annotation. The automation of key-frame detection enables substantial time and effort savings in retail video labeling tasks.
Score: 1.0852294343899487
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Accurate video annotation plays a vital role in modern retail applications, including customer behavior analysis, product interaction detection, and in-store activity recognition. However, conventional annotation methods heavily rely on time-consuming manual labeling by human annotators, introducing non-robust frame selection and increasing operational costs. To address these challenges in the retail domain, we propose a deep learning-based approach that automates key-frame identification in retail videos and provides automatic annotations of products and customers. Our method leverages deep neural networks to learn discriminative features by embedding video frames and incorporating object detection-based techniques tailored for retail environments. Experimental results showcase the superiority of our approach over traditional methods, achieving accuracy comparable to human annotator labeling while enhancing the overall efficiency of retail video annotation. Remarkably, our approach leads to an average of 2 times cost savings in video annotation. By allowing human annotators to verify/adjust less than 5% of detected frames in the video dataset, while automating the annotation process for the remaining frames without reducing annotation quality, retailers can significantly reduce operational costs. The automation of key-frame detection enables substantial time and effort savings in retail video labeling tasks, proving highly valuable for diverse retail applications such as shopper journey analysis, product interaction detection, and in-store security monitoring.
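The abstract describes selecting key frames from frame embeddings but gives no algorithmic detail. The following is a minimal sketch of one generic way such a selection could work, assuming each frame has already been mapped to an embedding vector: a frame is kept as a key frame when its cosine distance from the last kept frame exceeds a threshold. The function names, the threshold value, and the toy embeddings are all hypothetical and are not taken from the paper; this is not the authors' method.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def select_key_frames(
    embeddings: Sequence[Sequence[float]], threshold: float = 0.3
) -> List[int]:
    """Pick frame indices whose embedding differs enough from the last key frame.

    The threshold of 0.3 is an illustrative assumption, not a value from the paper.
    """
    if not embeddings:
        return []
    key_frames = [0]  # always keep the first frame as a reference
    for i in range(1, len(embeddings)):
        last_key = embeddings[key_frames[-1]]
        if cosine_distance(last_key, embeddings[i]) > threshold:
            key_frames.append(i)
    return key_frames


# Toy 2-D embeddings standing in for real frame features (hypothetical data).
frames = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99], [1.0, 0.0]]
print(select_key_frames(frames))
```

In a pipeline like the one the abstract outlines, only the selected frames would then be passed to an object detector and, where needed, to a human annotator for verification.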

Related papers

SAM2Auto: Auto Annotation Using FLASH [13.638155035372835]
Vision-Language Models (VLMs) lag behind Large Language Models due to the scarcity of annotated datasets. We introduce SAM2Auto, the first fully automated annotation pipeline for video datasets requiring no human intervention or dataset-specific training. Our system employs statistical approaches to minimize detection errors while ensuring consistent object tracking throughout entire video sequences.
arXiv Detail & Related papers (2025-06-09T15:15:15Z)
Understanding Long Videos via LLM-Powered Entity Relation Graphs [51.13422967711056]
GraphVideoAgent is a framework that maps and monitors the evolving relationships between visual entities throughout the video sequence. Our approach demonstrates remarkable effectiveness when tested against industry benchmarks.
arXiv Detail & Related papers (2025-01-27T10:57:24Z)
A Comprehensive Review of Few-shot Action Recognition [64.47305887411275]
Few-shot action recognition aims to address the high cost and impracticality of manually labeling complex and variable video data. It requires accurately classifying human actions in videos using only a few labeled examples per class. Numerous approaches have driven significant advancements in few-shot action recognition.
arXiv Detail & Related papers (2024-07-20T03:53:32Z)
Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals. Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars. Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
Retail store customer behavior analysis system: Design and Implementation [2.215731214298625]
We propose a framework that includes three primary parts: mathematical modeling of customer behaviors, behavior analysis using an efficient deep learning based system, and individual and group behavior visualization. Each module and the entire system were validated using data from actual situations in a retail store.
arXiv Detail & Related papers (2023-09-05T06:26:57Z)
A Hybrid Statistical-Machine Learning Approach for Analysing Online Customer Behavior: An Empirical Study [2.126171264016785]
We develop a hybrid interpretable model to analyse 454,897 online customers' behavior for a particular product category at the largest online retailer in China, that is JD. Our results reveal that customers' product choice is insensitive to the promised delivery time, but this factor significantly impacts customers' order quantity. We identify product classes for which certain discounting approaches are more effective and provide recommendations on better use of different discounting tools.
arXiv Detail & Related papers (2022-12-01T19:37:29Z)
Detecting Disengagement in Virtual Learning as an Anomaly [4.706263507340607]
Student engagement is an important factor in meeting the goals of virtual learning programs. In this paper, we formulate detecting disengagement in virtual learning as an anomaly detection problem. We design various autoencoders, including a temporal convolutional network autoencoder and a long short-term memory autoencoder.
arXiv Detail & Related papers (2022-11-13T10:29:25Z)
Video Annotation for Visual Tracking via Selection and Refinement [74.08109740917122]
We present a new framework to facilitate bounding box annotations for video sequences. A temporal assessment network is proposed which is able to capture the temporal coherence of target locations. A visual-geometry refinement network is also designed to further enhance the selected tracking results.
arXiv Detail & Related papers (2021-08-09T05:56:47Z)
Joint Inductive and Transductive Learning for Video Object Segmentation [107.32760625159301]
Semi-supervised object segmentation is a task of segmenting the target object in a video sequence given only a mask in the first frame. Most previous best-performing methods adopt matching-based transductive reasoning or online inductive learning. We propose to integrate transductive and inductive learning into a unified framework to exploit complement between them for accurate and robust video object segmentation.
arXiv Detail & Related papers (2021-08-08T16:25:48Z)
MIST: Multiple Instance Self-Training Framework for Video Anomaly Detection [76.80153360498797]
We develop a multiple instance self-training framework (MIST) to efficiently refine task-specific discriminative representations. MIST is composed of 1) a multiple instance pseudo label generator, which adapts a sparse continuous sampling strategy to produce more reliable clip-level pseudo labels, and 2) a self-guided attention boosted feature encoder. Our method performs comparably to or even better than existing supervised and weakly supervised methods, specifically obtaining a frame-level AUC 94.83% on ShanghaiTech.
arXiv Detail & Related papers (2021-04-04T15:47:14Z)
OPAM: Online Purchasing-behavior Analysis using Machine learning [0.8121462458089141]
We present a customer purchasing behavior analysis system using supervised, unsupervised and semi-supervised learning methods. The proposed system analyzes session and user-journey level purchasing behaviors to identify customer categories/clusters.
arXiv Detail & Related papers (2021-02-02T17:29:52Z)
Self-trained Deep Ordinal Regression for End-to-End Video Anomaly Detection [114.9714355807607]
We show that applying self-trained deep ordinal regression to video anomaly detection overcomes two key limitations of existing methods. We devise an end-to-end trainable video anomaly detection approach that enables joint representation learning and anomaly scoring without manually labeled normal/abnormal data.
arXiv Detail & Related papers (2020-03-15T08:44:55Z)

This list is automatically generated from the titles and abstracts of the papers on this site.