A Multimodal Dataset for Enhancing Industrial Task Monitoring and Engagement Prediction
- URL: http://arxiv.org/abs/2501.05936v1
- Date: Fri, 10 Jan 2025 12:57:33 GMT
- Title: A Multimodal Dataset for Enhancing Industrial Task Monitoring and Engagement Prediction
- Authors: Naval Kishore Mehta, Arvind, Himanshu Kumar, Abeer Banerjee, Sumeet Saurav, Sanjay Singh
- Abstract summary: We present a novel dataset that captures realistic assembly and disassembly tasks.
The dataset comprises multi-view RGB, depth, and Inertial Measurement Unit (IMU) data collected from 22 sessions, amounting to 290 minutes of untrimmed video.
Our approach improves the accuracy of recognizing engagement states, providing a robust solution for monitoring operator performance in dynamic industrial environments.
- Score: 5.73110247142357
- License:
- Abstract: Detecting and interpreting operator actions, engagement, and object interactions in dynamic industrial workflows remains a significant challenge in human-robot collaboration research, especially within complex, real-world environments. Traditional unimodal methods often fall short of capturing the intricacies of these unstructured industrial settings. To address this gap, we present a novel Multimodal Industrial Activity Monitoring (MIAM) dataset that captures realistic assembly and disassembly tasks, facilitating the evaluation of key meta-tasks such as action localization, object interaction, and engagement prediction. The dataset comprises multi-view RGB, depth, and Inertial Measurement Unit (IMU) data collected from 22 sessions, amounting to 290 minutes of untrimmed video, annotated in detail for task performance and operator behavior. Its distinctiveness lies in the integration of multiple data modalities and its emphasis on real-world, untrimmed industrial workflows, which is key for advancing research in human-robot collaboration and operator monitoring. Additionally, we propose a multimodal network that fuses RGB frames, IMU data, and skeleton sequences to predict engagement levels during industrial tasks. Our approach improves the accuracy of recognizing engagement states, providing a robust solution for monitoring operator performance in dynamic industrial environments. The dataset and code can be accessed from https://github.com/navalkishoremehta95/MIAM/.
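To make the fusion idea concrete, the following is a minimal, hypothetical PyTorch sketch of a late-fusion engagement classifier that combines features from RGB frames, IMU streams, and skeleton sequences. The encoder choices, feature dimensions, and the number of engagement classes are illustrative assumptions, not the architecture reported in the paper.

```python
# Minimal late-fusion sketch (illustrative only): each modality is encoded
# separately and the pooled features are concatenated before classification.
# Encoder choices, dimensions, and the 4 engagement classes are assumptions.
import torch
import torch.nn as nn

class EngagementFusionNet(nn.Module):
    def __init__(self, rgb_dim=512, imu_channels=6, skel_joints=25, n_classes=4):
        super().__init__()
        # RGB clip features are assumed to be pre-extracted (e.g., 512-d per clip).
        self.rgb_fc = nn.Linear(rgb_dim, 256)
        # IMU: 1D convolution over time on the accelerometer + gyroscope channels.
        self.imu_enc = nn.Sequential(
            nn.Conv1d(imu_channels, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # Skeleton: per-frame joint coordinates (x, y, z) flattened, then a GRU over time.
        self.skel_gru = nn.GRU(skel_joints * 3, 128, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(256 + 64 + 128, 128),
            nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, rgb_feat, imu_seq, skel_seq):
        # rgb_feat: (B, rgb_dim); imu_seq: (B, imu_channels, T); skel_seq: (B, T, joints*3)
        r = torch.relu(self.rgb_fc(rgb_feat))
        i = self.imu_enc(imu_seq).squeeze(-1)
        _, h = self.skel_gru(skel_seq)
        fused = torch.cat([r, i, h[-1]], dim=-1)
        return self.classifier(fused)  # engagement logits

# Example with random tensors standing in for one batch of the three modalities.
model = EngagementFusionNet()
logits = model(torch.randn(2, 512), torch.randn(2, 6, 200), torch.randn(2, 200, 75))
print(logits.shape)  # torch.Size([2, 4])
```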
Related papers
- TimberVision: A Multi-Task Dataset and Framework for Log-Component Segmentation and Tracking in Autonomous Forestry Operations [2.0499240875881997]
We introduce the TimberVision dataset, consisting of more than 2k annotated RGB images containing a total of 51k trunk components.
We introduce a generic framework to fuse the components detected by our models for both tasks into unified trunk representations.
Our solution is suitable for a wide range of application scenarios and can be readily combined with other sensor modalities.
arXiv Detail & Related papers (2025-01-13T14:30:01Z)
- JEMA: A Joint Embedding Framework for Scalable Co-Learning with Multimodal Alignment [0.0]
JEMA (Joint Embedding with Multimodal Alignment) is a novel co-learning framework tailored for laser metal deposition (LMD).
We report an 8% increase in performance in multimodal settings and a 1% improvement in unimodal settings compared to supervised contrastive learning.
Our framework lays the foundation for integrating multisensor data with metadata, enabling diverse downstream tasks within the LMD domain and beyond.
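As an illustration of the kind of joint-embedding alignment this entry describes, the sketch below shows a symmetric InfoNCE-style objective that pulls paired embeddings from two modalities together. The loss form, temperature, and dimensions are assumptions for illustration, not JEMA's actual objective.

```python
# Hypothetical sketch of a cross-modal alignment objective in the spirit of
# joint-embedding co-learning: paired samples from two modalities are pulled
# together, unpaired ones pushed apart (symmetric InfoNCE). Not JEMA's exact loss.
import torch
import torch.nn.functional as F

def cross_modal_alignment_loss(z_a, z_b, temperature=0.07):
    """z_a, z_b: (B, D) embeddings of the same B samples from two modalities."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Matching pairs sit on the diagonal; treat alignment as classification both ways.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = cross_modal_alignment_loss(torch.randn(8, 128), torch.randn(8, 128))
```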
arXiv Detail & Related papers (2024-10-31T14:42:26Z)
- Unsupervised Multimodal Fusion of In-process Sensor Data for Advanced Manufacturing Process Monitoring [0.0]
This paper presents a novel approach to multimodal sensor data fusion in manufacturing processes.
We leverage contrastive learning techniques to correlate different data modalities without the need for labeled data.
Our approach facilitates downstream tasks such as process control, anomaly detection, and quality assurance.
arXiv Detail & Related papers (2024-10-29T21:52:04Z)
- IPAD: Industrial Process Anomaly Detection Dataset [71.39058003212614]
Video anomaly detection (VAD) is a challenging task aiming to recognize anomalies in video frames.
We propose a new dataset, IPAD, specifically designed for VAD in industrial scenarios.
This dataset covers 16 different industrial devices and contains over 6 hours of both synthetic and real-world video footage.
arXiv Detail & Related papers (2024-04-23T13:38:01Z)
- Egocentric RGB+Depth Action Recognition in Industry-Like Settings [50.38638300332429]
Our work focuses on recognizing actions from egocentric RGB and Depth modalities in an industry-like environment.
Our framework is based on the 3D Video SWIN Transformer to encode both RGB and Depth modalities effectively.
Our method also secured first place at the multimodal action recognition challenge at ICIAP 2023.
arXiv Detail & Related papers (2023-09-25T08:56:22Z)
- Weakly Supervised Multi-Task Representation Learning for Human Activity Analysis Using Wearables [2.398608007786179]
We propose a weakly supervised multi-output siamese network that learns to map the data into multiple representation spaces.
The representations of the data samples are positioned in the space such that data with the same semantic meaning in that aspect are located close to each other.
arXiv Detail & Related papers (2023-08-06T08:20:07Z)
- MMRNet: Improving Reliability for Multimodal Object Detection and Segmentation for Bin Picking via Multimodal Redundancy [68.7563053122698]
We propose a reliable object detection and segmentation system with MultiModal Redundancy (MMRNet).
This is the first system that introduces the concept of multimodal redundancy to address sensor failure issues during deployment.
We present a new label-free multi-modal consistency (MC) score that utilizes the output from all modalities to measure the overall system output reliability and uncertainty.
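The MC score's exact formulation is not given here; purely as an illustration of a label-free consistency measure, one could score the agreement between per-modality segmentation outputs as a mean pairwise IoU. This is an assumption for illustration, not MMRNet's actual metric.

```python
# Illustrative label-free consistency score (an assumption, not MMRNet's MC score):
# agreement between binary segmentation masks predicted from different modalities,
# measured as mean pairwise IoU. Low agreement flags unreliable or failing sensors.
from itertools import combinations
import numpy as np

def consistency_score(masks):
    """masks: list of (H, W) boolean arrays, one per modality."""
    ious = []
    for m1, m2 in combinations(masks, 2):
        inter = np.logical_and(m1, m2).sum()
        union = np.logical_or(m1, m2).sum()
        ious.append(inter / union if union > 0 else 1.0)
    return float(np.mean(ious))

rgb_mask = np.random.rand(64, 64) > 0.5
depth_mask = np.random.rand(64, 64) > 0.5
print(consistency_score([rgb_mask, depth_mask]))  # ~0.33 for independent random masks
```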
arXiv Detail & Related papers (2022-10-19T19:15:07Z)
- Towards Multimodal Multitask Scene Understanding Models for Indoor Mobile Agents [49.904531485843464]
In this paper, we discuss the main challenge: insufficient, or even no, labeled data for real-world indoor environments.
We describe MMISM (Multi-modality input Multi-task output Indoor Scene understanding Model) to tackle the above challenges.
MMISM considers RGB images as well as sparse Lidar points as inputs and 3D object detection, depth completion, human pose estimation, and semantic segmentation as output tasks.
We show that MMISM performs on par or even better than single-task models.
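As a generic illustration of the multi-modality-input, multi-task-output pattern described above, a shared fused representation can feed separate task heads. The modules, head shapes, and class counts below are placeholders, not MMISM's actual architecture.

```python
# Generic multi-input / multi-task skeleton (placeholder modules, not MMISM itself):
# RGB and LiDAR features share a fused representation that feeds per-task heads.
import torch
import torch.nn as nn

class MultiTaskSceneModel(nn.Module):
    def __init__(self, rgb_dim=256, lidar_dim=128, hidden=256):
        super().__init__()
        self.fuse = nn.Linear(rgb_dim + lidar_dim, hidden)
        # One lightweight head per output task (output shapes are illustrative).
        self.heads = nn.ModuleDict({
            "detection": nn.Linear(hidden, 7),       # e.g., 3D box parameters
            "depth": nn.Linear(hidden, 1),           # depth completion value
            "pose": nn.Linear(hidden, 17 * 3),       # 17 keypoints in 3D
            "segmentation": nn.Linear(hidden, 20),   # 20 semantic classes
        })

    def forward(self, rgb_feat, lidar_feat):
        h = torch.relu(self.fuse(torch.cat([rgb_feat, lidar_feat], dim=-1)))
        return {name: head(h) for name, head in self.heads.items()}

outs = MultiTaskSceneModel()(torch.randn(4, 256), torch.randn(4, 128))
print({k: v.shape for k, v in outs.items()})
```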
arXiv Detail & Related papers (2022-09-27T04:49:19Z)
- MetaGraspNet: A Large-Scale Benchmark Dataset for Vision-driven Robotic Grasping via Physics-based Metaverse Synthesis [78.26022688167133]
We present a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis.
The proposed dataset contains 100,000 images and 25 different object types.
We also propose a new layout-weighted performance metric alongside the dataset for evaluating object detection and segmentation performance.
arXiv Detail & Related papers (2021-12-29T17:23:24Z)
- Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks [73.63892022944198]
We present a generic perception architecture named Uni-Perceiver.
It processes a variety of modalities and tasks with unified modeling and shared parameters.
Results show that our pre-trained model without any tuning can achieve reasonable performance even on novel tasks.
arXiv Detail & Related papers (2021-12-02T18:59:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.