A vision-based framework for human behavior understanding in industrial assembly lines
- URL: http://arxiv.org/abs/2409.17356v1
- Date: Wed, 25 Sep 2024 21:03:13 GMT
- Title: A vision-based framework for human behavior understanding in industrial assembly lines
- Authors: Konstantinos Papoutsakis, Nikolaos Bakalos, Konstantinos Fragkoulis, Athena Zacharia, Georgia Kapetadimitri, Maria Pateraki
- Abstract summary: This paper introduces a vision-based framework for capturing and understanding human behavior in industrial assembly lines.
The framework leverages advanced computer vision techniques to estimate workers' locations and 3D poses and analyze work postures, actions, and task progress.
A key contribution is the introduction of the CarDA dataset, which contains domain-relevant assembly actions captured in a realistic setting.
- Score: 0.7037008937757392
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces a vision-based framework for capturing and understanding human behavior in industrial assembly lines, focusing on car door manufacturing. The framework leverages advanced computer vision techniques to estimate workers' locations and 3D poses and analyze work postures, actions, and task progress. A key contribution is the introduction of the CarDA dataset, which contains domain-relevant assembly actions captured in a realistic setting to support the analysis of the framework for human pose and action analysis. The dataset comprises time-synchronized multi-camera RGB-D videos, motion capture data recorded in a real car manufacturing environment, and annotations for EAWS-based ergonomic risk scores and assembly activities. Experimental results demonstrate the effectiveness of the proposed approach in classifying worker postures and robust performance in monitoring assembly task progress.
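To make the ergonomic-analysis step concrete: EAWS-style posture scoring can be approximated by measuring joint angles from the estimated 3D pose and bucketing them into posture categories. The sketch below is a minimal illustration, assuming z-up world coordinates and illustrative angle thresholds (the official EAWS tables define the real cut-offs); it is not the paper's implementation.

```python
import numpy as np

def trunk_flexion_deg(hip_mid: np.ndarray, shoulder_mid: np.ndarray) -> float:
    """Angle between the trunk vector (hips -> shoulders) and vertical, in degrees."""
    trunk = shoulder_mid - hip_mid
    vertical = np.array([0.0, 0.0, 1.0])  # assumes z-up world coordinates
    cos_a = np.dot(trunk, vertical) / np.linalg.norm(trunk)
    return float(np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0))))

def posture_class(flexion_deg: float) -> str:
    """Bucket trunk flexion into coarse EAWS-style posture categories.
    The 20/60 degree cut-offs are illustrative, not the official EAWS thresholds."""
    if flexion_deg < 20.0:
        return "upright"
    if flexion_deg < 60.0:
        return "bent_forward"
    return "strongly_bent"

# Example: a mildly bent trunk estimated from two 3D joint midpoints (meters).
hips = np.array([0.0, 0.0, 1.0])
shoulders = np.array([0.0, 0.25, 1.55])
angle = trunk_flexion_deg(hips, shoulders)
print(angle, posture_class(angle))  # ~24 degrees -> "bent_forward"
```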
Related papers
- 3D Skeleton-Based Action Recognition: A Review [60.0580120274659]
3D skeleton-based action recognition has become a prominent topic in the field of computer vision. Previous reviews have predominantly adopted a model-oriented perspective, often neglecting the fundamental steps involved in skeleton-based action recognition. This review aims to address these limitations by presenting a comprehensive, task-oriented framework for understanding skeleton-based action recognition.
arXiv Detail & Related papers (2025-06-01T09:04:12Z)
- Object-Centric Action-Enhanced Representations for Robot Visuo-Motor Policy Learning [21.142247150423863]
We propose an object-centric encoder that performs semantic segmentation and visual representation generation in a coupled manner. To achieve this, we leverage the Slot Attention mechanism and use the SOLV model, pretrained on large out-of-domain datasets. We show that exploiting models pretrained on out-of-domain datasets can benefit this process, and that fine-tuning on datasets depicting human actions can significantly improve performance.
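For readers unfamiliar with the mechanism named above, the following is a minimal PyTorch sketch of Slot Attention (Locatello et al., 2020), the competition-over-patches routine that underlies SOLV; dimensions and iteration counts are illustrative, and the residual MLP of the full method is omitted. This is not the SOLV authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotAttention(nn.Module):
    """Minimal Slot Attention: slots iteratively compete for input patches."""
    def __init__(self, num_slots: int = 6, dim: int = 64, iters: int = 3):
        super().__init__()
        self.num_slots, self.dim, self.iters = num_slots, dim, iters
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_logsigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in, self.norm_slots = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_patches, dim) features from a frozen visual backbone
        b = feats.shape[0]
        feats = self.norm_in(feats)
        k, v = self.to_k(feats), self.to_v(feats)
        slots = self.slots_mu + self.slots_logsigma.exp() * torch.randn(
            b, self.num_slots, self.dim, device=feats.device)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            attn = F.softmax(q @ k.transpose(1, 2) / self.dim ** 0.5, dim=1)  # slots compete per patch
            attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)             # weighted mean over patches
            updates = attn @ v                                                # (b, num_slots, dim)
            slots = self.gru(updates.reshape(-1, self.dim),
                             slots.reshape(-1, self.dim)).view(b, self.num_slots, self.dim)
        return slots

slots = SlotAttention()(torch.randn(2, 196, 64))  # e.g. a 14x14 patch grid
print(slots.shape)  # torch.Size([2, 6, 64])
```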
arXiv Detail & Related papers (2025-05-27T09:56:52Z)
- A Multimodal Dataset for Enhancing Industrial Task Monitoring and Engagement Prediction [5.73110247142357]
We present a novel dataset that captures realistic assembly and disassembly tasks.
The dataset comprises multi-view RGB, depth, and Inertial Measurement Unit (IMU) data collected from 22 sessions, amounting to 290 minutes of untrimmed video.
Our approach improves the accuracy of recognizing engagement states, providing a robust solution for monitoring operator performance in dynamic industrial environments.
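A recurring practical step with such multi-rate recordings is aligning modalities by timestamp before training. A generic nearest-sample sketch, assuming per-sample timestamps in seconds and hypothetical rates (30 fps video vs. 100 Hz IMU); the dataset's actual synchronization protocol may differ:

```python
import numpy as np

def nearest_imu_per_frame(frame_ts: np.ndarray, imu_ts: np.ndarray, imu: np.ndarray) -> np.ndarray:
    """For each video frame timestamp, pick the temporally nearest IMU sample."""
    idx = np.searchsorted(imu_ts, frame_ts)        # insertion points into the IMU timeline
    idx = np.clip(idx, 1, len(imu_ts) - 1)
    left_closer = (frame_ts - imu_ts[idx - 1]) < (imu_ts[idx] - frame_ts)
    idx = np.where(left_closer, idx - 1, idx)
    return imu[idx]

# Hypothetical rates: 30 fps video vs. a 100 Hz accelerometer.
frame_ts = np.arange(0, 1, 1 / 30)
imu_ts = np.arange(0, 1, 1 / 100)
imu = np.random.randn(len(imu_ts), 3)              # ax, ay, az
print(nearest_imu_per_frame(frame_ts, imu_ts, imu).shape)  # (30, 3)
```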
arXiv Detail & Related papers (2025-01-10T12:57:33Z)
- Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies. Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors. We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning successfully generalize to real-world scenes.
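The recipe described above, frozen patch-wise features plus a small trainable head fit by behavior cloning, can be sketched as follows. `FrozenPatchEncoder` is a stand-in for a real VLM vision tower (e.g. a CLIP image encoder), not the authors' model, and all sizes are illustrative:

```python
import torch
import torch.nn as nn

class FrozenPatchEncoder(nn.Module):
    """Stand-in for a frozen VLM backbone emitting patch-wise features."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=32, stride=32)  # 224x224 -> 7x7 patches
        for p in self.parameters():
            p.requires_grad = False  # frozen: only the policy head is trained

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        return self.proj(img).flatten(2).transpose(1, 2)  # (B, num_patches, dim)

encoder = FrozenPatchEncoder()
policy = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 4))  # 4-D velocity command
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

img = torch.randn(8, 3, 224, 224)   # batch of observations
expert_action = torch.randn(8, 4)   # hypothetical demonstration labels
feats = encoder(img).mean(dim=1)    # mean-pool patch features
loss = nn.functional.mse_loss(policy(feats), expert_action)  # behavior cloning step
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```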
arXiv Detail & Related papers (2024-10-16T19:59:31Z)
- Observe Then Act: Asynchronous Active Vision-Action Model for Robotic Manipulation [13.736566979493613]
Our model serially connects a camera Next-Best-View (NBV) policy with a gripper Next-Best-Pose (NBP) policy, and trains them in a sensor-motor coordination framework using few-shot reinforcement learning.
This approach allows the agent to adjust a third-person camera to actively observe the environment based on the task goal, and subsequently infer the appropriate manipulation actions.
The results demonstrate that our model consistently outperforms baseline algorithms, showcasing its effectiveness in handling visual constraints in manipulation tasks.
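The serial NBV-then-NBP structure can be made concrete with stubs: observe, choose a camera pose, re-observe from it, then commit to a gripper pose. Every function below is a hypothetical placeholder for the learned few-shot RL policies, shown only to illustrate the control flow:

```python
import numpy as np

rng = np.random.default_rng(0)

def nbv_policy(obs: np.ndarray) -> np.ndarray:
    """Stub Next-Best-View policy: propose a camera pose (x, y, z, yaw)."""
    return obs[:4] + 0.1 * rng.standard_normal(4)

def nbp_policy(obs: np.ndarray) -> np.ndarray:
    """Stub Next-Best-Pose policy: propose a gripper pose from the new view."""
    return obs[4:] + 0.05 * rng.standard_normal(3)

def render_from(camera_pose: np.ndarray) -> np.ndarray:
    """Stand-in for re-observing the scene after moving the camera."""
    return np.concatenate([camera_pose, rng.standard_normal(3)])

obs = rng.standard_normal(7)
camera_pose = nbv_policy(obs)    # step 1: where to look next
obs = render_from(camera_pose)   # step 2: new observation from that viewpoint
gripper_pose = nbp_policy(obs)   # step 3: how to grasp given what was seen
print(camera_pose, gripper_pose)
```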
arXiv Detail & Related papers (2024-09-23T10:38:20Z)
- Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z)
- The Collection of a Human Robot Collaboration Dataset for Cooperative Assembly in Glovebox Environments [2.30069810310356]
Industry 4.0 introduced AI as a transformative solution for modernizing manufacturing processes. Its successor, Industry 5.0, envisions humans as collaborators and experts guiding these AI-driven manufacturing solutions.
New techniques require algorithms capable of safe, real-time identification of human positions in a scene, particularly their hands, during collaborative assembly.
This dataset provides challenging examples to build applications toward hand and glove segmentation in industrial human-robot collaboration scenarios.
arXiv Detail & Related papers (2024-07-19T19:56:53Z)
- Evaluating Robustness of Visual Representations for Object Assembly Task Requiring Spatio-Geometrical Reasoning [8.626019848533707]
This paper focuses on evaluating and benchmarking the robustness of visual representations in the context of object assembly tasks.
We employ a general framework in visuomotor policy learning that utilizes visual pretraining models as vision encoders.
Our study investigates the robustness of this framework when applied to a dual-arm manipulation setup, specifically to grasp variations.
arXiv Detail & Related papers (2023-10-15T20:41:07Z)
- Robotic Handling of Compliant Food Objects by Robust Learning from Demonstration [79.76009817889397]
We propose a robust learning policy based on Learning from Demonstration (LfD) for robotic grasping of compliant food objects.
We present an LfD learning policy that automatically removes inconsistent demonstrations and estimates the teacher's intended policy.
The proposed approach has a vast range of potential applications in the aforementioned industry sectors.
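One simple way to realize "remove inconsistent demonstrations, then estimate the intended policy" is a robust outlier test against the median trajectory. The sketch below is a generic version of that idea, not the paper's exact estimator, and assumes time-aligned demonstrations:

```python
import numpy as np

def filter_inconsistent_demos(demos: np.ndarray, k: float = 2.5) -> np.ndarray:
    """demos: (n_demos, T, dof) time-aligned demonstrations.
    Drop demos far from the pointwise median trajectory (median + k * MAD test),
    then return the mean of the survivors as the estimated intended policy."""
    median_traj = np.median(demos, axis=0)                    # (T, dof)
    dist = np.linalg.norm(demos - median_traj, axis=(1, 2))   # per-demo deviation
    mad = np.median(np.abs(dist - np.median(dist))) + 1e-9
    keep = dist <= np.median(dist) + k * mad                  # robust outlier test
    return demos[keep].mean(axis=0)

rng = np.random.default_rng(1)
good = rng.standard_normal((8, 50, 7)) * 0.05 + np.linspace(0, 1, 50)[None, :, None]
bad = good[:2] + 3.0                                          # two inconsistent demos
policy = filter_inconsistent_demos(np.concatenate([good, bad]))
print(policy.shape)  # (50, 7): the outliers are excluded from the estimate
```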
arXiv Detail & Related papers (2023-09-22T13:30:26Z)
- ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP).
ALP incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective.
We show that ALP outperforms existing baselines in several downstream perception tasks.
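The inverse-dynamics objective mentioned above can be sketched in a few lines: embed two consecutive observations and predict the action taken between them, so the representation encodes action-relevant structure. The RL term that ALP combines with this is omitted, and all sizes are illustrative:

```python
import torch
import torch.nn as nn

num_actions = 6  # hypothetical discrete action space
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256), nn.ReLU())
inv_dyn = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, num_actions))
opt = torch.optim.Adam(list(encoder.parameters()) + list(inv_dyn.parameters()), lr=3e-4)

obs_t = torch.randn(16, 3, 64, 64)              # observation at time t
obs_t1 = torch.randn(16, 3, 64, 64)             # observation at time t+1
action = torch.randint(0, num_actions, (16,))   # action the agent took in between

# Predict the action from the pair of embeddings (inverse dynamics).
logits = inv_dyn(torch.cat([encoder(obs_t), encoder(obs_t1)], dim=-1))
loss = nn.functional.cross_entropy(logits, action)
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```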
arXiv Detail & Related papers (2023-06-16T21:51:04Z)
- ATTACH Dataset: Annotated Two-Handed Assembly Actions for Human Action Understanding [8.923830513183882]
We present the ATTACH dataset, which contains 51.6 hours of assembly with 95.2k annotated fine-grained actions monitored by three cameras.
In the ATTACH dataset, more than 68% of annotations overlap with other annotations, a far higher proportion than in related datasets.
We report the performance of state-of-the-art methods for action recognition as well as action detection on video and skeleton-sequence inputs.
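The 68% figure has a direct operational reading: the fraction of annotated action intervals that temporally overlap at least one other interval. A minimal sketch with hypothetical annotations; two-handed work naturally yields concurrent left-hand and right-hand actions:

```python
def overlap_fraction(intervals: list[tuple[float, float]]) -> float:
    """Fraction of annotations that temporally overlap at least one other annotation."""
    overlapping = 0
    for i, (s1, e1) in enumerate(intervals):
        # Two intervals overlap iff each starts before the other ends.
        if any(s1 < e2 and s2 < e1 for j, (s2, e2) in enumerate(intervals) if i != j):
            overlapping += 1
    return overlapping / len(intervals)

# Hypothetical fine-grained action annotations (start, end) in seconds.
anns = [(0.0, 2.0), (1.5, 3.0), (2.8, 4.0), (10.0, 11.0)]
print(overlap_fraction(anns))  # 0.75: three of four annotations overlap another
```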
arXiv Detail & Related papers (2023-04-17T12:31:24Z)
- Object-Centric Scene Representations using Active Inference [4.298360054690217]
Representing a scene and its constituent objects from raw sensory data is a core ability for enabling robots to interact with their environment.
We propose a novel approach for scene understanding, leveraging a hierarchical object-centric generative model that enables an agent to infer object category.
For evaluating the behavior of an active vision agent, we also propose a new benchmark where, given a target viewpoint of a particular object, the agent needs to find the best matching viewpoint.
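The benchmark's matching criterion can be pictured as nearest-neighbor search over candidate viewpoints in an embedding space. A sketch assuming precomputed view features; the hierarchical generative model that would produce such features is out of scope here:

```python
import numpy as np

def best_matching_viewpoint(target_feat: np.ndarray, candidate_feats: np.ndarray) -> int:
    """Pick the candidate view whose embedding is most cosine-similar to the target's."""
    t = target_feat / np.linalg.norm(target_feat)
    c = candidate_feats / np.linalg.norm(candidate_feats, axis=1, keepdims=True)
    return int(np.argmax(c @ t))

rng = np.random.default_rng(2)
target = rng.standard_normal(128)                       # embedding of the target viewpoint
candidates = rng.standard_normal((20, 128))             # embeddings of reachable viewpoints
candidates[7] = target + 0.05 * rng.standard_normal(128)  # plant a near-match
print(best_matching_viewpoint(target, candidates))      # 7
```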
arXiv Detail & Related papers (2023-02-07T06:45:19Z)
- Self-supervised Video Object Segmentation by Motion Grouping [79.13206959575228]
We develop a computer vision system able to segment objects by exploiting motion cues.
We introduce a simple variant of the Transformer to segment optical flow frames into primary objects and the background.
We evaluate the proposed architecture on public benchmarks (DAVIS2016, SegTrackv2, and FBMS59).
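The underlying motion cue, that pixels whose flow deviates from the dominant background motion likely belong to an object, can be shown with a simple heuristic. The paper replaces this heuristic with a small Transformer that groups flow into object and background segments; the sketch below illustrates only the cue itself:

```python
import numpy as np

def motion_mask(flow: np.ndarray, rel_thresh: float = 0.5) -> np.ndarray:
    """Segment 'moving object' pixels from a dense optical flow field (H, W, 2)
    by thresholding motion relative to the dominant (background/camera) motion."""
    background = np.median(flow.reshape(-1, 2), axis=0)     # dominant motion vector
    residual = np.linalg.norm(flow - background, axis=-1)   # object motion left over
    return residual > rel_thresh * (residual.max() + 1e-9)

# A synthetic flow field: static background, one fast-moving square.
flow = np.zeros((64, 64, 2))
flow[20:40, 20:40] = [3.0, 0.0]
mask = motion_mask(flow)
print(mask.sum())  # 400 pixels: the moving square
```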
arXiv Detail & Related papers (2021-04-15T17:59:32Z)
- TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks.
To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame.
Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art specifically designed for each of the trajectory and pose forecasting tasks.
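A visibility indicator like the one described is typically used to mask the forecasting loss so that invisible joints do not contribute gradient. A generic sketch of such a visibility-masked objective, not TRiPOD's exact loss:

```python
import torch

def masked_pose_loss(pred: torch.Tensor, target: torch.Tensor,
                     visible: torch.Tensor) -> torch.Tensor:
    """L2 pose-forecasting loss counted only on visible joints.
    pred/target: (B, T, J, 3) predicted vs. ground-truth 3D joints;
    visible: (B, T, J) binary visibility indicator per joint per frame."""
    err = ((pred - target) ** 2).sum(dim=-1)        # (B, T, J) per-joint squared error
    masked = err * visible                          # zero out invisible joints
    return masked.sum() / visible.sum().clamp(min=1)

pred = torch.randn(2, 10, 17, 3)
target = torch.randn(2, 10, 17, 3)
visible = (torch.rand(2, 10, 17) > 0.2).float()     # ~80% of joints visible
print(float(masked_pose_loss(pred, target, visible)))
```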
arXiv Detail & Related papers (2021-04-08T20:01:00Z)
- The IKEA ASM Dataset: Understanding People Assembling Furniture through Actions, Objects and Pose [108.21037046507483]
IKEA ASM is a three million frame, multi-view, furniture assembly video dataset that includes depth, atomic actions, object segmentation, and human pose.
We benchmark prominent methods for video action recognition, object segmentation and human pose estimation tasks on this challenging dataset.
The dataset enables the development of holistic methods, which integrate multi-modal and multi-view data to better perform on these tasks.
arXiv Detail & Related papers (2020-07-01T11:34:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.