Related papers: Fine-grained activity recognition for assembly videos

URL: http://arxiv.org/abs/2012.01392v1
Date: Wed, 2 Dec 2020 18:38:17 GMT
Title: Fine-grained activity recognition for assembly videos
Authors: Jonathan D. Jones, Cathryn Cortesa, Amy Shelton, Barbara Landau, Sanjeev Khudanpur, and Gregory D. Hager
Abstract summary: We extend the fine-grained activity recognition setting to address the task of assembly action recognition in its full generality. We develop a general method for recognizing assembly actions from observation sequences, along with observation features that take advantage of a spatial assembly's special structure.
Score: 31.468641678626696
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper we address the task of recognizing assembly actions as a structure (e.g. a piece of furniture or a toy block tower) is built up from a set of primitive objects. Recognizing the full range of assembly actions requires perception at a level of spatial detail that has not been attempted in the action recognition literature to date. We extend the fine-grained activity recognition setting to address the task of assembly action recognition in its full generality by unifying assembly actions and kinematic structures within a single framework. We use this framework to develop a general method for recognizing assembly actions from observation sequences, along with observation features that take advantage of a spatial assembly's special structure. Finally, we evaluate our method empirically on two application-driven data sources: (1) An IKEA furniture-assembly dataset, and (2) A block-building dataset. On the first, our system recognizes assembly actions with an average framewise accuracy of 70% and an average normalized edit distance of 10%. On the second, which requires fine-grained geometric reasoning to distinguish between assemblies, our system attains an average normalized edit distance of 23% -- a relative improvement of 69% over prior work.
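The two metrics quoted in the abstract, framewise accuracy and normalized edit distance, can be sketched as follows. This is a minimal illustration, not the authors' evaluation code. The function names are hypothetical, and two choices are assumptions the abstract does not confirm: the edit distance is computed on segment-level sequences (consecutive repeated labels collapsed), and it is normalized by the longer of the two sequences.

```python
from itertools import groupby
from typing import Hashable, List, Sequence


def framewise_accuracy(pred: Sequence[Hashable], true: Sequence[Hashable]) -> float:
    """Fraction of frames whose predicted label matches the ground truth."""
    if len(pred) != len(true):
        raise ValueError("pred and true must have the same number of frames")
    if not true:
        raise ValueError("sequences must be non-empty")
    return sum(p == t for p, t in zip(pred, true)) / len(true)


def collapse_segments(labels: Sequence[Hashable]) -> List[Hashable]:
    """Collapse runs of identical frame labels into one label per segment."""
    return [label for label, _ in groupby(labels)]


def edit_distance(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Levenshtein distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: Sequence[Hashable], true: Sequence[Hashable]) -> float:
    """Segment-level edit distance divided by the longer segment sequence.

    Assumption: normalizing by max length is one common convention; the
    paper may normalize differently.
    """
    p, t = collapse_segments(pred), collapse_segments(true)
    denom = max(len(p), len(t))
    return edit_distance(p, t) / denom if denom else 0.0


if __name__ == "__main__":
    # Toy frame-level labels, invented purely for illustration.
    pred = ["a", "a", "c", "c"]
    true = ["a", "b", "b", "c"]
    print(framewise_accuracy(pred, true))
    print(normalized_edit_distance(pred, true))
```

The two metrics are complementary: framewise accuracy rewards getting the timing of each frame right, while the edit distance only checks whether the order of recognized actions matches, regardless of how long each action lasts.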

Related papers

AI Assisted AR Assembly: Object Recognition and Computer Vision for Augmented Reality Assisted Assembly [40.836596733334254]
We present an AI-assisted Augmented Reality assembly workflow that uses deep learning-based object recognition. For each assembly step, the system displays a bounding box around the corresponding components in the physical space, and where the component should be placed.
arXiv Detail & Related papers (2025-11-07T16:20:53Z)
Towards Open-World Human Action Segmentation Using Graph Convolutional Networks [6.167678490008973]
Most existing learning-based methods excel in closed-world action segmentation. We propose a structured framework for detecting and segmenting unseen actions. We evaluate our framework on two challenging human-object recognition datasets.
arXiv Detail & Related papers (2025-07-01T14:00:39Z)
Two by Two: Learning Multi-Task Pairwise Objects Assembly for Generalizable Robot Manipulation [29.02679318985968]
Existing benchmarks and datasets predominantly focus on assembling geometric fragments or factory parts. We present 2BY2, a large-scale annotated dataset for daily pairwise objects assembly. We propose a two-step SE(3) pose estimation method with equivariant features for assembly constraints.
arXiv Detail & Related papers (2025-04-09T15:12:38Z)
IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments [56.85804719947]
We present IAAO, a framework that builds an explicit 3D model for intelligent agents to gain understanding of articulated objects in their environment through interaction. We first build hierarchical features and label fields for each object state using 3D Gaussian Splatting (3DGS) by distilling mask features and view-consistent labels from multi-view images. We then perform object- and part-level queries on the 3D Gaussian primitives to identify static and articulated elements, estimating global transformations and local articulation parameters along with affordances.
arXiv Detail & Related papers (2025-04-09T12:36:48Z)
Simultaneous Detection and Interaction Reasoning for Object-Centric Action Recognition [21.655278000690686]
We propose an end-to-end object-centric action recognition framework. It simultaneously performs Detection And Interaction Reasoning in one stage. We conduct experiments on two datasets, Something-Else and Ikea-Assembly.
arXiv Detail & Related papers (2024-04-18T05:06:12Z)
ASDF: Assembly State Detection Utilizing Late Fusion by Integrating 6D Pose Estimation [5.117781843071097]
In medical and industrial domains, providing guidance for assembly processes can be critical to ensure efficiency and safety. In order to enable in-situ visualization, 6D pose estimation can be leveraged to identify the correct location for an augmentation. We build upon the strengths of YOLOv8, a real-time capable object detection framework, to address the challenges of 6D pose estimation in combination with assembly state detection.
arXiv Detail & Related papers (2024-03-25T03:30:37Z)
Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually. We propose to improve phrase grounding models' ability on localizing the active objects by: learning the role of objects undergoing change and extracting them accurately from the instructions. We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z)
ATTACH Dataset: Annotated Two-Handed Assembly Actions for Human Action Understanding [8.923830513183882]
We present the ATTACH dataset, which contains 51.6 hours of assembly with 95.2k annotated fine-grained actions monitored by three cameras. In the ATTACH dataset, more than 68% of annotations overlap with other annotations, which is many times more than in related datasets. We report the performance of state-of-the-art methods for action recognition as well as action detection on video and skeleton-sequence inputs.
arXiv Detail & Related papers (2023-04-17T12:31:24Z)
Part-aware Prototypical Graph Network for One-shot Skeleton-based Action Recognition [57.86960990337986]
One-shot skeleton-based action recognition poses unique challenges in learning transferable representation from base classes to novel classes. We propose a part-aware prototypical representation for one-shot skeleton-based action recognition. We demonstrate the effectiveness of our method on two public skeleton-based action recognition datasets.
arXiv Detail & Related papers (2022-08-19T04:54:56Z)
Contrastive Object Detection Using Knowledge Graph Embeddings [72.17159795485915]
We compare the error statistics of the class embeddings learned from a one-hot approach with semantically structured embeddings from natural language processing or knowledge graphs. We propose a knowledge-embedded design for keypoint-based and transformer-based object detection architectures.
arXiv Detail & Related papers (2021-12-21T17:10:21Z)
Robust Object Detection via Instance-Level Temporal Cycle Confusion [89.1027433760578]
We study the effectiveness of auxiliary self-supervised tasks to improve the out-of-distribution generalization of object detectors. Inspired by the principle of maximum entropy, we introduce a novel self-supervised task, instance-level temporal cycle confusion (CycConf). For each object, the task is to find the most different object proposals in the adjacent frame in a video and then cycle back to itself for self-supervision.
arXiv Detail & Related papers (2021-04-16T21:35:08Z)
Unveiling the Potential of Structure-Preserving for Weakly Supervised Object Localization [71.79436685992128]
We propose a two-stage approach, termed structure-preserving activation (SPA), towards fully leveraging the structure information incorporated in convolutional features for WSOL. In the first stage, a restricted activation module (RAM) is designed to alleviate the structure-missing issue caused by the classification network. In the second stage, we propose a post-process approach, termed self-correlation map generating (SCG) module to obtain structure-preserving localization maps.
arXiv Detail & Related papers (2021-03-08T03:04:14Z)
Interactive Fusion of Multi-level Features for Compositional Activity Recognition [100.75045558068874]
We present a novel framework that accomplishes this goal by interactive fusion. We implement the framework in three steps, namely, positional-to-appearance feature extraction, semantic feature interaction, and semantic-to-positional prediction. We evaluate our approach on two action recognition datasets, Something-Something and Charades.
arXiv Detail & Related papers (2020-12-10T14:17:18Z)
SAFCAR: Structured Attention Fusion for Compositional Action Recognition [47.43959215267547]
We develop and test a novel Structured Attention Fusion (SAF) self-attention mechanism to combine information from object detections. We show that our approach recognizes novel verb-noun compositions more effectively than current state of the art systems. We validate our approach on the challenging Something-Else tasks from the Something-Something-V2 dataset.
arXiv Detail & Related papers (2020-12-03T17:45:01Z)
Object-Driven Active Mapping for More Accurate Object Pose Estimation and Robotic Grasping [5.385583891213281]
The framework is built on an object SLAM system integrated with a simultaneous multi-object pose estimation process. By combining the mapping module and the exploration strategy, an accurate object map that is compatible with robotic grasping can be generated.
arXiv Detail & Related papers (2020-12-03T09:36:55Z)

This list is automatically generated from the titles and abstracts of the papers on this site.