The Object at Hand: Automated Editing for Mixed Reality Video Guidance
from Hand-Object Interactions
- URL: http://arxiv.org/abs/2109.14744v1
- Date: Wed, 29 Sep 2021 22:24:25 GMT
- Title: The Object at Hand: Automated Editing for Mixed Reality Video Guidance
from Hand-Object Interactions
- Authors: Yao Lu, Walterio W. Mayol-Cuevas
- Abstract summary: We use egocentric vision to observe hand-object interactions in real-world tasks and automatically decompose a video into its constituent steps.
Our approach combines hand-object interaction (HOI) detection, object similarity measurement and a finite state machine (FSM) representation to automatically edit videos into steps.
- Score: 24.68535915849555
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this paper, we are concerned with the problem of how to automatically extract
the steps that compose real-life hand activities. This is a key competence
towards processing, monitoring and providing video guidance in Mixed Reality
systems. We use egocentric vision to observe hand-object interactions in
real-world tasks and automatically decompose a video into its constituent
steps. Our approach combines hand-object interaction (HOI) detection, object
similarity measurement and a finite state machine (FSM) representation to
automatically edit videos into steps. We use a combination of Convolutional
Neural Networks (CNNs) and the FSM to discover, edit cuts and merge segments
while observing real hand activities. We evaluate our algorithm quantitatively
and qualitatively on two datasets: GTEA (Li et al., 2015), and
a new dataset we introduce for Chinese Tea making. Results show our method is
able to segment hand-object interaction videos into key step segments with high
levels of precision.
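The abstract describes the pipeline only at a high level. Below is a minimal sketch of how an FSM over per-frame HOI detections could propose step cuts and merge adjacent segments by object similarity; detect_hoi and object_similarity are hypothetical placeholders standing in for the CNN components, not the authors' released code.
```python
# Illustrative sketch only: a finite state machine (FSM) over per-frame
# hand-object interaction (HOI) detections that proposes step boundaries
# and merges neighbouring segments whose objects look alike.
from typing import List, Optional, Tuple

def detect_hoi(frame) -> Optional[object]:
    """Placeholder HOI detector: returns a crop of the handled object, or None."""
    raise NotImplementedError

def object_similarity(crop_a, crop_b) -> float:
    """Placeholder CNN-feature similarity between two object crops, in [0, 1]."""
    raise NotImplementedError

def segment_video(frames, sim_threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Split a frame sequence into step segments.

    FSM states: IDLE (no interaction) and ACTIVE (hand holds an object).
    A cut is proposed when an interaction ends; the new segment is merged
    into the previous one when the two object crops are similar enough.
    """
    state, start, crop = "IDLE", 0, None
    segments = []  # each entry is [start_frame, end_frame, object_crop]

    for i, frame in enumerate(frames):
        detected = detect_hoi(frame)

        if state == "IDLE" and detected is not None:
            state, start, crop = "ACTIVE", i, detected
        elif state == "ACTIVE" and detected is None:
            state = "IDLE"
            if segments and object_similarity(crop, segments[-1][2]) >= sim_threshold:
                segments[-1][1] = i          # same object: extend the previous step
            else:
                segments.append([start, i, crop])

    if state == "ACTIVE":                    # close a segment still open at video end
        segments.append([start, len(frames), crop])

    return [(s, e) for s, e, _ in segments]
```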
Related papers
- Articulated Object Manipulation using Online Axis Estimation with SAM2-Based Tracking [59.87033229815062]
Articulated object manipulation requires precise object interaction, where the object's axis must be carefully considered.
Previous research has employed interactive perception for manipulating articulated objects, but such open-loop approaches often overlook the interaction dynamics.
We present a closed-loop pipeline integrating interactive perception with online axis estimation from segmented 3D point clouds.
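As a generic illustration of what online axis estimation from a segmented 3D point cloud can look like (an assumption-laden sketch, not the paper's SAM2-based pipeline), a prismatic joint axis can be approximated from the displacement of the segmented part between two observations:
```python
# Generic illustration (not the paper's method): estimate a prismatic
# joint axis as the dominant direction of displacement of the moving part
# between two segmented point clouds.
import numpy as np

def estimate_translation_axis(points_t0: np.ndarray, points_t1: np.ndarray) -> np.ndarray:
    """Each input is an (N, 3) array of points on the articulated part.

    The axis is taken as the mean displacement direction; a revolute joint
    would instead require fitting the rotation between the two clouds.
    """
    displacement = points_t1.mean(axis=0) - points_t0.mean(axis=0)
    norm = np.linalg.norm(displacement)
    if norm < 1e-8:
        raise ValueError("part did not move between the two observations")
    return displacement / norm
```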
arXiv Detail & Related papers (2024-09-24T17:59:56Z)
- I-MPN: Inductive Message Passing Network for Efficient Human-in-the-Loop Annotation of Mobile Eye Tracking Data [4.487146086221174]
We present a novel human-centered learning algorithm designed for automated object recognition within mobile eye-tracking settings.
Our approach seamlessly integrates an object detector with a spatial relation-aware inductive message-passing network (I-MPN), harnessing node profile information and capturing object correlations.
arXiv Detail & Related papers (2024-06-10T13:08:31Z)
- Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects [89.95728475983263]
Holistic 3D understanding of such interactions from egocentric views is important for tasks in robotics, AR/VR, action recognition and motion generation.
We design the HANDS23 challenge based on the AssemblyHands and ARCTIC datasets with carefully designed training and testing splits.
Based on the results of the top submitted methods and more recent baselines on the leaderboards, we perform a thorough analysis on 3D hand(-object) reconstruction tasks.
arXiv Detail & Related papers (2024-03-25T05:12:21Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
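A hedged sketch of the sequence-level selection idea, assuming masks are compared by IoU with their temporal neighbours (flow warping is omitted for brevity; this is not the authors' implementation):
```python
# Rough sketch under stated assumptions: pick "exemplar" frames whose
# flow-predicted masks agree with their neighbours, scoring consistency
# with plain mask IoU between consecutive frames.
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union else 0.0

def select_exemplars(masks: list, min_score: float = 0.8) -> list:
    """masks: list of boolean (H, W) arrays, one flow-predicted mask per frame.

    Returns indices of frames whose mask is consistent with both neighbours.
    """
    exemplars = []
    for t in range(1, len(masks) - 1):
        score = min(mask_iou(masks[t], masks[t - 1]),
                    mask_iou(masks[t], masks[t + 1]))
        if score >= min_score:
            exemplars.append(t)
    return exemplars
```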
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- Automatic Interaction and Activity Recognition from Videos of Human Manual Demonstrations with Application to Anomaly Detection [0.0]
This paper exploits Scene Graphs to extract key interaction features from image sequences while simultaneously capturing motion patterns and context.
The method introduces event-based automatic video segmentation and clustering, which group similar events and detect whether a monitored activity is executed correctly.
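A minimal sketch of the clustering step, assuming each detected event is summarised by a fixed-length descriptor (the feature choice and k-means clustering are illustrative assumptions, not the paper's exact method):
```python
# Minimal sketch (not the paper's implementation): group event descriptors
# extracted from a demonstration video, so that repeated occurrences of the
# same interaction fall into one cluster.
import numpy as np
from sklearn.cluster import KMeans

def cluster_events(event_features: np.ndarray, n_event_types: int) -> np.ndarray:
    """event_features: (num_events, feature_dim) descriptors, e.g. pooled
    scene-graph embeddings for each detected event. Returns one cluster id
    per event; events sharing an id are treated as the same activity type."""
    return KMeans(n_clusters=n_event_types, n_init=10).fit_predict(event_features)
```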
arXiv Detail & Related papers (2023-04-19T16:15:23Z)
- SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric Action Recognition [35.4163266882568]
We introduce Self-Supervised Learning Over Sets (SOS) to pre-train a generic Objects In Contact (OIC) representation model.
Our OIC significantly boosts the performance of multiple state-of-the-art video classification models.
arXiv Detail & Related papers (2022-04-10T23:27:19Z)
- Learning Visual Affordance Grounding from Demonstration Videos [76.46484684007706]
Affordance grounding aims to segment all possible interaction regions between people and objects from an image/video.
We propose a Hand-aided Affordance Grounding Network (HAGNet) that leverages the aided clues provided by the position and action of the hand in demonstration videos.
arXiv Detail & Related papers (2021-08-12T11:45:38Z)
- Motion Guided Attention Fusion to Recognize Interactions from Videos [40.1565059238891]
We present a dual-pathway approach for recognizing fine-grained interactions from videos.
We fuse the bottom-up features in the motion pathway with features captured from object detections to learn the temporal aspects of an action.
We show that our approach can generalize across appearance effectively and recognize actions where an actor interacts with previously unseen objects.
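A generic sketch of one way to fuse a motion pathway with object-detection features via cross-attention (illustrative only; the paper's actual fusion module may differ):
```python
# Hedged sketch (generic, not the paper's architecture): the motion stream
# attends over per-object detection features so frame-level motion cues are
# enriched with the objects the actor is interacting with.
import torch
import torch.nn as nn

class MotionObjectFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, motion_feats: torch.Tensor, object_feats: torch.Tensor) -> torch.Tensor:
        """motion_feats: (B, T, dim) per-frame motion features.
        object_feats: (B, N, dim) features of N detected objects.
        Returns motion features enriched with attended object context."""
        attended, _ = self.attn(motion_feats, object_feats, object_feats)
        return self.norm(motion_feats + attended)
```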
arXiv Detail & Related papers (2021-04-01T17:44:34Z)
- Learning Asynchronous and Sparse Human-Object Interaction in Videos [56.73059840294019]
Asynchronous-Sparse Interaction Graph Networks (ASSIGN) is able to automatically detect the structure of interaction events associated with entities in a video scene.
ASSIGN is tested on human-object interaction recognition and shows superior performance in segmenting and labeling of human sub-activities and object affordances from raw videos.
arXiv Detail & Related papers (2021-03-03T23:43:55Z)
- "What's This?" -- Learning to Segment Unknown Objects from Manipulation Sequences [27.915309216800125]
We present a novel framework for self-supervised grasped object segmentation with a robotic manipulator.
We propose a single, end-to-end trainable architecture which jointly incorporates motion cues and semantic knowledge.
Our method neither depends on any visual registration of a kinematic robot or 3D object models, nor on precise hand-eye calibration or any additional sensor data.
arXiv Detail & Related papers (2020-11-06T10:55:28Z)
- Relational Graph Learning on Visual and Kinematics Embeddings for Accurate Gesture Recognition in Robotic Surgery [84.73764603474413]
We propose a novel online approach of multi-modal graph network (i.e., MRG-Net) to dynamically integrate visual and kinematics information.
The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
arXiv Detail & Related papers (2020-11-03T11:00:10Z)