Context-Aware Sequence Alignment using 4D Skeletal Augmentation
- URL: http://arxiv.org/abs/2204.12223v1
- Date: Tue, 26 Apr 2022 10:59:29 GMT
- Title: Context-Aware Sequence Alignment using 4D Skeletal Augmentation
- Authors: Taein Kwon, Bugra Tekin, Siyu Tang, Marc Pollefeys
- Abstract summary: Temporal alignment of fine-grained human actions in videos is important for numerous applications in computer vision, robotics, and mixed reality.
We propose a novel context-aware self-supervised learning architecture to align sequences of actions.
Specifically, CASA employs self-attention and cross-attention mechanisms to incorporate the spatial and temporal context of human actions.
- Score: 67.05537307224525
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal alignment of fine-grained human actions in videos is important for
numerous applications in computer vision, robotics, and mixed reality.
State-of-the-art methods directly learn an image-based embedding space by
leveraging powerful deep convolutional neural networks. While straightforward,
their results are far from satisfactory: without additional post-processing
steps, the aligned videos exhibit severe temporal discontinuity.
The recent advancements in human body and hand pose estimation in the wild
promise new ways of addressing the task of human action alignment in videos. In
this work, based on off-the-shelf human pose estimators, we propose a novel
context-aware self-supervised learning architecture to align sequences of
actions. We name it CASA. Specifically, CASA employs self-attention and
cross-attention mechanisms to incorporate the spatial and temporal context of
human actions, which can solve the temporal discontinuity problem. Moreover, we
introduce a self-supervised learning scheme that is empowered by novel 4D
augmentation techniques for 3D skeleton representations. We systematically
evaluate the key components of our method. Our experiments on three public
datasets demonstrate CASA significantly improves phase progress and Kendall's
Tau scores over the previous state-of-the-art methods.
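The abstract does not come with code, but the titular "4D skeletal augmentation" (transforming 3D joint sequences jointly in space and time) can be illustrated with a minimal sketch. The snippet below is only an assumed, simplified instance of such an augmentation (a random global rotation plus temporal resampling); the function and parameter names (augment_skeleton_sequence, max_rot_deg, speed_range) are hypothetical and not taken from the authors' implementation.

```python
import numpy as np

def rotation_z(angle_rad):
    """3x3 rotation matrix about the vertical (z) axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def augment_skeleton_sequence(seq, max_rot_deg=30.0, speed_range=(0.8, 1.2), rng=None):
    """Hypothetical 4D augmentation of a skeleton sequence of shape (T, J, 3):
    a random spatial rotation (3D) combined with temporal resampling (the 4th
    dimension). A sketch of the idea only, not CASA's actual augmentations."""
    rng = rng or np.random.default_rng()
    T = seq.shape[0]

    # Spatial part: rotate all frames by one random yaw angle.
    angle = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg))
    rotated = seq @ rotation_z(angle).T

    # Temporal part: resample at a random playback speed, linearly
    # interpolating joint positions between neighboring frames.
    speed = rng.uniform(*speed_range)
    new_T = max(2, int(round(T / speed)))
    src = np.linspace(0.0, T - 1, new_T)   # fractional source frame indices
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, T - 1)
    w = (src - lo)[:, None, None]          # per-frame interpolation weight
    return (1.0 - w) * rotated[lo] + w * rotated[hi]

# Example: a random 60-frame, 21-joint sequence changes length but keeps shape layout.
seq = np.random.default_rng(0).standard_normal((60, 21, 3))
print(seq.shape, "->", augment_skeleton_sequence(seq).shape)
```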
Related papers
- Past Movements-Guided Motion Representation Learning for Human Motion Prediction [0.0]
We propose a self-supervised learning framework designed to enhance motion representation.
The framework consists of two stages: the network is first pretrained through self-reconstruction of past sequences, and then through guided reconstruction of future sequences based on past movements.
Our method reduces the average prediction errors by 8.8% across the Human3.6M, 3DPW, and AMASS datasets.
arXiv Detail & Related papers (2024-08-04T17:00:37Z) - Self-Explainable Affordance Learning with Embodied Caption [63.88435741872204]
We introduce Self-Explainable Affordance learning (SEA) with embodied caption.
SEA enables robots to articulate their intentions and bridges the gap between explainable vision-language captioning and visual affordance learning.
We propose a novel model to effectively combine affordance grounding with self-explanation in a simple but efficient manner.
arXiv Detail & Related papers (2024-04-08T15:22:38Z) - Spatio-Temporal Branching for Motion Prediction using Motion Increments [55.68088298632865]
Human motion prediction (HMP) has emerged as a popular research topic due to its diverse applications.
Traditional methods rely on hand-crafted features and machine learning techniques.
We propose a novel spatio-temporal branching network using incremental information for HMP.
arXiv Detail & Related papers (2023-08-02T12:04:28Z) - Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from internal models.
We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of convolutional and attention-based architectures pre-trained on Kinetics-400.
arXiv Detail & Related papers (2023-03-17T12:55:22Z) - Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are available only for the source dataset and unavailable for the target dataset during training.
We utilize a self-supervision scheme to reduce the domain shift between the two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised classification tasks (a minimal sketch of the temporal-permutation pretext task appears after this list).
arXiv Detail & Related papers (2022-07-17T07:05:39Z) - A Spatio-Temporal Multilayer Perceptron for Gesture Recognition [70.34489104710366]
We propose a spatio-temporal multilayer perceptron for gesture recognition in the context of autonomous vehicles.
An evaluation on the TCG and Drive&Act datasets is provided to showcase the promising performance of our approach.
We deploy our model to our autonomous vehicle to show its real-time capability and stable execution.
arXiv Detail & Related papers (2022-04-25T08:42:47Z) - A Kinematic Bottleneck Approach For Pose Regression of Flexible Surgical Instruments directly from Images [17.32860829016479]
We propose a self-supervised image-based method, exploiting, at training time only, the kinematic information provided by the robot.
In order to avoid introducing time-consuming manual annotations, the problem is formulated as an auto-encoder.
Validation of the method was performed on semi-synthetic, phantom and in-vivo datasets, obtained using a flexible robotized endoscope.
arXiv Detail & Related papers (2021-02-28T18:41:18Z) - A Graph Attention Spatio-temporal Convolutional Network for 3D Human Pose Estimation in Video [7.647599484103065]
We improve the learning of constraints in the human skeleton by modeling local and global spatial information via attention mechanisms.
Our approach effectively mitigates depth ambiguity and self-occlusion, generalizes to half upper body estimation, and achieves competitive performance on 2D-to-3D video pose estimation.
arXiv Detail & Related papers (2020-03-11T14:54:40Z) - Following Instructions by Imagining and Reaching Visual Goals [8.19944635961041]
We present a novel framework for learning to perform temporally extended tasks using spatial reasoning.
Our framework operates on raw pixel images, assumes no prior linguistic or perceptual knowledge, and learns via intrinsic motivation.
We validate our method in two simulated interactive 3D environments with a robot arm.
arXiv Detail & Related papers (2020-01-25T23:26:56Z)
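Since the summaries above are terse, here is a minimal, self-contained sketch of the temporal-segment permutation pretext task mentioned in the "Learning from Temporal Spatial Cubism" entry: a sequence is cut into K segments, the segments are shuffled, and a classifier is trained to predict which permutation was applied. All names (permute_segments, K) are illustrative assumptions, not the paper's code.

```python
import itertools
import numpy as np

K = 3                                                  # number of temporal segments (assumed)
PERMUTATIONS = list(itertools.permutations(range(K)))  # each index is a class label

def permute_segments(seq, rng=None):
    """Cut a sequence of shape (T, J, 3) into K temporal segments, shuffle
    them, and return (shuffled_sequence, permutation_label). A model trained
    to classify the label must learn temporal structure without action labels."""
    rng = rng or np.random.default_rng()
    label = int(rng.integers(len(PERMUTATIONS)))
    segments = np.array_split(seq, K, axis=0)
    shuffled = np.concatenate([segments[i] for i in PERMUTATIONS[label]], axis=0)
    return shuffled, label

# Example: one pretext-task sample from a random 60-frame, 25-joint sequence.
seq = np.random.default_rng(0).standard_normal((60, 25, 3))
x, y = permute_segments(seq)
print(x.shape, "permutation class:", y)
```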