Temporal-Viewpoint Transportation Plan for Skeletal Few-shot Action Recognition
- URL: http://arxiv.org/abs/2210.16820v1
- Date: Sun, 30 Oct 2022 11:46:38 GMT
- Title: Temporal-Viewpoint Transportation Plan for Skeletal Few-shot Action Recognition
- Authors: Lei Wang and Piotr Koniusz
- Abstract summary: A few-shot learning pipeline for 3D skeleton-based action recognition via Joint tEmporal and cAmera viewpoiNt alIgnmEnt (JEANIE).
- Score: 38.27785891922479
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a Few-shot Learning pipeline for 3D skeleton-based action
recognition by Joint tEmporal and cAmera viewpoiNt alIgnmEnt (JEANIE). To
factor out misalignment between query and support sequences of 3D body joints,
we propose an advanced variant of Dynamic Time Warping which jointly models
each smooth path between the query and support frames to simultaneously achieve
the best alignment in the temporal and simulated camera viewpoint spaces for
end-to-end learning under the limited few-shot training data. Sequences are
encoded with a temporal block encoder based on Simple Spectral Graph
Convolution, a lightweight linear Graph Neural Network backbone. We also
include a setting with a transformer. Finally, we propose a similarity-based
loss which encourages the alignment of sequences of the same class while
preventing the alignment of unrelated sequences. We show state-of-the-art
results on NTU-60, NTU-120, Kinetics-skeleton and UWA3D Multiview Activity II.
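The joint alignment idea from the abstract can be illustrated with a small sketch: a DTW-style dynamic program whose state includes not only the two temporal indices but also a simulated camera viewpoint, so the warping path stays smooth in both time and viewpoint. This is a minimal illustration in NumPy, not the authors' implementation; the viewpoint set is assumed to be a list of feature-rotation functions, and the exact transition rules and soft-minimum of the paper may differ.

```python
import numpy as np

def jeanie_style_distance(query, support, viewpoints):
    """DTW-like joint alignment over time and simulated viewpoints (sketch).

    query:      (Tq, D) array of per-frame features.
    support:    (Ts, D) array of per-frame features.
    viewpoints: list of functions, each mapping (Ts, D) support features
                to a rotated/transformed view (hypothetical interface).
    """
    Tq, Ts, V = len(query), len(support), len(viewpoints)

    # Per-cell cost: distance between a query frame and a rotated support frame.
    cost = np.empty((Tq, Ts, V))
    for v, rotate in enumerate(viewpoints):
        rotated = rotate(support)                       # (Ts, D)
        diff = query[:, None, :] - rotated[None, :, :]  # (Tq, Ts, D)
        cost[:, :, v] = np.linalg.norm(diff, axis=-1)

    # Accumulated cost; index 0 is the virtual start state.
    acc = np.full((Tq + 1, Ts + 1, V), np.inf)
    acc[0, 0, :] = 0.0
    for i in range(1, Tq + 1):
        for j in range(1, Ts + 1):
            for v in range(V):
                # Standard DTW temporal steps within the same viewpoint,
                # plus diagonal steps that shift to an adjacent viewpoint,
                # keeping the path smooth in both spaces.
                prev = [acc[i - 1, j, v], acc[i, j - 1, v], acc[i - 1, j - 1, v]]
                if v > 0:
                    prev.append(acc[i - 1, j - 1, v - 1])
                if v < V - 1:
                    prev.append(acc[i - 1, j - 1, v + 1])
                acc[i, j, v] = cost[i - 1, j - 1, v] + min(prev)

    # Best alignment cost over all terminal viewpoints.
    return acc[Tq, Ts, :].min()
```

Identical sequences under the identity viewpoint align along the diagonal at zero cost, while misaligned or differently posed sequences accumulate a positive distance, which is the quantity a similarity-based few-shot loss would operate on.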
Related papers
- Meet JEANIE: a Similarity Measure for 3D Skeleton Sequences via Temporal-Viewpoint Alignment [44.22075586147116]
Video sequences exhibit significant variations (undesired effects) in action speed, temporal location, and subject pose.
We propose Joint tEmporal and cAmera viewpoiNt alIgnmEnt (JEANIE) for sequence pairs.
arXiv Detail & Related papers (2024-02-07T05:47:31Z)
- Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence [56.092059713922744]
We show that using high-level contextualized features as prediction targets can achieve superior performance.
Specifically, we propose Skeleton2vec, a simple and efficient self-supervised 3D action representation learning framework.
Our proposed Skeleton2vec outperforms previous methods and achieves state-of-the-art results.
arXiv Detail & Related papers (2024-01-01T12:08:35Z)
- You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
- A Light-Weight Contrastive Approach for Aligning Human Pose Sequences [1.0152838128195467]
Training samples consist of temporal windows of frames containing 3D body points such as mocap markers or skeleton joints.
A light-weight, 3-layer encoder is trained using a contrastive loss function that encourages embedding vectors of augmented sample pairs to have cosine similarity 1, and similarity 0 with all other samples in a minibatch.
In addition to being simple, the proposed method is fast to train, making it easy to adapt to new data using different marker sets or skeletal joint layouts.
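The contrastive objective described above (cosine similarity 1 for augmented pairs, 0 for all other minibatch samples) can be sketched in a few lines. This is a minimal NumPy illustration using a mean-squared-error to the target similarity matrix; the paper's exact loss formulation may differ.

```python
import numpy as np

def cosine_alignment_loss(anchors, positives):
    """Toy contrastive loss: push the cosine similarity of each
    (anchor, its augmented positive) toward 1 and its similarity
    with every other sample in the minibatch toward 0.

    anchors, positives: (N, D) embedding matrices, row i of `positives`
    being the augmented pair of row i of `anchors`.
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    a, p = normalize(anchors), normalize(positives)
    sim = a @ p.T                  # (N, N) pairwise cosine similarities
    target = np.eye(len(sim))      # 1 on matched pairs, 0 elsewhere
    return float(np.mean((sim - target) ** 2))
```

When every anchor matches its own positive and is orthogonal to the rest of the batch, the loss is zero; mismatched pairings yield a positive penalty.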
arXiv Detail & Related papers (2023-03-07T21:35:02Z)
- Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z)
- 3D Skeleton-based Few-shot Action Recognition with JEANIE is not so Naïve [28.720272938306692]
We propose a Few-shot Learning pipeline for 3D skeleton-based action recognition by Joint tEmporal and cAmera viewpoiNt alIgnmEnt.
arXiv Detail & Related papers (2021-12-23T16:09:23Z)
- Leveraging Third-Order Features in Skeleton-Based Action Recognition [26.349722372701482]
Skeleton sequences are light-weight and compact, and thus ideal candidates for action recognition on edge devices.
Recent action recognition methods extract features from 3D joint coordinates as spatial-temporal cues, using these representations in a graph neural network for feature fusion.
We propose fusing third-order features in the form of angles into modern architectures, to robustly capture the relationships between joints and body parts.
arXiv Detail & Related papers (2021-05-04T15:23:29Z)
- Tensor Representations for Action Recognition [54.710267354274194]
Human actions in sequences are characterized by the complex interplay between spatial features and their temporal dynamics.
We propose novel tensor representations for capturing higher-order relationships between visual features for the task of action recognition.
We use higher-order tensors and so-called Eigenvalue Power Normalization (EPN), which has long been speculated to perform spectral detection of higher-order occurrences.
arXiv Detail & Related papers (2020-12-28T17:27:18Z)
- MotioNet: 3D Human Motion Reconstruction from Monocular Video with Skeleton Consistency [72.82534577726334]
We introduce MotioNet, a deep neural network that directly reconstructs the motion of a 3D human skeleton from monocular video.
Our method is the first data-driven approach that directly outputs a kinematic skeleton, which is a complete, commonly used, motion representation.
arXiv Detail & Related papers (2020-06-22T08:50:09Z)
- Skeleton Based Action Recognition using a Stacked Denoising Autoencoder with Constraints of Privileged Information [5.67220249825603]
We propose a new method to study the skeletal representation in a view of skeleton reconstruction.
Based on the concept of learning under privileged information, we integrate action categories and temporal coordinates into a stacked denoising autoencoder.
In order to mitigate the variation resulting from temporal misalignment, a new method of temporal registration is proposed.
arXiv Detail & Related papers (2020-03-12T09:56:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.