Self-supervised Extraction of Human Motion Structures via Frame-wise
Discrete Features
- URL: http://arxiv.org/abs/2309.05972v1
- Date: Tue, 12 Sep 2023 05:43:13 GMT
- Title: Self-supervised Extraction of Human Motion Structures via Frame-wise
Discrete Features
- Authors: Tetsuya Abe, Ryusuke Sagawa, Ko Ayusawa, Wataru Takano
- Abstract summary: We propose an encoder-decoder model for extracting the structures of human motions represented by frame-wise discrete features in a self-supervised manner.
In our experiments, the sparse structures of motion codes were used to compile a graph that facilitates visualization of the relationship between the codes and the differences between sequences.
- Score: 2.239394800147746
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The present paper proposes an encoder-decoder model for extracting the
structures of human motions represented by frame-wise discrete features in a
self-supervised manner. In the proposed method, features are extracted as codes
in a motion codebook without the use of human knowledge, and the relationship
between these codes can be visualized on a graph. Since the codes are expected
to be temporally sparse compared to the captured frame rate and can be shared
by multiple sequences, the proposed network model also addresses the need for
training constraints. Specifically, the model consists of self-attention layers
and a vector clustering block. The attention layers contribute to finding
sparse keyframes and discrete features as motion codes, which are then
extracted by vector clustering. The constraints are realized as training losses
so that the same motion codes can be as contiguous as possible and can be
shared by multiple sequences. In addition, we propose the use of causal
self-attention as a method by which to calculate attention for long sequences
consisting of numerous frames. In our experiments, the sparse structures of
motion codes were used to compile a graph that facilitates visualization of the
relationship between the codes and the differences between sequences. We then
evaluated the effectiveness of the extracted motion codes by applying them to
multiple recognition tasks and found that performance levels comparable to
task-optimized methods could be achieved by linear probing.
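To make the described pipeline more concrete, here is a minimal sketch (not the authors' released code) of the encoder side only: causal self-attention over a pose sequence, a nearest-codeword "vector clustering" step that yields frame-wise discrete motion codes, and a toy contiguity penalty. The pose dimension, model width, codebook size, and the exact form of the loss are all assumptions; the decoder, the code-sharing loss, and keyframe selection are omitted.

```python
# Minimal sketch (not the authors' implementation): causal self-attention over
# frames, nearest-codeword quantization into a motion codebook, and a toy
# contiguity penalty. All sizes and hyperparameters below are illustrative.
import torch
import torch.nn as nn

class FrameWiseCodeEncoder(nn.Module):
    def __init__(self, pose_dim=63, d_model=128, n_codes=64, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(pose_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.codebook = nn.Parameter(torch.randn(n_codes, d_model))

    def forward(self, frames):                       # frames: (B, T, pose_dim)
        x = self.proj(frames)                        # (B, T, d_model)
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h, _ = self.attn(x, x, x, attn_mask=causal)  # causal self-attention
        # "Vector clustering": snap each frame feature to its nearest codeword.
        dist = torch.cdist(h, self.codebook.unsqueeze(0).expand(h.size(0), -1, -1))
        codes = dist.argmin(dim=-1)                  # frame-wise discrete motion codes
        quantized = self.codebook[codes]             # (B, T, d_model)
        return h, quantized, codes

def contiguity_loss(codes):
    """Toy stand-in for the contiguity constraint: penalize code changes
    between adjacent frames so the same code covers contiguous spans."""
    changes = (codes[:, 1:] != codes[:, :-1]).float()
    return changes.mean()

if __name__ == "__main__":
    model = FrameWiseCodeEncoder()
    motion = torch.randn(2, 50, 63)                  # two sequences of 50 frames
    _, _, codes = model(motion)
    print(codes.shape, contiguity_loss(codes).item())
```

In practice the contiguity and sharing constraints would need differentiable surrogates, since the hard argmin assignment above does not propagate gradients to the encoder.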
Related papers
- Associative Knowledge Graphs for Efficient Sequence Storage and Retrieval [3.355436702348694]
We create associative knowledge graphs that are highly effective for storing and recognizing sequences.
Individual objects (represented as nodes) can be a part of multiple sequences or appear repeatedly within a single sequence.
This approach has potential applications in diverse fields, such as anomaly detection in financial transactions or predicting user behavior based on past actions.
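As a loose, assumed illustration of the idea summarized above (not the paper's construction), sequences can be stored as edges of one shared graph whose nodes are reused across sequences, and recognized by intersecting the sequence ids attached to each transition:

```python
# Minimal sketch (assumed, not the paper's algorithm): store sequences as
# directed edges in one shared graph; a node can belong to many sequences.
from collections import defaultdict

class SequenceGraph:
    def __init__(self):
        # edge (a, b) -> set of sequence ids that traverse it
        self.edges = defaultdict(set)

    def store(self, seq_id, items):
        for a, b in zip(items, items[1:]):
            self.edges[(a, b)].add(seq_id)

    def recognize(self, items):
        """Return sequence ids consistent with every transition in `items`."""
        candidates = None
        for a, b in zip(items, items[1:]):
            ids = self.edges.get((a, b), set())
            candidates = ids if candidates is None else candidates & ids
        return candidates or set()

g = SequenceGraph()
g.store("s1", ["login", "browse", "pay"])
g.store("s2", ["login", "browse", "logout"])    # "login" and "browse" are shared nodes
print(g.recognize(["login", "browse", "pay"]))  # {'s1'}
```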
arXiv Detail & Related papers (2024-11-19T13:00:31Z)
- Enhancing Graph Contrastive Learning with Reliable and Informative Augmentation for Recommendation [84.45144851024257]
CoGCL aims to enhance graph contrastive learning by constructing contrastive views with stronger collaborative information via discrete codes.
We introduce a multi-level vector quantizer in an end-to-end manner to quantize user and item representations into discrete codes.
For neighborhood structure, we propose virtual neighbor augmentation by treating discrete codes as virtual neighbors.
Regarding semantic relevance, we identify similar users/items based on shared discrete codes and interaction targets to generate the semantically relevant view.
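A minimal sketch of one possible reading of "multi-level vector quantizer" (a residual-style quantizer; this interpretation, the codebook sizes, and the shapes are assumptions rather than CoGCL's implementation):

```python
# Assumed sketch of a multi-level (residual) vector quantizer: each level
# quantizes the residual left by the previous level, giving one code per level.
import torch

def multilevel_quantize(x, codebooks):
    """x: (B, D) embeddings; codebooks: list of (K, D) tensors. Returns (B, n_levels) codes."""
    codes, residual = [], x
    for cb in codebooks:
        idx = torch.cdist(residual, cb).argmin(dim=-1)   # nearest codeword per level
        codes.append(idx)
        residual = residual - cb[idx]
    return torch.stack(codes, dim=-1)

emb = torch.randn(5, 32)                                 # five user/item embeddings
books = [torch.randn(64, 32) for _ in range(3)]          # three quantization levels
print(multilevel_quantize(emb, books))                   # three discrete codes per embedding
```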
arXiv Detail & Related papers (2024-09-09T14:04:17Z)
- DiffCut: Catalyzing Zero-Shot Semantic Segmentation with Diffusion Features and Recursive Normalized Cut [62.63481844384229]
Foundation models have emerged as powerful tools across various domains including language, vision, and multimodal tasks.
In this paper, we use a diffusion UNet encoder as a foundation vision encoder and introduce DiffCut, an unsupervised zero-shot segmentation method.
Our work highlights the remarkably accurate semantic knowledge embedded within diffusion UNet encoders that could then serve as foundation vision encoders for downstream tasks.
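For intuition, a single (non-recursive) normalized-cut step over patch features might look like the sketch below; DiffCut itself builds the affinity from diffusion UNet encoder features and applies the cut recursively, so this is only an assumed simplification:

```python
# Simplified single normalized-cut bipartition over patch features
# (illustrative only; not DiffCut's recursive procedure).
import torch
import torch.nn.functional as F

def normalized_cut_bipartition(features):
    """features: (N, D) patch features. Returns a boolean partition over the N patches."""
    f = F.normalize(features, dim=-1)
    W = (f @ f.t()).clamp(min=0)                             # affinity matrix
    d = W.sum(dim=1)
    D_inv_sqrt = torch.diag(d.rsqrt())
    L_sym = torch.eye(len(d)) - D_inv_sqrt @ W @ D_inv_sqrt  # normalized Laplacian
    eigvals, eigvecs = torch.linalg.eigh(L_sym)
    fiedler = eigvecs[:, 1]                                  # second-smallest eigenvector
    return fiedler > fiedler.median()                        # bipartition of the patches

patches = torch.randn(100, 32)                               # e.g. 100 patch embeddings
print(normalized_cut_bipartition(patches).sum())
```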
arXiv Detail & Related papers (2024-06-05T01:32:31Z)
- Dynamic Perceiver for Efficient Visual Recognition [87.08210214417309]
We propose Dynamic Perceiver (Dyn-Perceiver) to decouple the feature extraction procedure and the early classification task.
A feature branch serves to extract image features, while a classification branch processes a latent code assigned for classification tasks.
Early exits are placed exclusively within the classification branch, thus eliminating the need for linear separability in low-level features.
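A hypothetical two-branch model with an early exit in the classification branch only (all layer sizes and the confidence threshold are invented for illustration; this is not the Dyn-Perceiver architecture):

```python
# Hypothetical sketch of a feature branch plus a classification branch over a
# latent code, with early exits only in the classification branch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchEarlyExit(nn.Module):
    def __init__(self, n_classes=10, d_latent=64):
        super().__init__()
        # Feature branch: extracts image features stage by stage.
        self.feat1 = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU())
        self.feat2 = nn.Sequential(nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU())
        # Classification branch: processes a latent code fed by the features.
        self.to_latent1 = nn.Linear(32, d_latent)
        self.to_latent2 = nn.Linear(64, d_latent)
        self.exit1 = nn.Linear(d_latent, n_classes)   # early exit
        self.exit2 = nn.Linear(d_latent, n_classes)   # final exit

    def forward(self, x, threshold=0.9):
        f1 = self.feat1(x)
        z = self.to_latent1(f1.mean(dim=(2, 3)))      # latent code from stage 1
        p1 = F.softmax(self.exit1(z), dim=-1)
        if p1.max(dim=-1).values.min() > threshold:   # confident enough -> exit early
            return p1
        f2 = self.feat2(f1)
        z = z + self.to_latent2(f2.mean(dim=(2, 3)))
        return F.softmax(self.exit2(z), dim=-1)

model = TwoBranchEarlyExit()
print(model(torch.randn(4, 3, 32, 32)).shape)         # torch.Size([4, 10])
```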
arXiv Detail & Related papers (2023-06-20T03:00:22Z)
- Vector Quantized Wasserstein Auto-Encoder [57.29764749855623]
We study learning deep discrete representations from the generative viewpoint.
We endow sequences of codewords with discrete distributions and learn a deterministic decoder that transports the distribution over codeword sequences to the data distribution.
We further develop theory connecting this with the clustering viewpoint of the Wasserstein (WS) distance, allowing a better and more controllable clustering solution.
arXiv Detail & Related papers (2023-02-12T13:51:36Z)
- Graph-Collaborated Auto-Encoder Hashing for Multi-view Binary Clustering [11.082316688429641]
We propose a hashing algorithm based on auto-encoders for multi-view binary clustering.
Specifically, we propose a multi-view affinity graphs learning model with low-rank constraint, which can mine the underlying geometric information from multi-view data.
We also design an encoder-decoder paradigm in which the multiple affinity graphs collaborate, so that a unified binary code can be learned effectively.
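A bare-bones, single-view sketch of autoencoder hashing with straight-through sign binarization (the multi-view affinity graphs, the low-rank constraint, and the graph collaboration are omitted; shapes and bit length are assumptions):

```python
# Bare-bones autoencoder hashing sketch (single view, illustrative only):
# encode to a low-dimensional code, binarize with sign (straight-through),
# and train the decoder to reconstruct the input from the binary code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HashAutoEncoder(nn.Module):
    def __init__(self, d_in=100, n_bits=16):
        super().__init__()
        self.encoder = nn.Linear(d_in, n_bits)
        self.decoder = nn.Linear(n_bits, d_in)

    def forward(self, x):
        h = torch.tanh(self.encoder(x))
        # Straight-through sign: binary values forward, smooth gradient backward.
        b = torch.sign(h).detach() + h - h.detach()
        return self.decoder(b), (torch.sign(h) > 0).int()   # reconstruction, binary code

x = torch.randn(8, 100)
recon, codes = HashAutoEncoder()(x)
loss = F.mse_loss(recon, x)
print(codes[0])   # one 16-bit binary code
```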
arXiv Detail & Related papers (2023-01-06T12:43:13Z)
- Semi-Structured Object Sequence Encoders [9.257633944317735]
We focus on the problem of developing a structure-aware input representation for semi-structured object sequences.
This type of data is often represented as a sequence of sets of key-value pairs over time.
We propose a two-part approach, which first considers each key independently and encodes a representation of its values over time.
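To illustrate the data shape and the "each key independently over time" idea under assumed simplifications (scalar values, a small GRU, mean pooling over keys; not the paper's model):

```python
# Illustrative sketch (assumed): a semi-structured object sequence is a list of
# key-value dicts over time; encode each key's value history independently,
# then pool the per-key representations.
import torch
import torch.nn as nn

def group_by_key(sequence):
    """[{'cpu': 0.3, 'mem': 0.7}, {'cpu': 0.5}] -> {'cpu': [0.3, 0.5], 'mem': [0.7]}"""
    per_key = {}
    for step in sequence:
        for key, value in step.items():
            per_key.setdefault(key, []).append(value)
    return per_key

class PerKeyEncoder(nn.Module):
    def __init__(self, d_hidden=16):
        super().__init__()
        self.rnn = nn.GRU(input_size=1, hidden_size=d_hidden, batch_first=True)

    def forward(self, sequence):
        per_key = group_by_key(sequence)
        reps = []
        for key, values in per_key.items():
            v = torch.tensor(values, dtype=torch.float32).view(1, -1, 1)
            _, h = self.rnn(v)                 # encode this key's values over time
            reps.append(h.squeeze(0))
        return torch.stack(reps).mean(dim=0)   # pool over keys

seq = [{"cpu": 0.3, "mem": 0.7}, {"cpu": 0.5}, {"cpu": 0.4, "mem": 0.6}]
print(PerKeyEncoder()(seq).shape)              # torch.Size([1, 16])
```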
arXiv Detail & Related papers (2023-01-03T09:19:41Z)
- Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning [44.412145665354736]
We introduce a novel contrastive action representation learning framework to learn frame-wise action representations.
Inspired by the recent progress of self-supervised learning, we present a novel sequence contrastive loss (SCL) applied on two correlated views.
Our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks.
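A generic frame-wise contrastive objective over two correlated views, assuming strictly aligned frames as positives (the paper's SCL is defined differently, so treat this only as an illustration; the temperature is an arbitrary choice):

```python
# Generic frame-wise contrastive loss over two augmented views of the same
# video (illustrative only; not the paper's exact SCL definition).
import torch
import torch.nn.functional as F

def frame_contrastive_loss(view_a, view_b, temperature=0.1):
    """view_a, view_b: (T, D) frame embeddings of two correlated views.
    Aligned frames (same index) are positives, all other frames negatives."""
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / temperature       # (T, T) similarity matrix
    targets = torch.arange(a.size(0))      # frame i in view_a matches frame i in view_b
    return F.cross_entropy(logits, targets)

T, D = 50, 128
emb_a, emb_b = torch.randn(T, D), torch.randn(T, D)
print(frame_contrastive_loss(emb_a, emb_b).item())
```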
arXiv Detail & Related papers (2022-03-28T17:59:54Z)
- Correlation-Aware Deep Tracking [83.51092789908677]
We propose a novel target-dependent feature network inspired by the self-/cross-attention scheme.
Our network deeply embeds cross-image feature correlation in multiple layers of the feature network.
Our model can be flexibly pre-trained on abundant unpaired images, leading to notably faster convergence than the existing methods.
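A tiny cross-attention sketch of how search-region features can be made target-dependent by attending to template features (token counts and dimensions are assumptions; this is not the paper's network):

```python
# Assumed simplification: search-region tokens attend to template tokens so
# that the search features become target-dependent.
import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
template = torch.randn(1, 49, 64)    # e.g. 7x7 template patch tokens
search = torch.randn(1, 256, 64)     # e.g. 16x16 search-region tokens
correlated, attn_weights = cross_attn(query=search, key=template, value=template)
print(correlated.shape)              # torch.Size([1, 256, 64]) target-dependent features
```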
arXiv Detail & Related papers (2022-03-03T11:53:54Z)
- Tensor Representations for Action Recognition [54.710267354274194]
Human actions in sequences are characterized by the complex interplay between spatial features and their temporal dynamics.
We propose novel tensor representations for capturing higher-order relationships between visual features for the task of action recognition.
We use higher-order tensors and so-called Eigenvalue Power Normalization (EPN), which have long been speculated to perform spectral detection of higher-order occurrences.
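A compact illustration of eigenvalue power normalization applied to a second-order (autocorrelation) feature matrix; the paper works with higher-order tensors, and the exponent used here is an assumed value:

```python
# Illustrative Eigenvalue Power Normalization (EPN) on a second-order feature
# matrix (a simplified stand-in for the paper's higher-order tensor pipeline).
import torch

def eigenvalue_power_normalize(features, gamma=0.5):
    """features: (N, D) per-frame descriptors. Returns the EPN'd autocorrelation matrix."""
    m = features.t() @ features / features.size(0)   # (D, D) second-order pooling
    eigvals, eigvecs = torch.linalg.eigh(m)          # symmetric eigendecomposition
    eigvals = eigvals.clamp(min=0) ** gamma          # dampen dominant "bursty" directions
    return eigvecs @ torch.diag(eigvals) @ eigvecs.t()

feats = torch.randn(30, 8)                           # 30 frames, 8-dim features
print(eigenvalue_power_normalize(feats).shape)       # torch.Size([8, 8])
```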
arXiv Detail & Related papers (2020-12-28T17:27:18Z)
- Unsupervised Spatio-temporal Latent Feature Clustering for Multiple-object Tracking and Segmentation [0.5591659577198183]
We propose a strategy that treats the temporal identification task as a heterogeneous-temporal clustering problem.
We use a convolutional and fully connected autoencoder to learn discriminative features from segmentation masks and detection bounding boxes.
Our results show that our technique outperforms several state-of-the-art methods.
arXiv Detail & Related papers (2020-07-14T16:47:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.