Technical Report: Masked Skeleton Sequence Modeling for Learning Larval Zebrafish Behavior Latent Embeddings
- URL: http://arxiv.org/abs/2403.15693v1
- Date: Sat, 23 Mar 2024 02:58:10 GMT
- Title: Technical Report: Masked Skeleton Sequence Modeling for Learning Larval Zebrafish Behavior Latent Embeddings
- Authors: Lanxin Xu, Shuo Wang,
- Abstract summary: We introduce a novel self-supervised learning method for extracting latent embeddings from behaviors of larval zebrafish.
For the skeletal sequences of swimming zebrafish, we propose a pioneering Transformer-CNN architecture, the Sequence Spatial-Temporal Transformer (SSTFormer)
- Score: 5.922172844641853
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this report, we introduce a novel self-supervised learning method for extracting latent embeddings from behaviors of larval zebrafish. Drawing inspiration from Masked Modeling techniquesutilized in image processing with Masked Autoencoders (MAE) \cite{he2022masked} and in natural language processing with Generative Pre-trained Transformer (GPT) \cite{radford2018improving}, we treat behavior sequences as a blend of images and language. For the skeletal sequences of swimming zebrafish, we propose a pioneering Transformer-CNN architecture, the Sequence Spatial-Temporal Transformer (SSTFormer), designed to capture the inter-frame correlation of different joints. This correlation is particularly valuable, as it reflects the coordinated movement of various parts of the fish body across adjacent frames. To handle the high frame rate, we segment the skeleton sequence into distinct time slices, analogous to "words" in a sentence, and employ self-attention transformer layers to encode the consecutive frames within each slice, capturing the spatial correlation among different joints. Furthermore, we incorporate a CNN-based attention module to enhance the representations outputted by the transformer layers. Lastly, we introduce a temporal feature aggregation operation between time slices to improve the discrimination of similar behaviors.
Related papers
- Leveraging 2D Information for Long-term Time Series Forecasting with Vanilla Transformers [55.475142494272724]
Time series prediction is crucial for understanding and forecasting complex dynamics in various domains.
We introduce GridTST, a model that combines the benefits of two approaches using innovative multi-directional attentions.
The model consistently delivers state-of-the-art performance across various real-world datasets.
arXiv Detail & Related papers (2024-05-22T16:41:21Z) - Co-Speech Gesture Detection through Multi-Phase Sequence Labeling [3.924524252255593]
We introduce a novel framework that reframes the task as a multi-phase sequence labeling problem.
We evaluate our proposal on a large dataset of diverse co-speech gestures in task-oriented face-to-face dialogues.
arXiv Detail & Related papers (2023-08-21T12:27:18Z) - Learning to Mask and Permute Visual Tokens for Vision Transformer
Pre-Training [59.923672191632065]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT)
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet.
arXiv Detail & Related papers (2023-06-12T18:12:19Z) - STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition [50.064502884594376]
We study the problem of human action recognition using motion capture (MoCap) sequences.
We propose a novel Spatial-Temporal Mesh Transformer (STMT) to directly model the mesh sequences.
The proposed method achieves state-of-the-art performance compared to skeleton-based and point-cloud-based models.
arXiv Detail & Related papers (2023-03-31T16:19:27Z) - TimeMAE: Self-Supervised Representations of Time Series with Decoupled
Masked Autoencoders [55.00904795497786]
We propose TimeMAE, a novel self-supervised paradigm for learning transferrable time series representations based on transformer networks.
The TimeMAE learns enriched contextual representations of time series with a bidirectional encoding scheme.
To solve the discrepancy issue incurred by newly injected masked embeddings, we design a decoupled autoencoder architecture.
arXiv Detail & Related papers (2023-03-01T08:33:16Z) - Temporal-Viewpoint Transportation Plan for Skeletal Few-shot Action
Recognition [38.27785891922479]
Few-shot learning pipeline for 3D skeleton-based action recognition by Joint tEmporal and cAmera viewpoiNt alIgnmEnt.
arXiv Detail & Related papers (2022-10-30T11:46:38Z) - Spatio-Temporal Transformer for Dynamic Facial Expression Recognition in
the Wild [19.5702895176141]
We propose a method for capturing discnative features within each frame model.
We utilize the CNN to translate each frame into a visual feature sequence.
Experiments indicate that our method provides an effective way to make use of the spatial and temporal dependencies.
arXiv Detail & Related papers (2022-05-10T08:47:15Z) - Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition [8.905895607185135]
Transformer shows great potential to model the correlation of important joints.
Existing Transformer-based methods cannot capture the correlation of different joints between frames.
Atemporal-recognition module is proposed to capture the relationship of different joints in consecutive frames.
arXiv Detail & Related papers (2022-01-08T16:03:01Z) - Skeleton-Aware Networks for Deep Motion Retargeting [83.65593033474384]
We introduce a novel deep learning framework for data-driven motion between skeletons.
Our approach learns how to retarget without requiring any explicit pairing between the motions in the training set.
arXiv Detail & Related papers (2020-05-12T12:51:40Z) - Image Morphing with Perceptual Constraints and STN Alignment [70.38273150435928]
We propose a conditional GAN morphing framework operating on a pair of input images.
A special training protocol produces sequences of frames, combined with a perceptual similarity loss, promote smooth transformation over time.
We provide comparisons to classic as well as latent space morphing techniques, and demonstrate that, given a set of images for self-supervision, our network learns to generate visually pleasing morphing effects.
arXiv Detail & Related papers (2020-04-29T10:49:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.