Parallelized Spatiotemporal Binding
- URL: http://arxiv.org/abs/2402.17077v1
- Date: Mon, 26 Feb 2024 23:16:34 GMT
- Title: Parallelized Spatiotemporal Binding
- Authors: Gautam Singh, Yue Wang, Jiawei Yang, Boris Ivanovic, Sungjin Ahn,
Marco Pavone, Tong Che
- Abstract summary: We introduce Parallelizable Spatiotemporal Binder or PSB, the first temporally-parallelizable slot learning architecture for sequential inputs.
Unlike conventional RNN-based approaches, PSB produces object-centric representations, known as slots, for all time-steps in parallel.
Compared to the state-of-the-art, our architecture demonstrates stable training on longer sequences, achieves parallelization that results in a 60% increase in training speed, and yields performance that is on par with or better on unsupervised 2D and 3D object-centric scene decomposition and understanding.
- Score: 47.67393266882402
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While modern best practices advocate for scalable architectures that support
long-range interactions, object-centric models are yet to fully embrace these
architectures. In particular, existing object-centric models for handling
sequential inputs, due to their reliance on RNN-based implementation, show poor
stability and capacity and are slow to train on long sequences. We introduce
Parallelizable Spatiotemporal Binder or PSB, the first
temporally-parallelizable slot learning architecture for sequential inputs.
Unlike conventional RNN-based approaches, PSB produces object-centric
representations, known as slots, for all time-steps in parallel. This is
achieved by refining the initial slots across all time-steps through a fixed
number of layers equipped with causal attention. By capitalizing on the
parallelism induced by our architecture, the proposed model exhibits a
significant boost in efficiency. In experiments, we test PSB extensively as an
encoder within an auto-encoding framework paired with a wide variety of decoder
options. Compared to the state of the art, our architecture demonstrates stable
training on longer sequences, achieves parallelization that results in a 60%
increase in training speed, and yields performance on par with or better than
prior methods on unsupervised 2D and 3D object-centric scene decomposition and
understanding.
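The refinement scheme the abstract describes (initial slots updated for all time-steps at once through a fixed number of layers with causal attention) can be illustrated with a minimal sketch. This is an assumption-laden toy, not the PSB implementation: the function name `causal_slot_refinement`, the single-head dot-product attention over per-step features, and the residual update are all illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_slot_refinement(inputs, init_slots, num_layers=3):
    """Refine slots for all time-steps in parallel with causal attention.
    inputs: (T, D) per-step features; init_slots: (S, D) shared initial slots."""
    T, D = inputs.shape
    S = init_slots.shape[0]
    # Broadcast the same initial slots to every time-step.
    slots = np.broadcast_to(init_slots, (T, S, D)).copy()
    # Causal mask: slots at step t may attend only to inputs at steps <= t.
    mask = np.tril(np.ones((T, T), dtype=bool))
    for _ in range(num_layers):
        scores = np.einsum('tsd,ud->tsu', slots, inputs) / np.sqrt(D)
        scores = np.where(mask[:, None, :], scores, -np.inf)
        attn = softmax(scores, axis=-1)          # (T, S, T)
        update = np.einsum('tsu,ud->tsd', attn, inputs)
        slots = slots + update                   # residual refinement
    return slots
```

All time-steps are processed in one batched attention call per layer, which is where the training-time parallelism comes from; the causal mask preserves the temporal ordering an RNN would enforce sequentially.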
Related papers
- TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
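The "TC" stream's use of the CWT to lift a 1D signal into a 2D (scale x time) tensor can be sketched in a few lines. This is a generic real-Morlet CWT in NumPy, not TCCT-Net's pipeline; the wavelet choice, normalization, and the name `morlet_cwt` are illustrative assumptions.

```python
import numpy as np

def morlet_cwt(signal, scales, w0=5.0):
    """Continuous Wavelet Transform with a real Morlet wavelet.
    Returns a 2D (num_scales x T) tensor of wavelet responses."""
    T = len(signal)
    t = np.arange(-(T // 2), T - T // 2)  # centered time axis
    out = np.zeros((len(scales), T))
    for i, s in enumerate(scales):
        # Real-valued Morlet wavelet at scale s, unit-norm for comparability.
        psi = np.cos(w0 * t / s) * np.exp(-0.5 * (t / s) ** 2)
        psi /= np.linalg.norm(psi) + 1e-12
        out[i] = np.convolve(signal, psi, mode="same")
    return out
```

The resulting (scale x time) tensor is what a downstream 2D convolutional stream can consume like an image.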
arXiv Detail & Related papers (2024-04-15T06:01:48Z) - Multi-Level Aggregation and Recursive Alignment Architecture for Efficient Parallel Inference Segmentation Network [18.47001817385548]
We propose a parallel inference network customized for semantic segmentation tasks.
We employ a shallow backbone to ensure real-time speed, and propose three core components to compensate for the reduced model capacity to improve accuracy.
Our framework shows a better balance between speed and accuracy than state-of-the-art real-time methods on Cityscapes and CamVid datasets.
arXiv Detail & Related papers (2024-02-03T22:51:17Z) - ViR: Towards Efficient Vision Retention Backbones [97.93707844681893]
We propose a new class of computer vision models, dubbed Vision Retention Networks (ViR)
ViR has dual parallel and recurrent formulations, which strike an optimal balance between fast inference and parallel training with competitive performance.
We have validated the effectiveness of ViR through extensive experiments with different dataset sizes and various image resolutions.
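The "dual parallel and recurrent formulations" mentioned above are the defining property of retention-style layers: a decayed, causally masked attention computed in parallel for training, and an equivalent per-step state recurrence for fast inference. A minimal single-head NumPy sketch (the decay `gamma` and the function names are illustrative, not ViR's actual code):

```python
import numpy as np

def retention_parallel(q, k, v, gamma):
    """Parallel form: decayed, causally masked attention over all steps at once."""
    T = q.shape[0]
    idx = np.arange(T)
    # Decay matrix D[t, u] = gamma^(t-u) for u <= t, else 0.
    D = np.where(idx[:, None] >= idx[None, :],
                 gamma ** (idx[:, None] - idx[None, :]), 0.0)
    return (q @ k.T * D) @ v

def retention_recurrent(q, k, v, gamma):
    """Recurrent form: one decayed state update per step, O(1) memory in T."""
    T, d = q.shape
    S = np.zeros((d, v.shape[1]))
    out = np.zeros((T, v.shape[1]))
    for t in range(T):
        S = gamma * S + np.outer(k[t], v[t])
        out[t] = q[t] @ S
    return out
```

Both forms compute o_t = sum_{u<=t} gamma^(t-u) (q_t . k_u) v_u, so their outputs match exactly; that equivalence is what lets such models train in parallel yet infer recurrently.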
arXiv Detail & Related papers (2023-10-30T16:55:50Z) - FormerTime: Hierarchical Multi-Scale Representations for Multivariate
Time Series Classification [53.55504611255664]
FormerTime is a hierarchical representation model for improving the classification capacity for the multivariate time series classification task.
It offers three merits: (1) learning hierarchical multi-scale representations from time series data, (2) inheriting the strengths of both transformers and convolutional networks, and (3) tackling the efficiency challenges incurred by the self-attention mechanism.
arXiv Detail & Related papers (2023-02-20T07:46:14Z) - Physics-inspired Ising Computing with Ring Oscillator Activated p-bits [0.0]
We design and implement a truly asynchronous and medium-scale p-computer with ~800 p-bits.
We evaluate the performance of the asynchronous architecture against an ideal, synchronous design.
Our results highlight the promise of massively scaled p-computers with millions of free-running p-bits.
arXiv Detail & Related papers (2022-05-15T23:46:58Z) - Large Scale Time-Series Representation Learning via Simultaneous Low and
High Frequency Feature Bootstrapping [7.0064929761691745]
We propose a non-contrastive self-supervised learning approach that efficiently captures low- and high-frequency time-varying features.
Our method takes raw time series data as input and creates two different augmented views for two branches of the model.
To demonstrate the robustness of our model, we performed extensive experiments and ablation studies on five real-world time-series datasets.
arXiv Detail & Related papers (2022-04-24T14:39:47Z) - Model-Architecture Co-Design for High Performance Temporal GNN Inference
on FPGA [5.575293536755127]
Real-world applications require high performance inference on real-time streaming dynamic graphs.
We present a novel model-architecture co-design for inference in memory-based TGNNs on FPGAs.
We train our simplified models using knowledge distillation to ensure similar accuracy vis-à-vis the original model.
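The knowledge-distillation step referenced above (training a simplified student to match the original teacher) classically minimizes a temperature-softened KL divergence. A minimal sketch of that loss, assuming the standard Hinton-style `T**2` scaling; names and values are illustrative:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 so gradients stay comparable across temperatures."""
    p = softmax(teacher_logits, T)  # soft targets from the teacher
    q = softmax(student_logits, T)
    return T * T * np.sum(p * (np.log(p) - np.log(q)))
```

In practice this term is mixed with the ordinary task loss on hard labels; the mixing weight and temperature are tuned per task.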
arXiv Detail & Related papers (2022-03-10T00:24:47Z) - Parallel Spatio-Temporal Attention-Based TCN for Multivariate Time
Series Prediction [4.211344046281808]
A recurrent neural network with attention to extend the prediction window is the current state of the art for this task.
We argue that their vanishing gradients, short memories, and serial architecture make RNNs fundamentally unsuited to long-horizon forecasting with complex data.
We propose a framework called PSTA-TCN that combines a parallel spatio-temporal attention mechanism with stacked TCN backbones to extract dynamic internal correlations.
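The TCN backbones referenced above are built from dilated causal convolutions, which is what lets them cover long horizons without recurrence. A toy NumPy version, loop-based for clarity (real TCNs use learned multi-channel kernels, residual blocks, and stacked layers):

```python
import numpy as np

def dilated_causal_conv(x, w, dilation=1):
    """y[t] = sum_i w[i] * x[t - dilation*i]: the output at t never sees the future.
    Stacking layers with dilations 1, 2, 4, ... grows the receptive field
    exponentially with depth."""
    T, K = len(x), len(w)
    y = np.zeros(T)
    for t in range(T):
        for i in range(K):
            u = t - dilation * i
            if u >= 0:
                y[t] += w[i] * x[u]
    return y
```

An impulse input makes the causality visible: with taps `[1, 2]` and dilation 2, a unit impulse at t=0 produces responses only at t=0 and t=2.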
arXiv Detail & Related papers (2022-03-02T09:27:56Z) - Sketching as a Tool for Understanding and Accelerating Self-attention
for Long Sequences [52.6022911513076]
Transformer-based models are not efficient in processing long sequences due to the quadratic space and time complexity of the self-attention modules.
Prior methods such as Linformer and Informer reduce the quadratic complexity to linear (modulo logarithmic factors) via low-dimensional projection and row selection.
Based on the theoretical analysis, we propose Skeinformer to accelerate self-attention and further improve the accuracy of matrix approximation to self-attention.
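The low-dimensional-projection idea can be sketched directly: projecting keys and values along the sequence axis from n down to r makes the score matrix n x r instead of n x n. A Linformer-flavored NumPy toy (the Gaussian projection and `proj_dim` are illustrative simplifications; Skeinformer's sketching scheme is more refined):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lowrank_attention(q, k, v, proj_dim=8, seed=0):
    """Project K and V along sequence length n -> proj_dim, so attention
    costs O(n * proj_dim) in time and memory instead of O(n^2)."""
    n, d = k.shape
    rng = np.random.default_rng(seed)
    E = rng.standard_normal((proj_dim, n)) / np.sqrt(proj_dim)  # shared projection
    k_p, v_p = E @ k, E @ v                 # (proj_dim, d)
    scores = q @ k_p.T / np.sqrt(d)         # (n, proj_dim), not (n, n)
    return softmax(scores) @ v_p            # (n, d)
```

The approximation quality depends on how well the random projection preserves the attention spectrum, which is exactly the question a sketching analysis addresses.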
arXiv Detail & Related papers (2021-12-10T06:58:05Z) - Encoder-decoder with Multi-level Attention for 3D Human Shape and Pose
Estimation [61.98690211671168]
We propose a Multi-level Attention-Decoder Network (MAED) to model multi-level attentions in a unified framework.
With the training set of 3DPW, MAED outperforms previous state-of-the-art methods by 6.2, 7.2, and 2.4 mm of PA-MPJPE.
arXiv Detail & Related papers (2021-09-06T09:06:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.