MS-LSTM: Exploring Spatiotemporal Multiscale Representations in Video
Prediction Domain
- URL: http://arxiv.org/abs/2304.07724v3
- Date: Fri, 16 Feb 2024 07:11:05 GMT
- Title: MS-LSTM: Exploring Spatiotemporal Multiscale Representations in Video
Prediction Domain
- Authors: Zhifeng Ma, Hao Zhang, Jie Liu
- Abstract summary: Existing RNN models obtain multi-scale features only by stacking layers.
This paper proposes MS-LSTM wholly from a multi-scale perspective.
We theoretically analyze the training cost and performance of MS-LSTM and its components.
- Score: 8.216911980865902
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The drastic variation of motion in spatial and temporal dimensions makes the
video prediction task extremely challenging. Existing RNN models obtain higher
performance by deepening or widening the model. They obtain the multi-scale
features of the video only by stacking layers, which is inefficient and brings
unbearable training costs (such as memory, FLOPs, and training time). In
contrast, this paper proposes a spatiotemporal multi-scale model called
MS-LSTM, designed wholly from a multi-scale perspective. On the basis of stacked layers,
MS-LSTM incorporates two additional efficient multi-scale designs to fully
capture spatiotemporal context information. Concretely, we employ LSTMs with
mirrored pyramid structures to construct spatial multi-scale representations
and LSTMs with different convolution kernels to construct temporal multi-scale
representations. We theoretically analyze the training cost and performance of
MS-LSTM and its components. Detailed comparison experiments with twelve
baseline models on four video datasets show that MS-LSTM has better performance
but lower training costs.
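To make the cost argument in the abstract concrete, the following is a minimal sketch of how a mirrored pyramid of spatial scales could reduce convolution cost compared with running every stacked layer at full resolution. It is not the authors' implementation: the function names, the halving factor per layer, the even layer count, and the fixed channel count are all assumptions made for illustration.

```python
def mirrored_pyramid_sizes(base_size, num_layers):
    """Spatial side length per layer for a hypothetical mirrored pyramid.

    Assumption: the first half of the stack halves the resolution at each
    layer, and the second half mirrors it back up to the input resolution.
    """
    if num_layers % 2 != 0:
        raise ValueError("this sketch assumes an even number of layers")
    half = num_layers // 2
    down = [base_size // (2 ** i) for i in range(half)]
    return down + down[::-1]


def relative_conv_cost(sizes):
    """Convolution cost of the stack relative to an all-full-resolution stack.

    Assumption: kernel size and channel count are the same in every layer,
    so the multiply-adds of each layer scale with side * side.
    """
    full = len(sizes) * sizes[0] ** 2
    return sum(s * s for s in sizes) / full


sizes = mirrored_pyramid_sizes(base_size=64, num_layers=6)
print(sizes)                      # [64, 32, 16, 16, 32, 64]
print(relative_conv_cost(sizes))  # 0.4375
```

Under these assumptions, a six-layer mirrored pyramid on a 64x64 input needs a little under half the convolution work of six full-resolution layers. The temporal multi-scale design (LSTMs with different convolution kernels) is not modeled here.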
Related papers
- xLSTM-UNet can be an Effective 2D & 3D Medical Image Segmentation Backbone with Vision-LSTM (ViL) better than its Mamba Counterpart [13.812935743270517]
We propose xLSTM-UNet, a UNet structured deep learning neural network that leverages Vision-LSTM (xLSTM) as its backbone for medical image segmentation.
xLSTM was recently proposed as the successor of Long Short-Term Memory (LSTM) networks.
Our findings demonstrate that xLSTM-UNet consistently surpasses the performance of leading CNN-based, Transformer-based, and Mamba-based segmentation networks.
arXiv Detail & Related papers (2024-07-01T17:59:54Z) - ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation.
How to effectively encode and understand videos in video-based dialogue systems remains to be solved.
We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z) - HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models [70.25499865569353]
We introduce HyperLLaVA, which involves adaptive tuning of the projector and LLM parameters, in conjunction with a dynamic visual expert and language expert.
Our solution significantly surpasses LLaVA on existing MLLM benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench.
arXiv Detail & Related papers (2024-03-20T09:42:43Z) - $λ$-ECLIPSE: Multi-Concept Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space [61.091910046492345]
$\lambda$-ECLIPSE works in the latent space of a pre-trained CLIP model without relying on the diffusion UNet models.
$\lambda$-ECLIPSE performs multi-subject-driven P-T2I with just 34M parameters and is trained on a mere 74 GPU hours.
arXiv Detail & Related papers (2024-02-07T19:07:10Z) - Large AI Model Empowered Multimodal Semantic Communications [51.17527319441436]
We propose a Large AI Model-based Multimodal SC (LAM-MSC) framework.
We first present the SC-based Multimodal Alignment (MMA).
Then, a personalized LLM-based Knowledge Base (LKB) is proposed.
Finally, we apply the Conditional Generative adversarial networks-based channel Estimation (CGE) to obtain Channel State Information (CSI).
arXiv Detail & Related papers (2023-09-03T19:24:34Z) - Algorithm and Hardware Co-Design of Energy-Efficient LSTM Networks for
Video Recognition with Hierarchical Tucker Tensor Decomposition [22.502146009817416]
Long short-term memory (LSTM) is a powerful deep neural network that has been widely used in sequence analysis and modeling applications.
In this paper, we propose to perform algorithm and hardware co-design towards high-performance energy-efficient LSTM networks.
arXiv Detail & Related papers (2022-12-05T05:51:56Z) - A journey in ESN and LSTM visualisations on a language task [77.34726150561087]
We trained ESNs and LSTMs on a Cross-Situational Learning (CSL) task.
The results are of three kinds: performance comparison, internal dynamics analyses and visualization of latent space.
arXiv Detail & Related papers (2020-12-03T08:32:01Z) - Deep Learning modeling of Limit Order Book: a comparative perspective [0.0]
The present work addresses theoretical and practical questions in the domain of Deep Learning for High Frequency Trading.
State-of-the-art models such as Random models, Logistic Regressions, LSTMs, LSTMs equipped with an Attention mask, CNN-LSTM and Attentions are reviewed and compared on the same tasks.
The underlying dimensions of the modeling techniques are investigated to understand whether these are intrinsic to the Limit Order Book's dynamics.
arXiv Detail & Related papers (2020-07-12T17:06:30Z) - Multi-view Frequency LSTM: An Efficient Frontend for Automatic Speech
Recognition [4.753402561130792]
We present a simple and efficient modification by combining the outputs of multiple FLSTM stacks with different views.
We show that the multi-view FLSTM acoustic model provides relative Word Error Rate (WER) improvements of 3-7% for different speaker and acoustic environment scenarios.
arXiv Detail & Related papers (2020-06-30T22:19:53Z) - Sentiment Analysis Using Simplified Long Short-term Memory Recurrent
Neural Networks [1.5146765382501612]
We perform sentiment analysis on a GOP Debate Twitter dataset.
To speed up training and reduce the computational cost and time, six different parameter-reduced slim versions of the LSTM model are proposed.
arXiv Detail & Related papers (2020-05-08T12:50:10Z) - Convolutional Tensor-Train LSTM for Spatio-temporal Learning [116.24172387469994]
We propose a higher-order LSTM model that can efficiently learn long-term correlations in the video sequence.
This is accomplished through a novel tensor train module that performs prediction by combining convolutional features across time.
Our results achieve state-of-the-art performance in a wide range of applications and datasets.
arXiv Detail & Related papers (2020-02-21T05:00:01Z)