MS-LSTM: Exploring Spatiotemporal Multiscale Representations in Video Prediction Domain
- URL: http://arxiv.org/abs/2304.07724v3
- Date: Fri, 16 Feb 2024 07:11:05 GMT
- Title: MS-LSTM: Exploring Spatiotemporal Multiscale Representations in Video Prediction Domain
- Authors: Zhifeng Ma, Hao Zhang, Jie Liu
- Abstract summary: Existing RNN models obtain multi-scale features only by stacking layers.
This paper proposes MS-LSTM wholly from a multi-scale perspective.
We theoretically analyze the training cost and performance of MS-LSTM and its components.
- Score: 8.216911980865902
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The drastic variation of motion in spatial and temporal dimensions makes the
video prediction task extremely challenging. Existing RNN models obtain higher
performance by deepening or widening the model. They obtain the multi-scale
features of the video only by stacking layers, which is inefficient and brings
unbearable training costs (such as memory, FLOPs, and training time). Different
from them, this paper proposes a spatiotemporal multi-scale model called
MS-LSTM wholly from a multi-scale perspective. On the basis of stacked layers,
MS-LSTM incorporates two additional efficient multi-scale designs to fully
capture spatiotemporal context information. Concretely, we employ LSTMs with
mirrored pyramid structures to construct spatial multi-scale representations
and LSTMs with different convolution kernels to construct temporal multi-scale
representations. We theoretically analyze the training cost and performance of
MS-LSTM and its components. Detailed comparison experiments with twelve
baseline models on four video datasets show that MS-LSTM has better performance
but lower training costs.
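The abstract's two multi-scale designs can be pictured concretely. Below is a minimal, hypothetical PyTorch sketch, not the authors' implementation: a plain ConvLSTM cell stacked into a mirrored spatial pyramid (resolution halves across the first half of the stack and doubles back across the second), with a different convolution kernel size per cell standing in for the temporal multi-scale design. The class names, the kernel sizes (3, 5, 5, 3), the channel width, and the pooling/upsampling choices are all illustrative assumptions.

```python
# Hypothetical sketch only: layer count, kernel sizes, channel width, and the
# plain ConvLSTM cell below are assumptions, not MS-LSTM's actual configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvLSTMCell(nn.Module):
    """Standard ConvLSTM cell; kernel_size sets how wide a neighborhood
    feeds each recurrent update (the temporal multi-scale knob)."""

    def __init__(self, in_ch, hid_ch, kernel_size):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)


class MirroredPyramidConvLSTM(nn.Module):
    """Stack of ConvLSTM cells arranged as a mirrored spatial pyramid:
    resolution halves across the first half of the stack and doubles back
    across the second half, while each cell uses a different kernel size."""

    def __init__(self, channels=16, kernels=(3, 5, 5, 3)):
        super().__init__()
        assert len(kernels) % 2 == 0, "pyramid must mirror around its middle"
        self.cells = nn.ModuleList(
            ConvLSTMCell(channels, channels, k) for k in kernels)

    def init_states(self, batch, height, width):
        states, h, w = [], height, width
        half = len(self.cells) // 2
        for idx, cell in enumerate(self.cells):
            h, w = (h // 2, w // 2) if idx < half else (h * 2, w * 2)
            zeros = torch.zeros(batch, cell.hid_ch, h, w)
            states.append((zeros, zeros.clone()))
        return states

    def forward(self, frames, states):
        # frames: (T, B, C, H, W); one (h, c) pair per cell in `states`.
        outputs, half = [], len(self.cells) // 2
        for x in frames:
            for idx, cell in enumerate(self.cells):
                if idx < half:
                    x = F.avg_pool2d(x, 2)                # encoder: coarsen
                else:
                    x = F.interpolate(x, scale_factor=2)  # decoder: mirror up
                x, states[idx] = cell(x, states[idx])
            outputs.append(x)
        return torch.stack(outputs), states


# Illustrative usage: ten 64x64 feature frames with 16 channels, batch of 2.
model = MirroredPyramidConvLSTM()
frames = torch.randn(10, 2, 16, 64, 64)
states = model.init_states(batch=2, height=64, width=64)
preds, states = model(frames, states)   # preds: (10, 2, 16, 64, 64)
```

Under this sketch, cells with larger kernels see wider neighborhoods per recurrent step, while the pyramid supplies coarse-to-fine spatial context without deepening the stack, which is the efficiency argument the abstract makes.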
Related papers
- LLaVA-KD: A Framework of Distilling Multimodal Large Language Models [70.19607283302712]
We propose a novel framework to transfer knowledge from l-MLLM to s-MLLM.
Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of l-MLLM and s-MLLM (a generic sketch of this kind of distillation loss appears after this list).
We also propose a three-stage training scheme to fully exploit the potential of s-MLLM.
arXiv Detail & Related papers (2024-10-21T17:41:28Z)
- EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment [39.870809905905325]
We propose Empowering Multi-modal Mamba with Structural and Hierarchical Alignment (EMMA) to extract fine-grained visual information.
Our model shows lower latency than other Mamba-based MLLMs and is nearly four times faster than transformer-based MLLMs of similar scale during inference.
arXiv Detail & Related papers (2024-10-08T11:41:55Z)
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
- xLSTM-UNet can be an Effective 2D & 3D Medical Image Segmentation Backbone with Vision-LSTM (ViL) better than its Mamba Counterpart [13.812935743270517]
We propose xLSTM-UNet, a UNet-structured deep neural network that leverages Vision-LSTM (xLSTM) as its backbone for medical image segmentation.
xLSTM was recently proposed as the successor of Long Short-Term Memory (LSTM) networks.
Our findings demonstrate that xLSTM-UNet consistently surpasses the performance of leading CNN-based, Transformer-based, and Mamba-based segmentation networks.
arXiv Detail & Related papers (2024-07-01T17:59:54Z) - ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation.
How to effectively encode and understand videos in video-based dialogue systems remains an open problem.
We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z)
- $λ$-ECLIPSE: Multi-Concept Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space [61.091910046492345]
$λ$-ECLIPSE works in the latent space of a pre-trained CLIP model without relying on the diffusion UNet models.
$λ$-ECLIPSE performs multi-subject-driven P-T2I with just 34M parameters and is trained on a mere 74 GPU hours.
arXiv Detail & Related papers (2024-02-07T19:07:10Z)
- Algorithm and Hardware Co-Design of Energy-Efficient LSTM Networks for Video Recognition with Hierarchical Tucker Tensor Decomposition [22.502146009817416]
Long short-term memory (LSTM) is a powerful deep neural network that has been widely used in sequence analysis and modeling applications.
In this paper, we propose to perform algorithm and hardware co-design towards high-performance energy-efficient LSTM networks.
arXiv Detail & Related papers (2022-12-05T05:51:56Z)
- A journey in ESN and LSTM visualisations on a language task [77.34726150561087]
We trained ESNs and LSTMs on a Cross-Situational Learning (CSL) task.
The results are of three kinds: performance comparison, internal dynamics analyses and visualization of latent space.
arXiv Detail & Related papers (2020-12-03T08:32:01Z)
- Multi-view Frequency LSTM: An Efficient Frontend for Automatic Speech Recognition [4.753402561130792]
We present a simple and efficient modification by combining the outputs of multiple FLSTM stacks with different views.
We show that the multi-view FLSTM acoustic model provides relative Word Error Rate (WER) improvements of 3-7% for different speaker and acoustic environment scenarios.
arXiv Detail & Related papers (2020-06-30T22:19:53Z)
- Sentiment Analysis Using Simplified Long Short-term Memory Recurrent Neural Networks [1.5146765382501612]
We perform sentiment analysis on a GOP Debate Twitter dataset.
To speed up training and reduce computational cost and time, six parameter-reduced slim versions of the LSTM model are proposed.
arXiv Detail & Related papers (2020-05-08T12:50:10Z)
- Convolutional Tensor-Train LSTM for Spatio-temporal Learning [116.24172387469994]
We propose a higher-order LSTM model that can efficiently learn long-term correlations in the video sequence.
This is accomplished through a novel tensor train module that performs prediction by combining convolutional features across time.
Our results achieve state-of-the-art performance in a wide range of applications and datasets.
arXiv Detail & Related papers (2020-02-21T05:00:01Z)
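The LLaVA-KD entry above describes minimizing the divergence between the visual-textual output distributions of the large and small models. As a minimal sketch, assuming the standard temperature-scaled KL-divergence formulation of response distillation: the function name, temperature, and tensor shapes below are illustrative choices, not details taken from the paper.

```python
# Generic response-level distillation sketch; assumes the standard
# temperature-scaled KL objective, not LLaVA-KD's exact MDist loss.
import torch
import torch.nn.functional as F


def distill_kl(student_logits, teacher_logits, temperature=2.0):
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    # 'batchmean' sums the pointwise KL terms and divides by the leading
    # (batch) dimension; the t**2 factor keeps gradient magnitudes
    # comparable across temperatures (Hinton et al., 2015).
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * t * t


# Illustrative usage with random logits: batch 4, sequence 8, vocab 32000.
student = torch.randn(4, 8, 32000)
teacher = torch.randn(4, 8, 32000)
loss = distill_kl(student, teacher)
```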
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.