Related papers: MS-LSTM: Exploring Spatiotemporal Multiscale Representations in Video Prediction Domain

MS-LSTM: Exploring Spatiotemporal Multiscale Representations in Video Prediction Domain

URL: http://arxiv.org/abs/2304.07724v3
Date: Fri, 16 Feb 2024 07:11:05 GMT
Title: MS-LSTM: Exploring Spatiotemporal Multiscale Representations in Video Prediction Domain
Authors: Zhifeng Ma, Hao Zhang, Jie Liu
Abstract summary: Existing RNN models obtain the multi-scale of features only by stacking layers. This paper proposes MS-LSTM wholly from a multi-scale perspective. We theoretically analyze the training cost and performance of MS-LSTM and its components.
Score: 8.216911980865902
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The drastic variation of motion in spatial and temporal dimensions makes the video prediction task extremely challenging. Existing RNN models obtain higher performance by deepening or widening the model. They obtain the multi-scale features of the video only by stacking layers, which is inefficient and brings unbearable training costs (such as memory, FLOPs, and training time). Different from them, this paper proposes a spatiotemporal multi-scale model called MS-LSTM wholly from a multi-scale perspective. On the basis of stacked layers, MS-LSTM incorporates two additional efficient multi-scale designs to fully capture spatiotemporal context information. Concretely, we employ LSTMs with mirrored pyramid structures to construct spatial multi-scale representations and LSTMs with different convolution kernels to construct temporal multi-scale representations. We theoretically analyze the training cost and performance of MS-LSTM and its components. Detailed comparison experiments with twelve baseline models on four video datasets show that MS-LSTM has better performance but lower training costs.

Related papers

LLaVA-KD: A Framework of Distilling Multimodal Large Language Models [70.19607283302712]
We propose a novel framework to transfer knowledge from l-MLLM to s-MLLM. Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of l-MLLM and s-MLLM. We also propose a three-stage training scheme to fully exploit the potential of s-MLLM.
arXiv Detail & Related papers (2024-10-21T17:41:28Z)
EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment [39.870809905905325]
We propose Empowering Multi-modal Mamba with Structural and Hierarchical Alignment (EMMA) to extract fine-grained visual information. Our model shows lower latency than other Mamba-based MLLMs and is nearly four times faster than transformer-based MLLMs of similar scale during inference.
arXiv Detail & Related papers (2024-10-08T11:41:55Z)
NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks. We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
xLSTM-UNet can be an Effective 2D & 3D Medical Image Segmentation Backbone with Vision-LSTM (ViL) better than its Mamba Counterpart [13.812935743270517]
We propose xLSTM-UNet, a UNet structured deep learning neural network that leverages Vision-LSTM (xLSTM) as its backbone for medical image segmentation. xLSTM is a recently proposed as the successor of Long Short-Term Memory (LSTM) networks. Our findings demonstrate that xLSTM-UNet consistently surpasses the performance of leading CNN-based, Transformer-based, and Mamba-based segmentation networks.
arXiv Detail & Related papers (2024-07-01T17:59:54Z)
LTSM-Bundle: A Toolbox and Benchmark on Large Language Models for Time Series Forecasting [69.33802286580786]
We introduce LTSM-Bundle, a comprehensive toolbox, and benchmark for training LTSMs. It modularized and benchmarked LTSMs from multiple dimensions, encompassing prompting strategies, tokenization approaches, base model selection, data quantity, and dataset diversity. Empirical results demonstrate that this combination achieves superior zero-shot and few-shot performances compared to state-of-the-art LTSMs and traditional TSF methods.
arXiv Detail & Related papers (2024-06-20T07:09:19Z)
ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation. How to effectively encode and understand videos in video-based dialogue systems remains to be solved. We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z)
$λ$-ECLIPSE: Multi-Concept Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space [61.091910046492345]
$lambda$-ECLIPSE works in the latent space of a pre-trained CLIP model without relying on the diffusion UNet models. $lambda$-ECLIPSE performs multisubject driven P-T2I with just 34M parameters and is trained on a mere 74 GPU hours.
arXiv Detail & Related papers (2024-02-07T19:07:10Z)
Algorithm and Hardware Co-Design of Energy-Efficient LSTM Networks for Video Recognition with Hierarchical Tucker Tensor Decomposition [22.502146009817416]
Long short-term memory (LSTM) is a powerful deep neural network that has been widely used in sequence analysis and modeling applications. In this paper, we propose to perform algorithm and hardware co-design towards high-performance energy-efficient LSTM networks.
arXiv Detail & Related papers (2022-12-05T05:51:56Z)
A journey in ESN and LSTM visualisations on a language task [77.34726150561087]
We trained ESNs and LSTMs on a Cross-Situationnal Learning (CSL) task. The results are of three kinds: performance comparison, internal dynamics analyses and visualization of latent space.
arXiv Detail & Related papers (2020-12-03T08:32:01Z)
Multi-view Frequency LSTM: An Efficient Frontend for Automatic Speech Recognition [4.753402561130792]
We present a simple and efficient modification by combining the outputs of multiple FLSTM stacks with different views. We show that the multi-view FLSTM acoustic model provides relative Word Error Rate (WER) improvements of 3-7% for different speaker and acoustic environment scenarios.
arXiv Detail & Related papers (2020-06-30T22:19:53Z)
Sentiment Analysis Using Simplified Long Short-term Memory Recurrent Neural Networks [1.5146765382501612]
We perform sentiment analysis on a GOP Debate Twitter dataset. To speed up training and reduce the computational cost and time, six different parameter reduced slim versions of the LSTM model are proposed.
arXiv Detail & Related papers (2020-05-08T12:50:10Z)
Convolutional Tensor-Train LSTM for Spatio-temporal Learning [116.24172387469994]
We propose a higher-order LSTM model that can efficiently learn long-term correlations in the video sequence. This is accomplished through a novel tensor train module that performs prediction by combining convolutional features across time. Our results achieve state-of-the-art performance-art in a wide range of applications and datasets.
arXiv Detail & Related papers (2020-02-21T05:00:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.