Seg-LSTM: Performance of xLSTM for Semantic Segmentation of Remotely Sensed Images
- URL: http://arxiv.org/abs/2406.14086v1
- Date: Thu, 20 Jun 2024 08:01:28 GMT
- Title: Seg-LSTM: Performance of xLSTM for Semantic Segmentation of Remotely Sensed Images
- Authors: Qinfeng Zhu, Yuanzhi Cai, Lei Fan
- Abstract summary: This study is the first attempt to evaluate the effectiveness of Vision-LSTM in the semantic segmentation of remotely sensed images.
Our study found that Vision-LSTM's performance in semantic segmentation was limited and generally inferior to Vision-Transformer-based and Vision-Mamba-based models in most comparative tests.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in autoregressive networks with linear complexity have driven significant research progress, demonstrating exceptional performance in large language models. A representative model is the Extended Long Short-Term Memory (xLSTM), which incorporates gating mechanisms and memory structures, performing comparably to Transformer architectures in long-sequence language tasks. Autoregressive networks such as xLSTM can utilize image serialization to extend their application to visual tasks such as classification and segmentation. Although existing studies have demonstrated Vision-LSTM's impressive results in image classification, its performance in image semantic segmentation remains unverified. Our study represents the first attempt to evaluate the effectiveness of Vision-LSTM in the semantic segmentation of remotely sensed images. This evaluation is based on a specifically designed encoder-decoder architecture named Seg-LSTM, and on comparisons with state-of-the-art segmentation networks. Our study found that Vision-LSTM's performance in semantic segmentation was limited and generally inferior to Vision-Transformer-based and Vision-Mamba-based models in most comparative tests. Future research directions for enhancing Vision-LSTM are recommended. The source code is available at https://github.com/zhuqinfeng1999/Seg-LSTM.
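The abstract's key mechanism, image serialization, can be sketched as follows. This is an illustrative example only, not taken from the paper: the patch size, traversal order, and function name are assumptions chosen for clarity. It shows how an image is split into non-overlapping patches and flattened into a 1D token sequence that an autoregressive model such as xLSTM can consume.

```python
# Illustrative sketch (assumption, not the paper's implementation): serialize
# an (H, W, C) image into a sequence of patch tokens in raster order, the
# generic preprocessing step that lets sequence models process images.
import numpy as np

def serialize_image(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patches and flatten
    them row by row, yielding a (num_patches, patch*patch*C) token sequence."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    # (H//p, p, W//p, p, C) -> (H//p, W//p, p, p, C): group pixels by patch
    tiles = image.reshape(h // patch, patch, w // patch, patch, c)
    tiles = tiles.transpose(0, 2, 1, 3, 4)
    # Flatten the patch grid into a 1D sequence of per-patch feature vectors
    return tiles.reshape(-1, patch * patch * c)

image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = serialize_image(image, patch=16)
print(tokens.shape)  # (196, 768): a 196-step sequence of 768-dim tokens
```

In practice each token would then be linearly projected to the model's hidden dimension before being fed to the recurrent or attention backbone; raster order is only one of several possible scan orders.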
Related papers
- DAMamba: Vision State Space Model with Dynamic Adaptive Scan [51.81060691414399]
State space models (SSMs) have recently garnered significant attention in computer vision.
We propose Dynamic Adaptive Scan (DAS), a data-driven method that adaptively allocates scanning orders and regions.
Based on DAS, we propose the vision backbone DAMamba, which significantly outperforms current state-of-the-art vision Mamba models in vision tasks.
arXiv Detail & Related papers (2025-02-18T08:12:47Z) - MAL: Cluster-Masked and Multi-Task Pretraining for Enhanced xLSTM Vision Performance [2.45239928345171]
We introduce MAL (Cluster-Masked and Multi-Task Pretraining for Enhanced xLSTM Vision Performance), a novel framework that enhances xLSTM's capabilities through innovative pretraining strategies.
We propose a cluster-masked masking method that significantly improves local feature capture and optimizes image-scanning efficiency.
Our universal encoder-decoder pretraining approach integrates multiple tasks, including image autoregression, depth estimation, and image segmentation, thereby enhancing the model's adaptability and robustness across diverse visual tasks.
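The core idea behind cluster-style masking can be illustrated with a short sketch. This is an assumption-laden illustration, not the MAL authors' implementation: the grid size, cluster shape, and sampling loop are hypothetical. The point it demonstrates is that masking contiguous blocks of patches, rather than independent random patches, forces a pretrained model to reconstruct local structure from surrounding context.

```python
# Illustrative sketch (assumption, not the MAL paper's code): mask contiguous
# square clusters of patches until roughly mask_ratio of the patch grid is
# hidden, instead of dropping patches independently at random.
import numpy as np

def cluster_mask(grid: int, mask_ratio: float, cluster: int,
                 rng: np.random.Generator) -> np.ndarray:
    """Return a boolean (grid, grid) mask; True marks a masked patch."""
    mask = np.zeros((grid, grid), dtype=bool)
    target = int(mask_ratio * grid * grid)
    while mask.sum() < target:
        # Place one cluster x cluster block at a random valid position
        i = rng.integers(0, grid - cluster + 1)
        j = rng.integers(0, grid - cluster + 1)
        mask[i:i + cluster, j:j + cluster] = True
    return mask

rng = np.random.default_rng(0)
m = cluster_mask(grid=14, mask_ratio=0.5, cluster=2, rng=rng)
```

Compared with per-patch random masking, block masking removes whole local neighborhoods, so reconstruction cannot rely on trivially interpolating from immediately adjacent visible patches.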
arXiv Detail & Related papers (2024-12-14T07:58:24Z) - xLSTM-UNet can be an Effective 2D & 3D Medical Image Segmentation Backbone with Vision-LSTM (ViL) better than its Mamba Counterpart [13.812935743270517]
We propose xLSTM-UNet, a UNet structured deep learning neural network that leverages Vision-LSTM (xLSTM) as its backbone for medical image segmentation.
xLSTM was recently proposed as the successor to Long Short-Term Memory (LSTM) networks.
Our findings demonstrate that xLSTM-UNet consistently surpasses the performance of leading CNN-based, Transformer-based, and Mamba-based segmentation networks.
arXiv Detail & Related papers (2024-07-01T17:59:54Z) - Are Vision xLSTM Embedded UNet More Reliable in Medical 3D Image Segmentation? [3.1777394653936937]
This paper investigates the integration of CNNs with Vision Extended Long Short-Term Memory (Vision-xLSTM) blocks.
Vision-xLSTM blocks capture temporal and global relationships within the patches extracted from the CNN feature maps.
Our primary objective is to propose that Vision-xLSTM forms an appropriate backbone for medical image segmentation, offering excellent performance with reduced computational costs.
arXiv Detail & Related papers (2024-06-24T08:01:05Z) - Efficient Visual State Space Model for Image Deblurring [83.57239834238035]
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have achieved excellent performance in image restoration.
We propose a simple yet effective visual state space model (EVSSM) for image deblurring.
arXiv Detail & Related papers (2024-05-23T09:13:36Z) - PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model [49.80313655590392]
PSALM is a powerful extension of the Large Multi-modal Model (LMM) to address the segmentation task challenges.
It incorporates a mask decoder and a well-designed input schema to handle a variety of segmentation tasks.
The flexible design of PSALM supports joint training across multiple datasets and tasks, leading to improved performance and task generalization.
arXiv Detail & Related papers (2024-03-21T17:50:47Z) - VM-UNet: Vision Mamba UNet for Medical Image Segmentation [2.3876474175791302]
We propose a U-shape architecture model for medical image segmentation, named Vision Mamba UNet (VM-UNet).
We conduct comprehensive experiments on the ISIC17, ISIC18, and Synapse datasets, and the results indicate that VM-UNet performs competitively in medical image segmentation tasks.
arXiv Detail & Related papers (2024-02-04T13:37:21Z) - RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight cross-modal module.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z) - Boosting Few-shot Semantic Segmentation with Transformers [81.43459055197435]
We propose a TRansformer-based Few-shot Semantic segmentation method (TRFS).
Our model consists of two modules: a Global Enhancement Module (GEM) and a Local Enhancement Module (LEM).
arXiv Detail & Related papers (2021-08-04T20:09:21Z) - Multi-Perspective LSTM for Joint Visual Representation Learning [81.21490913108835]
We present a novel LSTM cell architecture capable of learning both intra- and inter-perspective relationships available in visual sequences captured from multiple perspectives.
Our architecture adopts a novel recurrent joint learning strategy that uses additional gates and memories at the cell level.
We show that by using the proposed cell to create a network, more effective and richer visual representations are learned for recognition tasks.
arXiv Detail & Related papers (2021-05-06T16:44:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.