Related papers: P-MSDiff: Parallel Multi-Scale Diffusion for Remote Sensing Image Segmentation

P-MSDiff: Parallel Multi-Scale Diffusion for Remote Sensing Image Segmentation

URL: http://arxiv.org/abs/2405.20443v2
Date: Wed, 24 Jul 2024 07:34:35 GMT
Title: P-MSDiff: Parallel Multi-Scale Diffusion for Remote Sensing Image Segmentation
Authors: Qi Zhang, Guohua Geng, Longquan Yan, Pengbo Zhou, Zhaodi Li, Kang Li, Qinglin Liu,
Abstract summary: Diffusion models and multi-scale features are essential components in semantic segmentation tasks. We propose a new model for semantic segmentation known as the diffusion model with parallel multi-scale branches. Our model demonstrates superior performance based on the J1 metric on both the UAVid and Vaihingen Building datasets.
Score: 8.46409964236009
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion models and multi-scale features are essential components in semantic segmentation tasks that deal with remote-sensing images. They contribute to improved segmentation boundaries and offer significant contextual information. U-net-like architectures are frequently employed in diffusion models for segmentation tasks. These architectural designs include dense skip connections that may pose challenges for interpreting intermediate features. Consequently, they might not efficiently convey semantic information throughout various layers of the encoder-decoder architecture. To address these challenges, we propose a new model for semantic segmentation known as the diffusion model with parallel multi-scale branches. This model consists of Parallel Multiscale Diffusion modules (P-MSDiff) and a Cross-Bridge Linear Attention mechanism (CBLA). P-MSDiff enhances the understanding of semantic information across multiple levels of granularity and detects repetitive distribution data through the integration of recursive denoising branches. It further facilitates the amalgamation of data by connecting relevant branches to the primary framework to enable concurrent denoising. Furthermore, within the interconnected transformer architecture, the LA module has been substituted with the CBLA module. This module integrates a semidefinite matrix linked to the query into the dot product computation of keys and values. This integration enables the adaptation of queries within the LA framework. This adjustment enhances the structure for multi-head attention computation, leading to enhanced network performance and CBLA is a plug-and-play module. Our model demonstrates superior performance based on the J1 metric on both the UAVid and Vaihingen Building datasets, showing improvements of 1.60% and 1.40% over strong baseline models, respectively.

Related papers

ParaUni: Enhance Generation in Unified Multimodal Model with Reinforcement-driven Hierarchical Parallel Information Interaction [55.21514454560188]
Unified multimodal models significantly improve visual generation by combining vision-grained models (VLMs) with diffusion models.<n>Existing methods struggle to fully balance sufficient interaction and flexible implementation due to vast representation difference.<n>We propose textbfParaUni, which extracts features from variants VLM's layers in a textbfParallel way for comprehensive information interaction.
arXiv Detail & Related papers (2025-12-05T04:41:57Z)
Complementary and Contrastive Learning for Audio-Visual Segmentation [74.11434759171199]
We present Complementary and Contrastive Transformer (CCFormer), a novel framework adept at processing both local and global information.<n>Our method sets new state-of-the-art benchmarks across the S4, MS3 and AVSS datasets.
arXiv Detail & Related papers (2025-10-11T06:36:59Z)
Every Step Counts: Decoding Trajectories as Authorship Fingerprints of dLLMs [63.82840470917859]
We show that the decoding mechanism of dLLMs can be used as a powerful tool for model attribution.<n>We propose a novel information extraction scheme called the Directed Decoding Map (DDM), which captures structural relationships between decoding steps and better reveals model-specific behaviors.
arXiv Detail & Related papers (2025-10-02T06:25:10Z)
An Enhanced Pyramid Feature Network Based on Long-Range Dependencies for Multi-Organ Medical Image Segmentation [12.785846392087144]
We propose a new deep-learning network called LamFormer for fine-grained segmentation tasks across multiple organs.<n>LamFormer outperforms existing segmentation methods on seven complex and diverse datasets.
arXiv Detail & Related papers (2025-09-29T06:57:11Z)
AMMNet: An Asymmetric Multi-Modal Network for Remote Sensing Semantic Segmentation [11.92569805944134]
Asymmetric Multi-Modal Network (AMMNet) is a novel asymmetric architecture that achieves robust segmentation through three designs tailored for RGB-DSM input pairs.<n>AMMNet attains state-of-the-art segmentation accuracy among multi-modal networks while reducing computational and memory requirements.
arXiv Detail & Related papers (2025-07-22T02:07:19Z)
Structural Similarity-Inspired Unfolding for Lightweight Image Super-Resolution [88.20464308588889]
We propose a Structural Similarity-Inspired Unfolding (SSIU) method for efficient image SR.<n>This method is designed through unfolding an SR optimization function constrained by structural similarity.<n>Our model outperforms current state-of-the-art models, boasting lower parameter counts and reduced memory consumption.
arXiv Detail & Related papers (2025-06-13T14:29:40Z)
BIMII-Net: Brain-Inspired Multi-Iterative Interactive Network for RGB-T Road Scene Semantic Segmentation [6.223341988991549]
We propose a novel RGB-T road scene semantic segmentation network called Brain-Inspired Multi-Iteration Interaction Network ( BIMII-Net) First, to meet the requirements of accurate texture and local information extraction in road scenarios like autonomous driving, we proposed a deep continuous-coupled neural network (DCCNN) architecture based on a brain-inspired model. Second, to enhance the interaction and expression capabilities among multi-modal information, we designed a cross explicit attention-enhanced fusion module (CEAEF-Module) in the feature fusion stage of BIMII-Net. Finally, we constructed a complementary interactive multi-layer decoder
arXiv Detail & Related papers (2025-03-25T03:09:46Z)
Multimodal-Aware Fusion Network for Referring Remote Sensing Image Segmentation [7.992331117310217]
Referring remote sensing image segmentation (RRSIS) is a novel visual task in remote sensing images segmentation. We design a multimodal-aware fusion network (MAFN) to achieve fine-grained alignment and fusion between the two modalities.
arXiv Detail & Related papers (2025-03-14T08:31:21Z)
Multi-scale Feature Enhancement in Multi-task Learning for Medical Image Analysis [1.6916040234975798]
Traditional deep learning methods in medical imaging often focus solely on segmentation or classification. We propose a simple yet effective UNet-based MTL model, where features extracted by the encoder are used to predict classification labels, while the decoder produces the segmentation mask. Experimental results across multiple medical datasets confirm the superior performance of our model in both segmentation and classification tasks.
arXiv Detail & Related papers (2024-11-30T04:20:05Z)
CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware Integration and a Foundational Dataset [26.056704438848985]
We propose a novel model, CCExpert, based on a new, advanced multimodal large model framework. First, we design a difference-aware integration module to capture multi-scale differences between bi-temporal images. Secondly, we constructed a high-quality, diversified dataset called CC-Foundation, containing 200,000 image pairs and 1.2 million captions. Finally, we employed a three-stage progressive training process to ensure the deep integration of the difference-aware integration module with the pretrained MLLM.
arXiv Detail & Related papers (2024-11-18T08:10:49Z)
Brain-Inspired Stepwise Patch Merging for Vision Transformers [6.108377966393714]
We propose a novel technique called Stepwise Patch Merging (SPM), which enhances the subsequent attention mechanism's ability to'see' better. Extensive experiments conducted on benchmark datasets, including ImageNet-1K, COCO, and ADE20K, demonstrate that SPM significantly improves the performance of various models.
arXiv Detail & Related papers (2024-09-11T03:04:46Z)
Local-to-Global Cross-Modal Attention-Aware Fusion for HSI-X Semantic Segmentation [19.461033552684576]
We propose a Local-to-Global Cross-modal Attention-aware Fusion (LoGoCAF) framework for HSI-X classification. LoGoCAF adopts a pixel-to-pixel two-branch semantic segmentation architecture to learn information from HSI and X modalities.
arXiv Detail & Related papers (2024-06-25T16:12:20Z)
DiffVein: A Unified Diffusion Network for Finger Vein Segmentation and Authentication [50.017055360261665]
We introduce DiffVein, a unified diffusion model-based framework which simultaneously addresses vein segmentation and authentication tasks. For better feature interaction between these two branches, we introduce two specialized modules. In this way, our framework allows for a dynamic interplay between diffusion and segmentation embeddings.
arXiv Detail & Related papers (2024-02-03T06:49:42Z)
SIM-Trans: Structure Information Modeling Transformer for Fine-grained Visual Categorization [59.732036564862796]
We propose the Structure Information Modeling Transformer (SIM-Trans) to incorporate object structure information into transformer for enhancing discriminative representation learning. The proposed two modules are light-weighted and can be plugged into any transformer network and trained end-to-end easily. Experiments and analyses demonstrate that the proposed SIM-Trans achieves state-of-the-art performance on fine-grained visual categorization benchmarks.
arXiv Detail & Related papers (2022-08-31T03:00:07Z)
Multi-scale and Cross-scale Contrastive Learning for Semantic Segmentation [5.281694565226513]
We apply contrastive learning to enhance the discriminative power of the multi-scale features extracted by semantic segmentation networks. By first mapping the encoder's multi-scale representations to a common feature space, we instantiate a novel form of supervised local-global constraint.
arXiv Detail & Related papers (2022-03-25T01:24:24Z)
DeMFI: Deep Joint Deblurring and Multi-Frame Interpolation with Flow-Guided Attentive Correlation and Recursive Boosting [50.17500790309477]
DeMFI-Net is a joint deblurring and multi-frame framework. It converts blurry videos of lower-frame-rate to sharp videos at higher-frame-rate. It achieves state-of-the-art (SOTA) performances for diverse datasets.
arXiv Detail & Related papers (2021-11-19T00:00:15Z)
Boosting Few-shot Semantic Segmentation with Transformers [81.43459055197435]
TRansformer-based Few-shot Semantic segmentation method (TRFS) Our model consists of two modules: Global Enhancement Module (GEM) and Local Enhancement Module (LEM)
arXiv Detail & Related papers (2021-08-04T20:09:21Z)
CaEGCN: Cross-Attention Fusion based Enhanced Graph Convolutional Network for Clustering [51.62959830761789]
We propose a cross-attention based deep clustering framework, named Cross-Attention Fusion based Enhanced Graph Convolutional Network (CaEGCN) CaEGCN contains four main modules: cross-attention fusion, Content Auto-encoder, Graph Convolutional Auto-encoder and self-supervised model. Experimental results on different types of datasets prove the superiority and robustness of the proposed CaEGCN.
arXiv Detail & Related papers (2021-01-18T05:21:59Z)
Sequential Hierarchical Learning with Distribution Transformation for Image Super-Resolution [83.70890515772456]
We build a sequential hierarchical learning super-resolution network (SHSR) for effective image SR. We consider the inter-scale correlations of features, and devise a sequential multi-scale block (SMB) to progressively explore the hierarchical information. Experiment results show SHSR achieves superior quantitative performance and visual quality to state-of-the-art methods.
arXiv Detail & Related papers (2020-07-19T01:35:53Z)
Encoder-Decoder Based Convolutional Neural Networks with Multi-Scale-Aware Modules for Crowd Counting [6.893512627479196]
We propose two modified neural networks for accurate and efficient crowd counting. The first model is named M-SFANet, which is attached with atrous spatial pyramid pooling (ASPP) and context-aware module (CAN) The second model is called M-SegNet, which is produced by replacing the bilinear upsampling in SFANet with max unpooling that is used in SegNet.
arXiv Detail & Related papers (2020-03-12T03:00:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.