Related papers: O2Former:Direction-Aware and Multi-Scale Query Enhancement for SAR Ship Instance Segmentation

O2Former:Direction-Aware and Multi-Scale Query Enhancement for SAR Ship Instance Segmentation

URL: http://arxiv.org/abs/2506.11913v1
Date: Fri, 13 Jun 2025 16:06:51 GMT
Title: O2Former:Direction-Aware and Multi-Scale Query Enhancement for SAR Ship Instance Segmentation
Authors: F. Gao, Y Li, X He, J Sun, J Wang,
Abstract summary: Instance segmentation of ships in synthetic aperture radar (SAR) imagery is critical for applications such as maritime monitoring, environmental analysis, and national security.<n> SAR ship images present challenges including scale variation, object density, and fuzzy target boundary.<n>We propose O2Former, a tailored instance segmentation framework that extends Mask2Former by fully leveraging the structural characteristics of SAR imagery.
Score: 0.3611754783778107
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Instance segmentation of ships in synthetic aperture radar (SAR) imagery is critical for applications such as maritime monitoring, environmental analysis, and national security. SAR ship images present challenges including scale variation, object density, and fuzzy target boundary, which are often overlooked in existing methods, leading to suboptimal performance. In this work, we propose O2Former, a tailored instance segmentation framework that extends Mask2Former by fully leveraging the structural characteristics of SAR imagery. We introduce two key components. The first is the Optimized Query Generator(OQG). It enables multi-scale feature interaction by jointly encoding shallow positional cues and high-level semantic information. This improves query quality and convergence efficiency. The second component is the Orientation-Aware Embedding Module(OAEM). It enhances directional sensitivity through direction-aware convolution and polar-coordinate encoding. This effectively addresses the challenge of uneven target orientations in SAR scenes. Together, these modules facilitate precise feature alignment from backbone to decoder and strengthen the model's capacity to capture fine-grained structural details. Extensive experiments demonstrate that O2Former outperforms state of the art instance segmentation baselines, validating its effectiveness and generalization on SAR ship datasets.

Related papers

Dynamic Topology Awareness: Breaking the Granularity Rigidity in Vision-Language Navigation [22.876516699004814]
Vision-Language Navigation in Continuous Environments (VLN-CE) presents a core challenge: grounding high-level linguistic instructions into precise, safe, and long-horizon spatial actions.<n>Explicit topological maps have proven to be a vital solution for providing robust spatial memory in such tasks.<n>Existing topological planning methods suffer from a "Granularity Rigidity" problem.<n>We propose DGNav, a framework for Dynamic Topological Navigation, introducing a context-aware mechanism to modulate map density and connectivity on-the-fly.
arXiv Detail & Related papers (2026-01-29T14:06:23Z)
Re-Densification Meets Cross-Scale Propagation: Real-Time Neural Compression of LiDAR Point Clouds [83.39320394656855]
LiDAR point clouds are fundamental to various applications, yet high-precision scans incur substantial storage and transmission overhead.<n>Existing methods typically convert unordered points into hierarchical octree or voxel structures for dense-to-sparse predictive coding.<n>Our framework comprises two lightweight modules. First, the Geometry Re-Densification Module re-densifies encoded sparse geometry, extracts features at denser scale, and then re-sparsifies the features for predictive coding.
arXiv Detail & Related papers (2025-08-28T06:36:10Z)
GCRPNet: Graph-Enhanced Contextual and Regional Perception Network for Salient Object Detection in Optical Remote Sensing Images [68.33481681452675]
We propose a graph-enhanced contextual and regional perception network (GCRPNet)<n>It builds upon the Mamba architecture to simultaneously capture long-range dependencies and enhance regional feature representation.<n>It performs adaptive patch scanning on feature maps processed via multi-scale convolutions, thereby capturing rich local region information.
arXiv Detail & Related papers (2025-08-14T11:31:43Z)
SAM2-UNeXT: An Improved High-Resolution Baseline for Adapting Foundation Models to Downstream Segmentation Tasks [50.97089872043121]
We propose SAM2-UNeXT, an advanced framework that builds upon the core principles of SAM2-UNet.<n>We extend the representational capacity of SAM2 through the integration of an auxiliary DINOv2 encoder.<n>Our approach enables more accurate segmentation with a simple architecture, relaxing the need for complex decoder designs.
arXiv Detail & Related papers (2025-08-05T15:36:13Z)
CFMD: Dynamic Cross-layer Feature Fusion for Salient Object Detection [7.262250906929891]
Cross-layer feature pyramid networks (CFPNs) have achieved notable progress in multi-scale feature fusion and boundary detail preservation for salient object detection.<n>To address these challenges, we propose CFMD, a novel cross-layer feature pyramid network that introduces two key innovations.<n>First, we design a context-aware feature aggregation module (CFLMA), which incorporates the state-of-the-art Mamba architecture to construct a dynamic weight distribution mechanism.<n>Second, we introduce an adaptive dynamic upsampling unit (CFLMD) that preserves spatial details during resolution recovery.
arXiv Detail & Related papers (2025-04-02T03:22:36Z)
FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion [63.87313550399871]
Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability.<n>We propose Self-supervised Transfer (PST) and FrequencyDe-coupled Fusion module (FreDF)<n>PST establishes cross-modal knowledge transfer through latent space alignment with image foundation models.<n>FreDF explicitly decouples high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches.
arXiv Detail & Related papers (2025-03-25T15:04:53Z)
Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation [50.433911327489554]
The goal of referring remote sensing image segmentation (RRSIS) is to generate a pixel-level mask of the target object identified by the referring expression.<n>To address the aforementioned challenges, a novel RRSIS framework is proposed, termed the cross-modal bidirectional interaction model (CroBIM)<n>To further forster the research of RRSIS, we also construct RISBench, a new large-scale benchmark dataset comprising 52,472 image-language-label triplets.
arXiv Detail & Related papers (2024-10-11T08:28:04Z)
Embracing Events and Frames with Hierarchical Feature Refinement Network for Object Detection [17.406051477690134]
Event cameras output sparse and asynchronous events, providing a potential solution to solve these problems. We propose a novel hierarchical feature refinement network for event-frame fusion. Our method exhibits significantly better robustness when introducing 15 different corruption types to the frame images.
arXiv Detail & Related papers (2024-07-17T14:09:46Z)
GRA: Detecting Oriented Objects through Group-wise Rotating and Attention [64.21917568525764]
Group-wise Rotating and Attention (GRA) module is proposed to replace the convolution operations in backbone networks for oriented object detection. GRA can adaptively capture fine-grained features of objects with diverse orientations, comprising two key components: Group-wise Rotating and Group-wise Attention. GRA achieves a new state-of-the-art (SOTA) on the DOTA-v2.0 benchmark, while saving the parameters by nearly 50% compared to the previous SOTA method.
arXiv Detail & Related papers (2024-03-17T07:29:32Z)
Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image (RRSIS) is a new challenge that combines computer vision and natural language processing. Traditional Referring Image (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery. We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z)
Salient Object Detection in Optical Remote Sensing Images Driven by Transformer [69.22039680783124]
We propose a novel Global Extraction Local Exploration Network (GeleNet) for Optical Remote Sensing Images (ORSI-SOD) Specifically, GeleNet first adopts a transformer backbone to generate four-level feature embeddings with global long-range dependencies. Extensive experiments on three public datasets demonstrate that the proposed GeleNet outperforms relevant state-of-the-art methods.
arXiv Detail & Related papers (2023-09-15T07:14:43Z)
A Mask Attention Interaction and Scale Enhancement Network for SAR Ship Instance Segmentation [4.232332676611087]
We propose a mask attention interaction and scale enhancement network (MAI-SE-Net) for SAR ship instance segmentation. MAI uses an atrous spatial pyra-mid pooling (ASPP) to gain multi-resolution feature re-sponses, a non-local block (NLB) to model long-range spa-tial dependencies, and a concatenation shuffle attention block (CSAB) to improve interaction benefits.
arXiv Detail & Related papers (2022-07-08T14:04:04Z)
Dense Attention Fluid Network for Salient Object Detection in Optical Remote Sensing Images [193.77450545067967]
We propose an end-to-end Dense Attention Fluid Network (DAFNet) for salient object detection in optical remote sensing images (RSIs) A Global Context-aware Attention (GCA) module is proposed to adaptively capture long-range semantic context relationships. We construct a new and challenging optical RSI dataset for SOD that contains 2,000 images with pixel-wise saliency annotations.
arXiv Detail & Related papers (2020-11-26T06:14:10Z)
Align Deep Features for Oriented Object Detection [40.28244152216309]
We propose a single-shot Alignment Network (S$2$A-Net) consisting of two modules: a Feature Alignment Module (FAM) and an Oriented Detection Module (ODM) The FAM can generate high-quality anchors with an Anchor Refinement Network and adaptively align the convolutional features according to the anchor boxes with a novel Alignment Convolution. The ODM first adopts active rotating filters to encode the orientation information and then produces orientation-sensitive and orientation-invariant features to alleviate the inconsistency between classification score and localization accuracy.
arXiv Detail & Related papers (2020-08-21T09:55:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.