GAPNet: A Lightweight Framework for Image and Video Salient Object Detection via Granularity-Aware Paradigm
- URL: http://arxiv.org/abs/2508.07585v1
- Date: Mon, 11 Aug 2025 03:30:59 GMT
- Title: GAPNet: A Lightweight Framework for Image and Video Salient Object Detection via Granularity-Aware Paradigm
- Authors: Yu-Huan Wu, Wei Liu, Zi-Xuan Zhu, Zizhou Wang, Yong Liu, Liangli Zhen
- Abstract summary: GAPNet is a lightweight network built on the granularity-aware paradigm for both image and video salient object detection. Our decoder is built with granularity-aware connections which fuse high-level features of low granularity and low-level features of high granularity. On top of the encoder, a self-attention module is built to learn global information, enabling accurate object localization with negligible computational cost.
- Score: 19.018378225862474
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent salient object detection (SOD) models predominantly rely on heavyweight backbones, incurring substantial computational cost and hindering their practical application in various real-world settings, particularly on edge devices. This paper presents GAPNet, a lightweight network built on the granularity-aware paradigm for both image and video SOD. We assign saliency maps of different granularities to supervise the multi-scale decoder side-outputs: coarse object locations for high-level outputs and fine-grained object boundaries for low-level outputs. Specifically, our decoder is built with granularity-aware connections which fuse high-level features of low granularity and low-level features of high granularity, respectively. To support these connections, we design granular pyramid convolution (GPC) and cross-scale attention (CSA) modules for efficient fusion of low-scale and high-scale features, respectively. On top of the encoder, a self-attention module is built to learn global information, enabling accurate object localization with negligible computational cost. Unlike traditional U-Net-based approaches, our proposed method optimizes feature utilization and semantic interpretation while applying appropriate supervision at each processing stage. Extensive experiments show that the proposed method achieves a new state-of-the-art performance among lightweight image and video SOD models. Code is available at https://github.com/yuhuan-wu/GAPNet.
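The abstract describes supervising decoder side-outputs with saliency maps of different granularities: coarse object locations for high-level outputs and fine-grained boundaries for low-level outputs. The sketch below illustrates one way such targets could be derived from a single binary ground-truth mask. It is not taken from the GAPNet code: the function names (`erode`, `boundary_target`, `coarse_target`), the 3x3 erosion, and the block max-pooling are all assumptions chosen for illustration.

```python
from typing import List

Mask = List[List[int]]  # binary saliency mask, 1 = salient pixel


def erode(mask: Mask) -> Mask:
    """3x3 erosion: a pixel stays 1 only if its whole 3x3 neighbourhood
    lies inside the image and is 1."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = int(all(
                0 <= y + dy < h and 0 <= x + dx < w and mask[y + dy][x + dx] == 1
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ))
    return out


def boundary_target(mask: Mask) -> Mask:
    """Hypothetical fine-granularity target: mask minus its erosion,
    which keeps only the one-pixel-wide object outline."""
    eroded = erode(mask)
    return [[m - e for m, e in zip(row_m, row_e)]
            for row_m, row_e in zip(mask, eroded)]


def coarse_target(mask: Mask, stride: int) -> Mask:
    """Hypothetical coarse-granularity target: max-pool the mask over
    stride x stride blocks, keeping only the rough object location."""
    h, w = len(mask), len(mask[0])
    return [
        [
            max(mask[y][x]
                for y in range(by, min(by + stride, h))
                for x in range(bx, min(bx + stride, w)))
            for bx in range(0, w, stride)
        ]
        for by in range(0, h, stride)
    ]


# Toy 8x8 ground truth containing a 4x4 salient square.
mask = [[1 if 2 <= y <= 5 and 2 <= x <= 5 else 0 for x in range(8)]
        for y in range(8)]

boundary = boundary_target(mask)       # would supervise a low-level side-output
coarse = coarse_target(mask, stride=2)  # would supervise a high-level side-output
```

In this toy case the boundary target keeps the 12 outline pixels of the square, while the coarse target is a 4x4 map marking only the blocks the object occupies. How GAPNet actually constructs its per-stage targets should be checked against the paper and the linked repository.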
Related papers
- A lightweight detector for real-time detection of remote sensing images [0.0]
DMG-YOLO is a lightweight real-time detector tailored for small object detection in remote sensing images. We introduce a Dual-branch Feature Extraction (DFE) module in the backbone, which partitions feature maps into two parallel branches. In the neck, we introduce the Global and Local Aggregate Feature Pyramid Network (GLAFPN) to further boost small object detection.
arXiv Detail & Related papers (2025-11-21T11:11:04Z)
- PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity [39.98516860109934]
PixelRefer is a unified region-level MLLM framework that enables advanced fine-grained understanding over user-specified regions. Our analysis reveals that global visual tokens contribute mainly in early LLM layers, inspiring the design of PixelRefer-Lite. To facilitate fine-grained instruction tuning, we curate PixelRefer-2.2M, a high-quality object-centric instruction dataset.
arXiv Detail & Related papers (2025-10-27T17:59:32Z)
- GCRPNet: Graph-Enhanced Contextual and Regional Perception Network for Salient Object Detection in Optical Remote Sensing Images [68.33481681452675]
We propose a graph-enhanced contextual and regional perception network (GCRPNet). It builds upon the Mamba architecture to simultaneously capture long-range dependencies and enhance regional feature representation. It performs adaptive patch scanning on feature maps processed via multi-scale convolutions, thereby capturing rich local region information.
arXiv Detail & Related papers (2025-08-14T11:31:43Z)
- MGDFIS: Multi-scale Global-detail Feature Integration Strategy for Small Object Detection [10.135137525886098]
Small object detection in UAV imagery is crucial for applications such as search-and-rescue, traffic monitoring, and environmental surveillance. Existing multi-scale fusion methods help, but add computational burden and blur fine details. We propose a unified fusion framework that tightly couples global context with local detail to boost detection performance.
arXiv Detail & Related papers (2025-06-15T02:54:25Z)
- Optimized Unet with Attention Mechanism for Multi-Scale Semantic Segmentation [8.443350618722564]
This paper proposes an improved Unet model combined with an attention mechanism. It introduces channel attention and spatial attention modules, enhancing the model's ability to focus on important features. The improved model performs well in terms of mIoU and pixel accuracy (PA), reaching 76.5% and 95.3% respectively.
arXiv Detail & Related papers (2025-02-06T06:51:23Z)
- MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning [44.497776004372724]
Multi-modal large language models (MLLMs) have made significant strides in various visual understanding tasks.
We present MG-LLaVA, an innovative MLLM that enhances the model's visual processing capabilities by incorporating a multi-granularity vision flow.
To further refine the model's object recognition abilities, we incorporate object-level features derived from bounding boxes identified by offline detectors.
arXiv Detail & Related papers (2024-06-25T17:55:11Z)
- Multi-view Aggregation Network for Dichotomous Image Segmentation [76.75904424539543]
Dichotomous Image Segmentation (DIS) has recently emerged as a task targeting high-precision object segmentation from high-resolution natural images.
Existing methods rely on tedious multiple encoder-decoder streams and stages to gradually complete the global localization and local refinement.
Inspired by it, we model DIS as a multi-view object perception problem and provide a parsimonious multi-view aggregation network (MVANet).
Experiments on the popular DIS-5K dataset show that our MVANet significantly outperforms state-of-the-art methods in both accuracy and speed.
arXiv Detail & Related papers (2024-04-11T03:00:00Z)
- Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
- LENet: Lightweight And Efficient LiDAR Semantic Segmentation Using Multi-Scale Convolution Attention [0.0]
We propose a projection-based semantic segmentation network called LENet with an encoder-decoder structure for LiDAR-based semantic segmentation.
The encoder is composed of a novel multi-scale convolutional attention (MSCA) module with varying receptive field sizes to capture features.
We show that our proposed method is lighter, more efficient, and robust compared to state-of-the-art semantic segmentation methods.
arXiv Detail & Related papers (2023-01-11T02:51:38Z)
- Lightweight Salient Object Detection in Optical Remote-Sensing Images via Semantic Matching and Edge Alignment [61.45639694373033]
We propose a novel lightweight network for salient object detection in optical remote sensing images (ORSI-SOD) based on semantic matching and edge alignment, termed SeaNet.
Specifically, SeaNet includes a lightweight MobileNet-V2 for feature extraction, a dynamic semantic matching module (DSMM) for high-level features, and a portable decoder for inference.
arXiv Detail & Related papers (2023-01-07T04:33:51Z)
- EDN: Salient Object Detection via Extremely-Downsampled Network [66.38046176176017]
We introduce an Extremely-Downsampled Network (EDN), which employs an extreme downsampling technique to effectively learn a global view of the whole image.
Experiments demonstrate that EDN achieves state-of-the-art performance with real-time speed.
arXiv Detail & Related papers (2020-12-24T04:23:48Z)
- A Holistically-Guided Decoder for Deep Representation Learning with Applications to Semantic Segmentation and Object Detection [74.88284082187462]
One common strategy is to adopt dilated convolutions in the backbone networks to extract high-resolution feature maps.
We propose one novel holistically-guided decoder which is introduced to obtain the high-resolution semantic-rich feature maps.
arXiv Detail & Related papers (2020-12-18T10:51:49Z)