Rapid Salient Object Detection with Difference Convolutional Neural Networks
- URL: http://arxiv.org/abs/2507.01182v1
- Date: Tue, 01 Jul 2025 20:41:05 GMT
- Title: Rapid Salient Object Detection with Difference Convolutional Neural Networks
- Authors: Zhuo Su, Li Liu, Matthias Müller, Jiehua Zhang, Diana Wofk, Ming-Ming Cheng, Matti Pietikäinen
- Abstract summary: This paper addresses the challenge of deploying salient object detection (SOD) on resource-constrained devices with real-time performance. We propose an efficient network design that combines traditional wisdom on SOD with the representation power of modern CNNs.
- Score: 49.838283141381716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses the challenge of deploying salient object detection (SOD) on resource-constrained devices with real-time performance. While recent advances in deep neural networks have improved SOD, existing top-leading models are computationally expensive. We propose an efficient network design that combines traditional wisdom on SOD with the representation power of modern CNNs. Like biologically inspired classical SOD methods that rely on computing contrast cues to determine the saliency of image regions, our model leverages Pixel Difference Convolutions (PDCs) to encode feature contrasts. Unlike those methods, the PDCs are incorporated into a CNN architecture so that the valuable contrast cues are extracted from rich feature maps. For efficiency, we introduce a difference convolution reparameterization (DCR) strategy that embeds PDCs into standard convolutions, eliminating the extra computation and parameters at inference. Additionally, we introduce SpatioTemporal Difference Convolution (STDC) for video SOD, enhancing the standard 3D convolution with spatiotemporal contrast capture. Our models, SDNet for image SOD and STDNet for video SOD, achieve significant improvements in the efficiency-accuracy trade-off. On a Jetson Orin device, our models with $<$ 1M parameters operate at 46 FPS and 150 FPS on streamed images and videos, surpassing the second-best lightweight models in our experiments by more than $2\times$ and $3\times$ in speed with superior accuracy. Code will be available at https://github.com/hellozhuo/stdnet.git.
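To make the core idea concrete, below is a minimal sketch, assuming the central-difference variant of PDC: the layer computes $y(p_0) = \sum_{p \in \mathcal{N}(p_0)} w(p)\,\big(x(p) - x(p_0)\big)$, which algebraically equals an ordinary convolution whose center tap is replaced by $w(p_0) - \sum_{p} w(p)$. That identity is what a DCR-style folding exploits, making the contrast computation free at inference. The class and method names (`CentralPDC`, `reparameterize`) are illustrative, not the authors' implementation; see the linked repository for the released code.

```python
# Illustrative sketch (not the paper's released code): a central
# pixel-difference convolution and its folding into a plain 3x3 conv.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentralPDC(nn.Module):
    """y(p0) = sum_p w(p) * (x(p) - x(p0)) over each 3x3 neighborhood."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, 3, 3) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Training-time form: a standard conv on the features minus a
        # 1x1 conv carrying the aggregated "center" term sum_p w(p) * x(p0).
        y = F.conv2d(x, self.weight, padding=1)
        center = self.weight.sum(dim=(2, 3), keepdim=True)  # (out, in, 1, 1)
        return y - F.conv2d(x, center)

    def reparameterize(self) -> nn.Conv2d:
        # DCR-style folding: w'(p) = w(p) for p != p0 and
        # w'(p0) = w(p0) - sum_p w(p), leaving one ordinary convolution.
        w = self.weight.detach().clone()
        w[:, :, 1, 1] -= self.weight.detach().sum(dim=(2, 3))
        conv = nn.Conv2d(w.size(1), w.size(0), 3, padding=1, bias=False)
        conv.weight.data.copy_(w)
        return conv

# Sanity check: the folded conv reproduces the PDC output exactly.
x = torch.randn(1, 8, 32, 32)
pdc = CentralPDC(8, 16).eval()
with torch.no_grad():
    assert torch.allclose(pdc(x), pdc.reparameterize()(x), atol=1e-5)
```

The STDC variant described in the abstract extends the same contrast-capture idea to the spatiotemporal neighborhood of a standard 3D convolution for video SOD.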
Related papers
- An Efficient 3D Convolutional Neural Network with Channel-wise, Spatial-grouped, and Temporal Convolutions [3.798710743290466]
We introduce a simple and very efficient 3D convolutional neural network for video action recognition. We evaluate the performance and efficiency of our proposed network on several video action recognition datasets.
arXiv Detail & Related papers (2025-03-02T08:47:06Z)
- ApproxDARTS: Differentiable Neural Architecture Search with Approximate Multipliers [0.24578723416255746]
We present ApproxDARTS, a neural architecture search (NAS) method enabling the popular differentiable neural architecture search method called DARTS to exploit approximate multipliers.
We show that ApproxDARTS can perform a complete architecture search in less than $10$ GPU hours and produce competitive convolutional neural networks (CNNs) containing approximate multipliers in their convolutional layers.
arXiv Detail & Related papers (2024-04-08T09:54:57Z)
- PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [111.16358607889609]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representations. For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, demonstrating its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
- Scalable Neural Video Representations with Learnable Positional Features [73.51591757726493]
We show how to train neural representations with learnable positional features (NVP) that effectively amortize a video as latent codes.
We demonstrate the superiority of NVP on the popular UVG benchmark; compared with prior art, NVP not only trains 2$\times$ faster (in less than 5 minutes) but also exceeds their encoding quality, improving PSNR from $34.07$ to $34.57$.
arXiv Detail & Related papers (2022-10-13T08:15:08Z)
- In Defense of Image Pre-Training for Spatiotemporal Recognition [32.56468478601864]
The key to effectively leveraging image pre-training lies in decomposing the learning of spatial and temporal features.
The new pipeline consistently achieves better results on video recognition with a significant speedup.
arXiv Detail & Related papers (2022-05-03T18:45:44Z)
- 3D CNNs with Adaptive Temporal Feature Resolutions [83.43776851586351]
The Similarity Guided Sampling (SGS) module can be plugged into any existing 3D CNN architecture.
SGS empowers 3D CNNs by learning the similarity of temporal features and grouping similar features together.
Our evaluations show that the proposed module improves the state-of-the-art by reducing the computational cost (GFLOPs) by half while preserving or even improving the accuracy.
arXiv Detail & Related papers (2020-11-17T14:34:05Z)
- RT3D: Achieving Real-Time Execution of 3D Convolutional Neural Networks on Mobile Devices [57.877112704841366]
This paper proposes RT3D, a model compression and mobile acceleration framework for 3D CNNs.
For the first time, real-time execution of 3D CNNs is achieved on off-the-shelf mobile devices.
arXiv Detail & Related papers (2020-07-20T02:05:32Z)
- Disp R-CNN: Stereo 3D Object Detection via Shape Prior Guided Instance Disparity Estimation [51.17232267143098]
We propose a novel system named Disp R-CNN for 3D object detection from stereo images.
We use a statistical shape model to generate dense disparity pseudo-ground-truth without the need for LiDAR point clouds.
Experiments on the KITTI dataset show that, even when LiDAR ground-truth is not available at training time, Disp R-CNN achieves competitive performance and outperforms previous state-of-the-art methods by 20% in terms of average precision.
arXiv Detail & Related papers (2020-04-07T17:48:45Z)
- FADNet: A Fast and Accurate Network for Disparity Estimation [18.05392578461659]
We propose an efficient and accurate deep network for disparity estimation named FADNet.
It exploits efficient 2D-based correlation layers with stacked blocks to preserve fast computation.
It produces multi-scale predictions, enabling a multi-scale weight-scheduling training technique that improves accuracy.
arXiv Detail & Related papers (2020-03-24T10:27:11Z)
- STH: Spatio-Temporal Hybrid Convolution for Efficient Action Recognition [39.58542259261567]
We present a novel Spatio-Temporal Hybrid network (STH) which simultaneously encodes spatial and temporal video information at a small parameter cost.
Such a design enables efficient spatio-temporal modeling and maintains a small model scale.
STH enjoys performance superiority over 3D CNNs while maintaining an even smaller parameter cost than 2D CNNs.
arXiv Detail & Related papers (2020-03-18T04:46:30Z)
- ZoomNet: Part-Aware Adaptive Zooming Neural Network for 3D Object Detection [69.68263074432224]
We present a novel framework named ZoomNet for stereo imagery-based 3D detection.
The pipeline of ZoomNet begins with an ordinary 2D object detection model which is used to obtain pairs of left-right bounding boxes.
To further exploit the abundant texture cues in RGB images for more accurate disparity estimation, we introduce a conceptually straightforward module -- adaptive zooming.
arXiv Detail & Related papers (2020-03-01T17:18:08Z)