USAM-Net: A U-Net-based Network for Improved Stereo Correspondence and Scene Depth Estimation using Features from a Pre-trained Image Segmentation network
- URL: http://arxiv.org/abs/2503.14950v1
- Date: Wed, 19 Mar 2025 07:29:02 GMT
- Title: USAM-Net: A U-Net-based Network for Improved Stereo Correspondence and Scene Depth Estimation using Features from a Pre-trained Image Segmentation network
- Authors: Joseph Emmanuel DL Dayo, Prospero C. Naval Jr.
- Abstract summary: We introduce a novel convolutional neural network that integrates stereo image inputs with semantic segmentation maps and attention to enhance depth estimation performance. USAM-Net achieves superior performance metrics, including a Global Difference (GD) of 3.61% and an End-Point Error (EPE) of 0.88.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The increasing demand for high-accuracy depth estimation in autonomous driving and augmented reality applications necessitates advanced neural architectures capable of effectively leveraging multiple data modalities. In this context, we introduce the Unified Segmentation Attention Mechanism Network (USAM-Net), a novel convolutional neural network that integrates stereo image inputs with semantic segmentation maps and attention to enhance depth estimation performance. USAM-Net employs a dual-pathway architecture, which combines a pre-trained segmentation model (SAM) and a depth estimation model. The segmentation pathway preprocesses the stereo images to generate semantic masks, which are then concatenated with the stereo images as inputs to the depth estimation pathway. This integration allows the model to focus on important features such as object boundaries and surface textures, which are crucial for accurate depth perception. Empirical evaluation on the DrivingStereo dataset demonstrates that USAM-Net achieves superior performance metrics, including a Global Difference (GD) of 3.61% and an End-Point Error (EPE) of 0.88, outperforming traditional models such as CFNet, SegStereo, and iResNet. These results underscore the effectiveness of integrating segmentation information into stereo depth estimation tasks, highlighting the potential of USAM-Net in applications demanding high-precision depth data.
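The input construction and the EPE metric mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the channel ordering, the use of single-channel masks, and the convention that ground-truth pixels with a value of 0 are invalid are all assumptions made for illustration. EPE is computed here in its usual sense for stereo matching, the mean absolute disparity error over valid pixels.

```python
import numpy as np


def build_depth_input(left, right, left_mask, right_mask):
    """Concatenate a stereo pair and its segmentation masks channel-wise.

    left, right:            (H, W, 3) image arrays
    left_mask, right_mask:  (H, W) mask arrays (assumed single-channel)
    Returns an (H, W, 8) array; the channel order is an assumption.
    """
    if left.shape != right.shape:
        raise ValueError("stereo images must have the same shape")
    if left_mask.shape != left.shape[:2] or right_mask.shape != right.shape[:2]:
        raise ValueError("masks must match the image height and width")
    # Give each mask a trailing channel axis so it can be stacked with RGB.
    masks = [m[..., np.newaxis].astype(left.dtype) for m in (left_mask, right_mask)]
    return np.concatenate([left, right, *masks], axis=-1)


def end_point_error(pred, gt, valid=None):
    """Mean absolute disparity error over valid ground-truth pixels.

    By default a pixel counts as valid when its ground-truth value is > 0
    (an assumed convention for sparse ground truth).
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if valid is None:
        valid = gt > 0
    if not np.any(valid):
        raise ValueError("no valid ground-truth pixels")
    return float(np.abs(pred - gt)[valid].mean())


if __name__ == "__main__":
    h, w = 4, 6
    left = np.zeros((h, w, 3), dtype=np.float32)
    right = np.zeros((h, w, 3), dtype=np.float32)
    mask = np.ones((h, w), dtype=np.uint8)
    print(build_depth_input(left, right, mask, mask).shape)

    gt = np.array([[10.0, 0.0], [20.0, 30.0]])
    pred = np.array([[11.0, 5.0], [20.0, 28.0]])
    print(end_point_error(pred, gt))
```

In a real pipeline the masks would come from the pre-trained segmentation model and the stacked array would be fed to the depth estimation network; both are outside the scope of this sketch.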
Related papers
- Optimized Unet with Attention Mechanism for Multi-Scale Semantic Segmentation [8.443350618722564]
This paper proposes an improved Unet model combined with an attention mechanism. It introduces channel attention and spatial attention modules, enhancing the model's ability to focus on important features. The improved model performs well in terms of mIoU and pixel accuracy (PA), reaching 76.5% and 95.3% respectively.
arXiv Detail & Related papers (2025-02-06T06:51:23Z)
- Enhanced Encoder-Decoder Architecture for Accurate Monocular Depth Estimation [0.0]
This paper introduces a novel deep learning-based approach using an enhanced encoder-decoder architecture. It incorporates multi-scale feature extraction to enhance depth prediction accuracy across various object sizes and distances. Experimental results on the KITTI dataset show that our model achieves a significantly faster inference time of 0.019 seconds.
arXiv Detail & Related papers (2024-10-15T13:46:19Z)
- LAC-Net: Linear-Fusion Attention-Guided Convolutional Network for Accurate Robotic Grasping Under the Occlusion [79.22197702626542]
This paper introduces a framework that explores amodal segmentation for robotic grasping in cluttered scenes.
We propose a Linear-fusion Attention-guided Convolutional Network (LAC-Net).
The results on different datasets show that our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-08-06T14:50:48Z)
- FSNet: Redesign Self-Supervised MonoDepth for Full-Scale Depth Prediction for Autonomous Driving [18.02943016671203]
This study proposes a comprehensive self-supervised framework for accurate scale-aware depth prediction on autonomous driving scenes.
In particular, we introduce a Full-Scale depth prediction network named FSNet.
With FSNet, robots and vehicles with only one well-calibrated camera can collect sequences of training image frames and camera poses, and infer accurate 3D depths of the environment without extra labeling work or 3D data.
arXiv Detail & Related papers (2023-04-21T03:17:04Z)
- 3DVNet: Multi-View Depth Prediction and Volumetric Refinement [68.68537312256144]
3DVNet is a novel multi-view stereo (MVS) depth-prediction method.
Our key idea is the use of a 3D scene-modeling network that iteratively updates a set of coarse depth predictions.
We show that our method exceeds state-of-the-art accuracy in both depth prediction and 3D reconstruction metrics.
arXiv Detail & Related papers (2021-12-01T00:52:42Z)
- MDA-Net: Multi-Dimensional Attention-Based Neural Network for 3D Image Segmentation [4.221871357181261]
We propose a multi-dimensional attention network (MDA-Net) to efficiently integrate slice-wise, spatial, and channel-wise attention into a U-Net based network.
We evaluate our model on the MICCAI iSeg and IBSR datasets, and the experimental results demonstrate consistent improvements over existing methods.
arXiv Detail & Related papers (2021-05-10T16:58:34Z)
- PLADE-Net: Towards Pixel-Level Accuracy for Self-Supervised Single-View Depth Estimation with Neural Positional Encoding and Distilled Matting Loss [49.66736599668501]
We propose a self-supervised single-view pixel-level accurate depth estimation network, called PLADE-Net.
Our method shows unprecedented accuracy levels, exceeding 95% in terms of the $\delta_1$ metric on the KITTI dataset.
arXiv Detail & Related papers (2021-03-12T15:54:46Z)
- ES-Net: An Efficient Stereo Matching Network [4.8986598953553555]
Existing stereo matching networks typically use slow and computationally expensive 3D convolutions to improve the performance.
We propose the Efficient Stereo Network (ESNet), which achieves high performance and efficient inference at the same time.
arXiv Detail & Related papers (2021-03-05T20:11:39Z)
- Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks [87.50632573601283]
We present a novel method for multi-view depth estimation from a single video.
Our method achieves temporally coherent depth estimation results by using a novel Epipolar Spatio-Temporal (EST) transformer.
To reduce the computational cost, inspired by recent Mixture-of-Experts models, we design a compact hybrid network.
arXiv Detail & Related papers (2020-11-26T04:04:21Z)
- SideInfNet: A Deep Neural Network for Semi-Automatic Semantic Segmentation with Side Information [83.03179580646324]
This paper proposes a novel deep neural network architecture, namely SideInfNet.
It integrates features learnt from images with side information extracted from user annotations.
To evaluate our method, we applied the proposed network to three semantic segmentation tasks and conducted extensive experiments on benchmark datasets.
arXiv Detail & Related papers (2020-02-07T06:10:54Z)
- Depthwise Non-local Module for Fast Salient Object Detection Using a Single Thread [136.2224792151324]
We propose a new deep learning algorithm for fast salient object detection.
The proposed algorithm achieves competitive accuracy and high inference efficiency simultaneously with a single CPU thread.
arXiv Detail & Related papers (2020-01-22T15:23:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.