EGSA-PT: Edge-Guided Spatial Attention with Progressive Training for Monocular Depth Estimation and Segmentation of Transparent Objects
- URL: http://arxiv.org/abs/2511.14970v1
- Date: Tue, 18 Nov 2025 23:29:20 GMT
- Title: EGSA-PT: Edge-Guided Spatial Attention with Progressive Training for Monocular Depth Estimation and Segmentation of Transparent Objects
- Authors: Gbenga Omotara, Ramy Farag, Seyed Mohamad Ali Tousi, G. N. DeSouza
- Abstract summary: We introduce Edge-Guided Spatial Attention (EGSA), a fusion mechanism designed to mitigate destructive interactions. On both Syn-TODD and ClearPose benchmarks, EGSA consistently improved depth accuracy over the current state-of-the-art method. Our second contribution is a multi-modal progressive training strategy, where learning transitions from edges derived from RGB images to edges derived from predicted depth images.
- Score: 3.6327828943194937
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transparent object perception remains a major challenge in computer vision research, as transparency confounds both depth estimation and semantic segmentation. Recent work has explored multi-task learning frameworks to improve robustness, yet negative cross-task interactions often hinder performance. In this work, we introduce Edge-Guided Spatial Attention (EGSA), a fusion mechanism designed to mitigate destructive interactions by incorporating boundary information into the fusion of semantic and geometric features. On both the Syn-TODD and ClearPose benchmarks, EGSA consistently improved depth accuracy over the current state-of-the-art method (MODEST) while preserving competitive segmentation performance, with the largest improvements appearing in transparent regions. Beyond our fusion design, our second contribution is a multi-modal progressive training strategy in which learning transitions from edges derived from RGB images to edges derived from predicted depth images. This approach allows the system to bootstrap learning from the rich textures in RGB images and then switch to the more relevant geometric content in depth maps, while eliminating the need for ground-truth depth at training time. Together, these contributions highlight edge-guided fusion as a robust approach for improving transparent object perception.
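The abstract describes the EGSA fusion block and the progressive edge-source schedule only at a high level. The sketch below is one plausible, minimal reading of those two ideas in PyTorch: an edge map gates both the semantic and geometric feature streams before they are fused, and a simple toggle switches the edge source from RGB-derived to predicted-depth-derived edges partway through training. The module structure, channel sizes, helper names (`EdgeGuidedSpatialAttention`, `edge_source`), and the switch epoch are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only: edge-guided spatial attention fusion plus a
# progressive edge-source schedule, loosely following the abstract.
# Names, channel sizes, and the switch epoch are assumptions.
import torch
import torch.nn as nn


class EdgeGuidedSpatialAttention(nn.Module):
    """Fuses semantic and geometric features under an edge-derived spatial mask."""

    def __init__(self, channels: int):
        super().__init__()
        # Map a single-channel edge map to a spatial attention mask in [0, 1].
        self.edge_to_attn = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )
        # Merge the two re-weighted modalities back into a single feature map.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, sem_feat: torch.Tensor, geo_feat: torch.Tensor,
                edge_map: torch.Tensor) -> torch.Tensor:
        # sem_feat, geo_feat: (B, C, H, W); edge_map: (B, 1, H, W).
        attn = self.edge_to_attn(edge_map)
        sem = sem_feat * attn   # emphasise boundary regions in the semantic stream
        geo = geo_feat * attn   # and in the geometric stream
        return self.fuse(torch.cat([sem, geo], dim=1))


def edge_source(epoch: int, switch_epoch: int = 20) -> str:
    """Progressive-training toggle: RGB-derived edges early in training, then
    edges computed from the network's own predicted depth (switch_epoch is an
    assumed hyperparameter, not taken from the paper)."""
    return "rgb" if epoch < switch_epoch else "predicted_depth"
```

In this reading, the edge map acts purely as a spatial gate on both streams before fusion, which is one plausible way boundary information could suppress destructive cross-task interactions; since the later edges come from predicted rather than ground-truth depth, the schedule is consistent with the paper's claim of not needing ground-truth depth at training time.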
Related papers
- A Mutual Learning Method for Salient Object Detection with intertwined Multi-Supervision--Revised [67.61878540090116]
We propose to train saliency detection networks by exploiting supervision not only from salient object detection, but also from foreground contour detection and edge detection. First, we leverage the salient object detection and foreground contour detection tasks in an intertwined manner to generate saliency maps with uniform highlight. Second, the foreground contour and edge detection tasks guide each other simultaneously, leading to precise foreground contour prediction and reducing local noise in edge prediction.
arXiv Detail & Related papers (2025-09-21T22:30:32Z) - DCIRNet: Depth Completion with Iterative Refinement for Dexterous Grasping of Transparent and Reflective Objects [9.235004977824026]
We propose DCIRNet, a novel multimodal depth completion network for transparent and reflective objects. Our approach incorporates an innovative multimodal feature fusion module designed to extract complementary information between RGB images and incomplete depth maps. We achieve a 44% improvement in the grasp success rate for transparent and reflective objects.
arXiv Detail & Related papers (2025-06-11T08:04:22Z) - Occlusion Boundary and Depth: Mutual Enhancement via Multi-Task Learning [3.4174356345935393]
We propose MoDOT, a novel method that jointly estimates depth and occlusion boundaries (OBs) from a single image. MoDOT incorporates a new module, CASM, which combines cross-attention and multi-scale strip convolutions to leverage mid-level OB features. Experiments demonstrate the mutual benefits of jointly estimating depth and OBs, and validate the effectiveness of MoDOT's design.
arXiv Detail & Related papers (2025-05-27T14:15:19Z) - DepthMatch: Semi-Supervised RGB-D Scene Parsing through Depth-Guided Regularization [43.974708665104565]
We introduce DepthMatch, a semi-supervised learning framework that is specifically designed for RGB-D scene parsing. We propose complementary patch mix-up augmentation to explore the latent relationships between texture and spatial features in RGB-D image pairs. We also design a lightweight spatial prior injector to replace traditional complex fusion modules, improving the efficiency of heterogeneous feature fusion.
arXiv Detail & Related papers (2025-05-26T14:26:31Z) - DistillGrasp: Integrating Features Correlation with Knowledge Distillation for Depth Completion of Transparent Objects [4.939414800373192]
RGB-D cameras cannot accurately capture the depth of transparent objects.
Recent studies tend to explore new visual features and design complex networks to reconstruct the depth.
We propose an efficient depth completion network named DistillGrasp, which distills knowledge from the teacher branch to the student branch.
arXiv Detail & Related papers (2024-08-01T07:17:10Z) - Towards Deeply Unified Depth-aware Panoptic Segmentation with Bi-directional Guidance Learning [63.63516124646916]
We propose a deeply unified framework for depth-aware panoptic segmentation.
We propose a bi-directional guidance learning approach to facilitate cross-task feature learning.
Our method sets the new state of the art for depth-aware panoptic segmentation on both Cityscapes-DVPS and SemKITTI-DVPS datasets.
arXiv Detail & Related papers (2023-07-27T11:28:33Z) - Towards Reliable Image Outpainting: Learning Structure-Aware Multimodal Fusion with Depth Guidance [49.94504248096527]
We propose a Depth-Guided Outpainting Network (DGONet) to model the feature representations of different modalities.
Two components are designed: 1) the Multimodal Learning Module produces unique depth and RGB feature representations from the perspectives of different modal characteristics.
We specially design an additional constraint strategy consisting of Cross-modal Loss and Edge Loss to enhance ambiguous contours and expedite reliable content generation.
arXiv Detail & Related papers (2022-04-12T06:06:50Z) - Cross-modality Discrepant Interaction Network for RGB-D Salient Object Detection [78.47767202232298]
We propose a novel Cross-modality Discrepant Interaction Network (CDINet) for RGB-D SOD.
Two components are designed to implement effective cross-modality interaction.
Our network outperforms 15 state-of-the-art methods both quantitatively and qualitatively.
arXiv Detail & Related papers (2021-08-04T11:24:42Z) - Accurate RGB-D Salient Object Detection via Collaborative Learning [101.82654054191443]
RGB-D saliency detection shows impressive ability in some challenging scenarios.
We propose a novel collaborative learning framework where edge, depth and saliency are leveraged in a more efficient way.
arXiv Detail & Related papers (2020-07-23T04:33:36Z) - Saliency Enhancement using Gradient Domain Edges Merging [65.90255950853674]
We develop a method that merges edges with saliency maps to improve saliency performance.
This leads to our proposed saliency enhancement using edges (SEE), with an average improvement of at least 3.4 times on the DUT-OMRON dataset.
The SEE algorithm is split into two parts: SEE-Pre for preprocessing and SEE-Post for postprocessing.
arXiv Detail & Related papers (2020-02-11T14:04:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.