Glass Segmentation with Fusion of Learned and General Visual Features
- URL: http://arxiv.org/abs/2603.03718v1
- Date: Wed, 04 Mar 2026 04:40:30 GMT
- Title: Glass Segmentation with Fusion of Learned and General Visual Features
- Authors: Risto Ojala, Tristan Ellison, Mo Chen
- Abstract summary: Glass surface segmentation from RGB images is a challenging task, since glass as a transparent material distinctly lacks visual characteristics. This paper presents a novel architecture for glass segmentation, deploying a dual backbone producing general visual features as well as task-specific learned visual features. The architecture was evaluated on four commonly used glass segmentation datasets, achieving state-of-the-art results on several accuracy metrics.
- Score: 2.3821941487858935
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Glass surface segmentation from RGB images is a challenging task, since glass as a transparent material distinctly lacks visual characteristics. However, glass segmentation is critical for scene understanding and robotics, as transparent glass surfaces must be identified as solid material. This paper presents a novel architecture for glass segmentation, deploying a dual backbone that produces general visual features as well as task-specific learned visual features. The general visual features are produced by a frozen DINOv3 vision foundation model, and the task-specific features are generated with a Swin model trained in a supervised manner. The resulting multi-scale feature representations are downsampled with residual Squeeze-and-Excitation Channel Reduction and fed into a Mask2Former decoder, which produces the final segmentation masks. The architecture was evaluated on four commonly used glass segmentation datasets, achieving state-of-the-art results on several accuracy metrics. The model also has a competitive inference speed compared to the previous state-of-the-art method, and surpasses it when using a lighter DINOv3 backbone variant. The implementation source code and model weights are available at: https://github.com/ojalar/lgnet
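The residual Squeeze-and-Excitation channel reduction mentioned in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration with made-up dimensions, not the paper's implementation (which is in the linked repository): a 1x1 projection reduces the channel count, an SE bottleneck produces per-channel gates, and the recalibrated features are added back residually.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_channel_reduction(feat, w_reduce, w1, w2):
    """Sketch of residual Squeeze-and-Excitation channel reduction.

    feat: (C, H, W) feature map from one backbone stage.
    w_reduce: (C_out, C) weights of a 1x1 projection reducing channels.
    w1, w2: SE bottleneck weights, shapes (C_out//r, C_out) and (C_out, C_out//r).
    """
    # 1x1 convolution = per-pixel matrix multiply reducing C -> C_out
    reduced = np.einsum('oc,chw->ohw', w_reduce, feat)
    # Squeeze: global average pooling over the spatial dimensions
    s = reduced.mean(axis=(1, 2))                     # (C_out,)
    # Excitation: bottleneck MLP with sigmoid gating
    gate = sigmoid(w2 @ np.maximum(w1 @ s, 0.0))      # (C_out,)
    # Residual: gated features added back onto the reduced map
    return reduced + reduced * gate[:, None, None]

rng = np.random.default_rng(0)
C, C_out, r = 8, 4, 2
feat = rng.standard_normal((C, 16, 16))
out = se_channel_reduction(
    feat,
    rng.standard_normal((C_out, C)),
    rng.standard_normal((C_out // r, C_out)),
    rng.standard_normal((C_out, C_out // r)),
)
print(out.shape)  # (4, 16, 16)
```

In the actual architecture, one such module would presumably sit on each of the multi-scale feature maps before the Mask2Former decoder; the dimensions above are purely illustrative.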
Related papers
- Power of Boundary and Reflection: Semantic Transparent Object Segmentation using Pyramid Vision Transformer with Transparent Cues [35.65981887193136]
We propose incorporating powerful visual cues via the Boundary Feature Enhancement and Reflection Feature Enhancement modules. Our proposed framework, TransCues, is a pyramidal transformer encoder-decoder architecture to segment transparent objects. Our method outperforms the state-of-the-art by a large margin, achieving +4.2% mIoU on Trans10K-v2, +5.6% mIoU on MSD, +10.1% mIoU on RGBD-Mirror, +13.1% mIoU on TROSD, and +8.3% mIoU on Stanford2D3D.
arXiv Detail & Related papers (2025-12-07T22:52:53Z)
- 3D Part Segmentation via Geometric Aggregation of 2D Visual Features [57.20161517451834]
Supervised 3D part segmentation models are tailored for a fixed set of objects and parts, limiting their transferability to open-set, real-world scenarios. Recent works have explored vision-language models (VLMs) as a promising alternative, using multi-view rendering and textual prompting to identify object parts. To address these limitations, we propose COPS, a COmprehensive model for Parts that blends semantics extracted from visual concepts and 3D geometry to effectively identify object parts.
arXiv Detail & Related papers (2024-12-05T15:27:58Z)
- Glass Segmentation with Multi Scales and Primary Prediction Guiding [2.66512000865131]
Glass-like objects can be seen everywhere in our daily life, yet they are hard for existing methods to segment.
We propose MGNet, which incorporates a Fine-Rescaling and Merging module (FRM) to improve the ability to extract semantics.
We supervise the model with a novel uncertainty-aware loss function to produce high-confidence segmentation maps.
arXiv Detail & Related papers (2024-02-13T16:14:32Z)
- GEM: Boost Simple Network for Glass Surface Segmentation via Segment Anything Model and Data Synthesis [3.97478982737167]
We show how to segment glass surfaces with higher accuracy using two visual foundation models: Segment Anything (SAM) and Stable Diffusion.
We also propose a Synthetic but large-scale Glass Surface Detection dataset dubbed S-GSD via diffusion model with four different scales.
This dataset is a feasible source for transfer learning. The scale of synthetic data has a positive impact on transfer learning, while the improvement gradually saturates as the amount of data increases.
arXiv Detail & Related papers (2024-01-27T03:36:47Z)
- Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation [67.07112533415116]
We present a novel framework that adapts various foundational models for the 3D point cloud segmentation task.
Our approach involves making initial predictions of 2D semantic masks using different large vision models.
To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting.
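The voting-based label fusion described above can be illustrated with a toy sketch (function name and array layout are assumptions for illustration, not the paper's API): each 3D point collects the 2D semantic predictions from the views in which it is visible and takes the majority label as its pseudo label.

```python
import numpy as np

def fuse_labels_by_voting(per_view_labels, num_classes):
    """Majority-vote fusion of per-view semantic labels for each 3D point.

    per_view_labels: (V, N) array; entry [v, n] is the class predicted for
    point n from view v, or -1 if the point is not visible in that view.
    Returns: (N,) fused pseudo labels.
    """
    V, N = per_view_labels.shape
    votes = np.zeros((N, num_classes), dtype=int)
    for v in range(V):
        visible = per_view_labels[v] >= 0
        # Scatter-add one vote per visible point for its predicted class
        np.add.at(votes, (np.nonzero(visible)[0], per_view_labels[v, visible]), 1)
    return votes.argmax(axis=1)

labels = np.array([
    [0, 1, 2, -1],   # predictions from view 1 (-1 = point not visible)
    [0, 1, 1,  2],   # predictions from view 2
    [1, 1, 2,  2],   # predictions from view 3
])
fused = fuse_labels_by_voting(labels, num_classes=3)
print(fused)  # [0 1 2 2]
```

Point 3 is occluded in view 1, so only the two visible views vote for it; this is the kind of robustness to per-view errors that voting provides.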
arXiv Detail & Related papers (2023-11-03T15:41:15Z)
- GEM: Boost Simple Network for Glass Surface Segmentation via Vision Foundation Models [7.423981028880871]
Glass surface detection is a challenging task due to the inherent ambiguity in their transparency and reflective characteristics.
We propose to address these issues by fully harnessing the capabilities of two existing vision foundation models (VFMs): Stable Diffusion and Segment Anything Model (SAM)
Our GEM establishes a new state-of-the-art performance with the help of these two VFMs, surpassing the best-reported method GlassSemNet with an IoU improvement of 2.1%.
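IoU, the metric behind the reported 2.1% improvement, is straightforward to compute from binary segmentation masks. A minimal sketch (the empty-mask convention is a common choice, not necessarily the one used in the paper):

```python
import numpy as np

def iou(pred, target):
    """Intersection over Union for binary segmentation masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return inter / union if union else 1.0  # two empty masks count as perfect

pred   = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
print(iou(pred, target))  # intersection 2, union 4 -> 0.5
```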
arXiv Detail & Related papers (2023-07-22T08:37:23Z)
- MvDeCor: Multi-view Dense Correspondence Learning for Fine-grained 3D Segmentation [91.6658845016214]
We propose to utilize self-supervised techniques in the 2D domain for fine-grained 3D shape segmentation tasks.
We render a 3D shape from multiple views, and set up a dense correspondence learning task within the contrastive learning framework.
As a result, the learned 2D representations are view-invariant and geometrically consistent.
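A dense correspondence objective of this kind is often an InfoNCE-style contrastive loss over matched pixel embeddings from two renders. The sketch below is illustrative (shapes, normalization, and temperature are assumptions, not MvDeCor's exact formulation): corresponding pixels across views are positives, all other pixels in the second view are negatives.

```python
import numpy as np

def dense_correspondence_loss(emb_a, emb_b, temperature=0.1):
    """InfoNCE-style loss over dense pixel correspondences.

    emb_a, emb_b: (N, D) L2-normalized embeddings of N corresponding pixels
    from two renders; row i of emb_a matches row i of emb_b.
    """
    # Scaled cosine similarities between every pixel pair across the views
    logits = emb_a @ emb_b.T / temperature           # (N, N)
    # Softmax cross-entropy with the diagonal (true matches) as targets
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(1)
emb = rng.standard_normal((32, 16))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
other = rng.standard_normal((32, 16))
other /= np.linalg.norm(other, axis=1, keepdims=True)

# Matched views yield a much lower loss than unrelated embeddings
same = dense_correspondence_loss(emb, emb)
rand = dense_correspondence_loss(emb, other)
print(same < rand)  # True
```

Minimizing this loss pulls corresponding pixels together across views, which is what makes the learned 2D representations view-invariant and geometrically consistent.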
arXiv Detail & Related papers (2022-08-18T00:48:15Z)
- Leveraging RGB-D Data with Cross-Modal Context Mining for Glass Surface Detection [47.87834602551456]
Glass surfaces are becoming increasingly ubiquitous as modern buildings tend to use a lot of glass panels. This poses substantial challenges to the operations of autonomous systems such as robots, self-driving cars, and drones. We propose a novel glass surface detection framework combining RGB and depth information.
arXiv Detail & Related papers (2022-06-22T17:56:09Z)
- Unsupervised Learning of 3D Object Categories from Videos in the Wild [75.09720013151247]
We focus on learning a model from multiple views of a large collection of object instances.
We propose a new neural network design, called warp-conditioned ray embedding (WCR), which significantly improves reconstruction.
Our evaluation demonstrates performance improvements over several deep monocular reconstruction baselines on existing benchmarks.
arXiv Detail & Related papers (2021-03-30T17:57:01Z)
- Enhanced Boundary Learning for Glass-like Object Segmentation [55.45473926510806]
This paper aims to solve the glass-like object segmentation problem via enhanced boundary learning.
In particular, we first propose a novel refined differential module for generating finer boundary cues.
An edge-aware point-based graph convolution network module is proposed to model the global shape representation along the boundary.
arXiv Detail & Related papers (2021-03-29T16:18:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.