Related papers: GEM: Boost Simple Network for Glass Surface Segmentation via Vision Foundation Models

GEM: Boost Simple Network for Glass Surface Segmentation via Vision Foundation Models

URL: http://arxiv.org/abs/2307.12018v2
Date: Tue, 21 May 2024 06:02:39 GMT
Title: GEM: Boost Simple Network for Glass Surface Segmentation via Vision Foundation Models
Authors: Jing Hao, Moyun Liu, Jinrong Yang, Kuo Feng Hung,
Abstract summary: Glass surface detection is a challenging task due to the inherent ambiguity in their transparency and reflective characteristics. We propose to address these issues by fully harnessing the capabilities of two existing vision foundation models (VFMs): Stable Diffusion and Segment Anything Model (SAM) Our GEM establishes a new state-of-the-art performance with the help of these two VFMs, surpassing the best-reported method GlassSemNet with an IoU improvement of 2.1%.
Score: 7.423981028880871
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Detecting glass regions is a challenging task due to the inherent ambiguity in their transparency and reflective characteristics. Current solutions in this field remain rooted in conventional deep learning paradigms, requiring the construction of annotated datasets and the design of network architectures. However, the evident drawback with these mainstream solutions lies in the time-consuming and labor-intensive process of curating datasets, alongside the increasing complexity of model structures. In this paper, we propose to address these issues by fully harnessing the capabilities of two existing vision foundation models (VFMs): Stable Diffusion and Segment Anything Model (SAM). Firstly, we construct a Synthetic but photorealistic large-scale Glass Surface Detection dataset, dubbed S-GSD, without any labour cost via Stable Diffusion. This dataset consists of four different scales, consisting of 168k images totally with precise masks. Besides, based on the powerful segmentation ability of SAM, we devise a simple Glass surface sEgMentor named GEM, which follows the simple query-based encoder-decoder architecture. Comprehensive experiments are conducted on the large-scale glass segmentation dataset GSD-S. Our GEM establishes a new state-of-the-art performance with the help of these two VFMs, surpassing the best-reported method GlassSemNet with an IoU improvement of 2.1%. Additionally, extensive experiments demonstrate that our synthetic dataset S-GSD exhibits remarkable performance in zero-shot and transfer learning settings. Codes, datasets and models are publicly available at: https://github.com/isbrycee/GEM

Related papers

SAM2-UNeXT: An Improved High-Resolution Baseline for Adapting Foundation Models to Downstream Segmentation Tasks [50.97089872043121]
We propose SAM2-UNeXT, an advanced framework that builds upon the core principles of SAM2-UNet.<n>We extend the representational capacity of SAM2 through the integration of an auxiliary DINOv2 encoder.<n>Our approach enables more accurate segmentation with a simple architecture, relaxing the need for complex decoder designs.
arXiv Detail & Related papers (2025-08-05T15:36:13Z)
RemoteSAM: Towards Segment Anything for Earth Observation [29.707796048411705]
We aim to develop a robust yet flexible visual foundation model for Earth observation.<n>It should possess strong capabilities in recognizing and localizing diverse visual targets.<n>We present RemoteSAM, a foundation model that establishes new SoTA on several earth observation perception benchmarks.
arXiv Detail & Related papers (2025-05-23T15:27:57Z)
TUM2TWIN: Introducing the Large-Scale Multimodal Urban Digital Twin Benchmark Dataset [90.97440987655084]
Urban Digital Twins (UDTs) have become essential for managing cities and integrating complex, heterogeneous data from diverse sources.<n>To address these challenges, we introduce the first comprehensive multimodal Urban Digital Twin benchmark dataset: TUM2TWIN.<n>This dataset includes georeferenced, semantically aligned 3D models and networks along with various terrestrial, mobile, aerial, and satellite observations boasting 32 data subsets over roughly 100,000 $m2$ and currently 767 GB of data.
arXiv Detail & Related papers (2025-05-12T09:48:32Z)
SEKI: Self-Evolution and Knowledge Inspiration based Neural Architecture Search via Large Language Models [11.670056503731905]
We introduce SEKI, a novel large language model (LLM)-based neural architecture search (NAS) method. Inspired by the chain-of-thought (CoT) paradigm in modern LLMs, SEKI operates in two key stages: self-evolution and knowledge distillation.
arXiv Detail & Related papers (2025-02-27T09:17:49Z)
UrbanSAM: Learning Invariance-Inspired Adapters for Segment Anything Models in Urban Construction [51.54946346023673]
Urban morphology is inherently complex, with irregular objects of diverse shapes and varying scales. The Segment Anything Model (SAM) has shown significant potential in segmenting complex scenes. We propose UrbanSAM, a customized version of SAM specifically designed to analyze complex urban environments.
arXiv Detail & Related papers (2025-02-21T04:25:19Z)
Monocular Depth Estimation and Segmentation for Transparent Object with Iterative Semantic and Geometric Fusion [9.391182087420926]
We propose a monocular framework, which is the first to excel in both segmentation and depth estimation of transparent objects. Specifically, we devise a novel semantic and geometric fusion module, effectively integrating the multi-scale information between tasks. Experiments on two challenging synthetic and real-world datasets demonstrate that our model surpasses state-of-the-art monocular, stereo, and multi-view methods by a large margin.
arXiv Detail & Related papers (2025-02-20T14:57:01Z)
ZISVFM: Zero-Shot Object Instance Segmentation in Indoor Robotic Environments with Vision Foundation Models [10.858627659431928]
Service robots must effectively recognize and segment unknown objects to enhance their functionality. Traditional supervised learningbased segmentation techniques require extensive annotated datasets. This paper proposes a novel approach (ZISVFM) for solving UOIS by leveraging the powerful zero-shot capability of the segment anything model (SAM) and explicit visual representations from a selfsupervised vision transformer (ViT)
arXiv Detail & Related papers (2025-02-05T15:22:20Z)
SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation [81.36747103102459]
Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Current state-of-the-art methods focus on training innovative architectural designs on confined datasets. We investigate the impact of scaling up EHPS towards a family of generalist foundation models.
arXiv Detail & Related papers (2025-01-16T18:59:46Z)
Mono2Stereo: Monocular Knowledge Transfer for Enhanced Stereo Matching [7.840781070208874]
We propose leveraging monocular knowledge transfer to enhance stereo matching, namely Mono2Stereo. We introduce knowledge transfer with a two-stage training process, comprising synthetic data pre-training and real-world data fine-tuning. Experimental results demonstrate that our pre-trained model exhibits strong zero-shot capabilities.
arXiv Detail & Related papers (2024-11-14T03:01:36Z)
Adapting Segment Anything Model for Unseen Object Instance Segmentation [70.60171342436092]
Unseen Object Instance (UOIS) is crucial for autonomous robots operating in unstructured environments. We propose UOIS-SAM, a data-efficient solution for the UOIS task. UOIS-SAM integrates two key components: (i) a Heatmap-based Prompt Generator (HPG) to generate class-agnostic point prompts with precise foreground prediction, and (ii) a Hierarchical Discrimination Network (HDNet) that adapts SAM's mask decoder.
arXiv Detail & Related papers (2024-09-23T19:05:50Z)
A Framework for Fine-Tuning LLMs using Heterogeneous Feedback [69.51729152929413]
We present a framework for fine-tuning large language models (LLMs) using heterogeneous feedback. First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF. Next, given this unified feedback dataset, we extract a high-quality and diverse subset to obtain performance increases.
arXiv Detail & Related papers (2024-08-05T23:20:32Z)
Kick Back & Relax++: Scaling Beyond Ground-Truth Depth with SlowTV & CribsTV [50.616892315086574]
This paper proposes two novel datasets: SlowTV and CribsTV. These are large-scale datasets curated from publicly available YouTube videos, containing a total of 2M training frames. We leverage these datasets to tackle the challenging task of zero-shot generalization.
arXiv Detail & Related papers (2024-03-03T17:29:03Z)
Glass Segmentation with Multi Scales and Primary Prediction Guiding [2.66512000865131]
Glass-like objects can be seen everywhere in our daily life which are hard for existing methods to segment them. We propose MGNet, which consists of a FineRescaling and Merging module (FRM) to improve the ability to extract semantics. We supervise the model with a novel loss function with the uncertainty-aware loss to produce high-confidence segmentation maps.
arXiv Detail & Related papers (2024-02-13T16:14:32Z)
GEM: Boost Simple Network for Glass Surface Segmentation via Segment Anything Model and Data Synthesis [3.97478982737167]
We show how to segment glass surfaces with higher accuracy using two visual foundation models: Segment Anything (SAM) and Stable Diffusion. We also propose a Synthetic but large-scale Glass Surface Detection dataset dubbed S-GSD via diffusion model with four different scales. This dataset is a feasible source for transfer learning. The scale of synthetic data has positive impacts on transfer learning, while the improvement will gradually as the amount of data increases.
arXiv Detail & Related papers (2024-01-27T03:36:47Z)
Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery. Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data. In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
Text2Seg: Remote Sensing Image Semantic Segmentation via Text-Guided Visual Foundation Models [7.452422412106768]
We propose a novel method named Text2Seg for remote sensing semantic segmentation. It overcomes the dependency on extensive annotations by employing an automatic prompt generation process. We show that Text2Seg significantly improves zero-shot prediction performance compared to the vanilla SAM model.
arXiv Detail & Related papers (2023-04-20T18:39:41Z)
Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets. This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets. We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z)
Enhanced Boundary Learning for Glass-like Object Segmentation [55.45473926510806]
This paper aims to solve the glass-like object segmentation problem via enhanced boundary learning. In particular, we first propose a novel refined differential module for generating finer boundary cues. An edge-aware point-based graph convolution network module is proposed to model the global shape representation along the boundary.
arXiv Detail & Related papers (2021-03-29T16:18:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.