GEM: Boost Simple Network for Glass Surface Segmentation via Vision Foundation Models
- URL: http://arxiv.org/abs/2307.12018v2
- Date: Tue, 21 May 2024 06:02:39 GMT
- Title: GEM: Boost Simple Network for Glass Surface Segmentation via Vision Foundation Models
- Authors: Jing Hao, Moyun Liu, Jinrong Yang, Kuo Feng Hung,
- Abstract summary: Detecting glass surfaces is challenging because of the inherent ambiguity in their transparency and reflective characteristics.
We propose to address these issues by fully harnessing the capabilities of two existing vision foundation models (VFMs): Stable Diffusion and Segment Anything Model (SAM).
Our GEM establishes a new state-of-the-art performance with the help of these two VFMs, surpassing the best-reported method GlassSemNet with an IoU improvement of 2.1%.
- Score: 7.423981028880871
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detecting glass regions is a challenging task due to the inherent ambiguity in their transparency and reflective characteristics. Current solutions in this field remain rooted in conventional deep learning paradigms, requiring the construction of annotated datasets and the design of network architectures. However, the evident drawback of these mainstream solutions lies in the time-consuming and labor-intensive process of curating datasets, alongside the increasing complexity of model structures. In this paper, we propose to address these issues by fully harnessing the capabilities of two existing vision foundation models (VFMs): Stable Diffusion and Segment Anything Model (SAM). First, we construct a Synthetic but photorealistic large-scale Glass Surface Detection dataset, dubbed S-GSD, without any labor cost via Stable Diffusion. This dataset comes in four different scales, comprising 168k images in total with precise masks. In addition, based on the powerful segmentation ability of SAM, we devise a simple Glass surface sEgMentor named GEM, which follows a simple query-based encoder-decoder architecture. Comprehensive experiments are conducted on the large-scale glass segmentation dataset GSD-S. Our GEM establishes a new state-of-the-art performance with the help of these two VFMs, surpassing the best-reported method GlassSemNet with an IoU improvement of 2.1%. Additionally, extensive experiments demonstrate that our synthetic dataset S-GSD exhibits remarkable performance in zero-shot and transfer learning settings. Code, datasets and models are publicly available at: https://github.com/isbrycee/GEM
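The abstract reports its headline result as an IoU (intersection over union) improvement. As a minimal sketch of what that metric measures for a single binary glass mask, the pure-Python function below may help; the function name `binary_iou`, the toy masks, and the empty-mask convention are illustrative assumptions, and the paper's actual evaluation protocol (for example, how scores are averaged over a dataset) may differ.

```python
def binary_iou(pred, target):
    """Intersection over union of two binary masks given as nested lists.

    Hypothetical helper for illustration only; not taken from the GEM code base.
    """
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p, t = bool(p), bool(t)
            inter += p and t   # pixel is glass in both masks
            union += p or t    # pixel is glass in at least one mask
    # Assumed convention: two empty masks count as perfect agreement.
    return inter / union if union else 1.0


# Toy 2x3 masks: 1 marks a glass pixel, 0 marks background.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
true_mask = [[1, 0, 0],
             [0, 1, 1]]
print(binary_iou(pred_mask, true_mask))
```

Here the masks share two glass pixels and together cover four, so the score is 2/4.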
Related papers
- Mono2Stereo: Monocular Knowledge Transfer for Enhanced Stereo Matching [7.840781070208874]
We propose leveraging monocular knowledge transfer to enhance stereo matching, namely Mono2Stereo.
We introduce knowledge transfer with a two-stage training process, comprising synthetic data pre-training and real-world data fine-tuning.
Experimental results demonstrate that our pre-trained model exhibits strong zero-shot capabilities.
arXiv Detail & Related papers (2024-11-14T03:01:36Z)
- Adapting Segment Anything Model for Unseen Object Instance Segmentation [70.60171342436092]
Unseen Object Instance Segmentation (UOIS) is crucial for autonomous robots operating in unstructured environments.
We propose UOIS-SAM, a data-efficient solution for the UOIS task.
UOIS-SAM integrates two key components: (i) a Heatmap-based Prompt Generator (HPG) to generate class-agnostic point prompts with precise foreground prediction, and (ii) a Hierarchical Discrimination Network (HDNet) that adapts SAM's mask decoder.
arXiv Detail & Related papers (2024-09-23T19:05:50Z)
- A Framework for Fine-Tuning LLMs using Heterogeneous Feedback [69.51729152929413]
We present a framework for fine-tuning large language models (LLMs) using heterogeneous feedback.
First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF.
Next, given this unified feedback dataset, we extract a high-quality and diverse subset to obtain performance increases.
arXiv Detail & Related papers (2024-08-05T23:20:32Z)
- Kick Back & Relax++: Scaling Beyond Ground-Truth Depth with SlowTV & CribsTV [50.616892315086574]
This paper proposes two novel datasets: SlowTV and CribsTV.
These are large-scale datasets curated from publicly available YouTube videos, containing a total of 2M training frames.
We leverage these datasets to tackle the challenging task of zero-shot generalization.
arXiv Detail & Related papers (2024-03-03T17:29:03Z)
- Glass Segmentation with Multi Scales and Primary Prediction Guiding [2.66512000865131]
Glass-like objects are everywhere in daily life, yet they are hard for existing methods to segment.
We propose MGNet, which consists of a FineRescaling and Merging module (FRM) to improve the ability to extract semantics.
We supervise the model with a novel loss function with the uncertainty-aware loss to produce high-confidence segmentation maps.
arXiv Detail & Related papers (2024-02-13T16:14:32Z)
- GEM: Boost Simple Network for Glass Surface Segmentation via Segment Anything Model and Data Synthesis [3.97478982737167]
We show how to segment glass surfaces with higher accuracy using two visual foundation models: Segment Anything (SAM) and Stable Diffusion.
We also propose a Synthetic but large-scale Glass Surface Detection dataset dubbed S-GSD via diffusion model with four different scales.
This dataset is a feasible source for transfer learning. The scale of the synthetic data has a positive impact on transfer learning, while the improvement gradually saturates as the amount of data increases.
arXiv Detail & Related papers (2024-01-27T03:36:47Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and the scarcity of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- Text2Seg: Remote Sensing Image Semantic Segmentation via Text-Guided Visual Foundation Models [7.452422412106768]
We propose a novel method named Text2Seg for remote sensing semantic segmentation.
It overcomes the dependency on extensive annotations by employing an automatic prompt generation process.
We show that Text2Seg significantly improves zero-shot prediction performance compared to the vanilla SAM model.
arXiv Detail & Related papers (2023-04-20T18:39:41Z)
- Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets.
This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets.
We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z)
- Enhanced Boundary Learning for Glass-like Object Segmentation [55.45473926510806]
This paper aims to solve the glass-like object segmentation problem via enhanced boundary learning.
In particular, we first propose a novel refined differential module for generating finer boundary cues.
An edge-aware point-based graph convolution network module is proposed to model the global shape representation along the boundary.
arXiv Detail & Related papers (2021-03-29T16:18:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.