Joint Depth Prediction and Semantic Segmentation with Multi-View SAM
- URL: http://arxiv.org/abs/2311.00134v1
- Date: Tue, 31 Oct 2023 20:15:40 GMT
- Title: Joint Depth Prediction and Semantic Segmentation with Multi-View SAM
- Authors: Mykhailo Shvets, Dongxu Zhao, Marc Niethammer, Roni Sengupta,
Alexander C. Berg
- Abstract summary: We propose a Multi-View Stereo (MVS) technique for depth prediction that benefits from rich semantic features of the Segment Anything Model (SAM).
This enhanced depth prediction, in turn, serves as a prompt to our Transformer-based semantic segmentation decoder.
- Score: 59.99496827912684
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-task approaches to joint depth and segmentation prediction are
well-studied for monocular images. Yet, predictions from a single view are
inherently limited, while multiple views are available in many robotics
applications. On the other end of the spectrum, video-based and full 3D methods
require numerous frames to perform reconstruction and segmentation. With this
work we propose a Multi-View Stereo (MVS) technique for depth prediction that
benefits from rich semantic features of the Segment Anything Model (SAM). This
enhanced depth prediction, in turn, serves as a prompt to our Transformer-based
semantic segmentation decoder. We report the mutual benefit that both tasks
enjoy in our quantitative and qualitative studies on the ScanNet dataset. Our
approach consistently outperforms single-task MVS and segmentation models,
along with multi-task monocular methods.
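The abstract describes two couplings: SAM-derived semantic features enrich the per-view features from which the MVS cost volume and depth are regressed, and the predicted depth is then fed back as a prompt to a Transformer-based segmentation decoder. The sketch below is not the authors' implementation; it is a minimal PyTorch illustration of that structure, and every module name, shape, and hyperparameter in it (e.g. `SamStyleEncoder`, `SemanticMVSDepth`, `DepthPromptedSegDecoder`, the 32 depth hypotheses) is an assumption made for illustration.

```python
# Minimal sketch (not the authors' code) of the two couplings described in the
# abstract: (1) SAM-style semantic features feed the cost volume used for MVS
# depth regression, and (2) the predicted depth is injected as a prompt into a
# Transformer segmentation decoder. All names and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SamStyleEncoder(nn.Module):
    """Placeholder for a SAM-like image encoder producing dense semantic features."""
    def __init__(self, out_dim=64):
        super().__init__()
        self.conv = nn.Conv2d(3, out_dim, kernel_size=7, stride=4, padding=3)

    def forward(self, img):                       # (B, 3, H, W)
        return self.conv(img)                     # (B, C, H/4, W/4)


class SemanticMVSDepth(nn.Module):
    """Regresses depth from a cost volume built over semantic features."""
    def __init__(self, feat_dim=64, num_hypotheses=32):
        super().__init__()
        self.num_hypotheses = num_hypotheses
        self.cost_reg = nn.Conv3d(feat_dim, 1, kernel_size=3, padding=1)

    def forward(self, ref_feat, src_feats, depth_values):
        # A real MVS model would warp src_feats with camera geometry per depth
        # hypothesis; this stand-in cost volume just tiles fused reference and
        # source features across the hypotheses.
        src_mean = torch.stack(src_feats, dim=0).mean(dim=0)
        volume = (ref_feat + src_mean).unsqueeze(2)                     # (B, C, 1, H, W)
        volume = volume.expand(-1, -1, self.num_hypotheses, -1, -1).contiguous()
        prob = F.softmax(self.cost_reg(volume).squeeze(1), dim=1)       # (B, D, H, W)
        depth = (prob * depth_values.view(1, -1, 1, 1)).sum(dim=1, keepdim=True)
        return depth                                                    # (B, 1, H, W)


class DepthPromptedSegDecoder(nn.Module):
    """Transformer decoder whose queries attend to depth-conditioned features."""
    def __init__(self, feat_dim=64, num_classes=20, num_queries=20):
        super().__init__()
        self.depth_proj = nn.Conv2d(1, feat_dim, kernel_size=1)
        self.queries = nn.Embedding(num_queries, feat_dim)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=feat_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.cls_head = nn.Linear(feat_dim, num_classes)

    def forward(self, feat, depth):
        # Depth acts as the prompt: it is embedded and added to the image
        # features that the segmentation queries cross-attend to.
        B, C, H, W = feat.shape
        memory = (feat + self.depth_proj(depth)).flatten(2).transpose(1, 2)   # (B, HW, C)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)                # (B, Q, C)
        q = self.decoder(q, memory)
        mask_logits = torch.einsum("bqc,bnc->bqn", q, memory).view(B, -1, H, W)
        return self.cls_head(q), mask_logits     # per-query classes, per-query masks


# Hypothetical wiring at feature resolution (all values are assumptions):
enc, mvs, seg = SamStyleEncoder(), SemanticMVSDepth(), DepthPromptedSegDecoder()
views = [torch.randn(1, 3, 256, 256) for _ in range(3)]      # reference + 2 source views
ref, srcs = enc(views[0]), [enc(v) for v in views[1:]]
depth = mvs(ref, srcs, torch.linspace(0.5, 5.0, 32))          # (1, 1, 64, 64)
cls_logits, mask_logits = seg(ref, depth)
```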
Related papers
- A Multitask Deep Learning Model for Classification and Regression of Hyperspectral Images: Application to the large-scale dataset [44.94304541427113]
We propose a multitask deep learning model to perform multiple classification and regression tasks simultaneously on hyperspectral images.
We validated our approach on a large hyperspectral dataset called TAIGA.
A comprehensive qualitative and quantitative analysis of the results shows that the proposed method significantly outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-23T11:14:54Z) - Frequency-based Matcher for Long-tailed Semantic Segmentation [22.199174076366003]
We focus on a relatively under-explored task setting, long-tailed semantic segmentation (LTSS).
We propose a dual-metric evaluation system and construct the LTSS benchmark to demonstrate the performance of semantic segmentation methods and long-tailed solutions.
We also propose a transformer-based algorithm to improve LTSS, frequency-based matcher, which solves the oversuppression problem by one-to-many matching.
arXiv Detail & Related papers (2024-06-06T09:57:56Z) - Semantic-SAM: Segment and Recognize Anything at Any Granularity [83.64686655044765]
We introduce Semantic-SAM, a universal image segmentation model that can segment and recognize anything at any desired granularity.
We consolidate multiple datasets across three granularities and introduce decoupled classification for objects and parts.
For the multi-granularity capability, we propose a multi-choice learning scheme during training, enabling each click to generate masks at multiple levels.
arXiv Detail & Related papers (2023-07-10T17:59:40Z) - RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z) - AIMS: All-Inclusive Multi-Level Segmentation [93.5041381700744]
We propose a new task, All-Inclusive Multi-Level Segmentation (AIMS), which segments visual regions into three levels: part, entity, and relation.
We also build a unified AIMS model through multi-dataset multi-task training to address the two major challenges of annotation inconsistency and task correlation.
arXiv Detail & Related papers (2023-05-28T16:28:49Z) - MED-VT++: Unifying Multimodal Learning with a Multiscale Encoder-Decoder Video Transformer [12.544216587327387]
We present an end-to-end trainable unified multiscale encoder-decoder transformer that is focused on dense prediction tasks in video.
The presented Multiscale Encoder-Decoder Video Transformer (MED-VT) uses multiscale representation throughout and optionally employs input beyond video.
We present a transductive learning scheme through many-to-many label propagation to provide temporally consistent video predictions.
arXiv Detail & Related papers (2023-04-12T15:50:19Z) - BEVerse: Unified Perception and Prediction in Birds-Eye-View for
Vision-Centric Autonomous Driving [92.05963633802979]
We present BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems.
We show that the multi-task BEVerse outperforms single-task methods on 3D object detection, semantic map construction, and motion prediction.
arXiv Detail & Related papers (2022-05-19T17:55:35Z) - MulT: An End-to-End Multitask Learning Transformer [66.52419626048115]
We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks.
Our framework encodes the input image into a shared representation and makes predictions for each vision task using task-specific transformer-based decoder heads (a minimal sketch of this shared-encoder, per-task-head pattern follows at the end of this list).
arXiv Detail & Related papers (2022-05-17T13:03:18Z)
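The MulT entry above describes the widely used multi-task pattern of a single shared representation feeding task-specific decoder heads. The following is a deliberately generic sketch of that pattern, not MulT's actual transformer-based decoders; the module names, head choices, and dimensions are assumptions made for illustration.

```python
# Generic sketch of multi-task learning with a shared encoder and per-task heads.
# Illustrative only; not MulT's architecture. All names and sizes are assumptions.
import torch
import torch.nn as nn


class SharedEncoder(nn.Module):
    """Shared backbone producing one representation used by all task heads."""
    def __init__(self, dim=64):
        super().__init__()
        self.stem = nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3)

    def forward(self, img):                  # (B, 3, H, W) -> (B, C, H/4, W/4)
        return self.stem(img)


class MultiTaskModel(nn.Module):
    def __init__(self, dim=64, num_classes=20):
        super().__init__()
        self.encoder = SharedEncoder(dim)
        # Each task gets its own lightweight head on top of the shared features.
        self.heads = nn.ModuleDict({
            "depth": nn.Conv2d(dim, 1, kernel_size=1),
            "segmentation": nn.Conv2d(dim, num_classes, kernel_size=1),
        })

    def forward(self, img):
        feat = self.encoder(img)
        return {name: head(feat) for name, head in self.heads.items()}


# Usage: one forward pass yields a prediction per task from the shared features.
model = MultiTaskModel()
outputs = model(torch.randn(2, 3, 128, 128))
print({name: tuple(t.shape) for name, t in outputs.items()})
```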