Multi-Label Stereo Matching for Transparent Scene Depth Estimation
- URL: http://arxiv.org/abs/2505.14008v1
- Date: Tue, 20 May 2025 07:03:57 GMT
- Title: Multi-Label Stereo Matching for Transparent Scene Depth Estimation
- Authors: Zhidan Liu, Chengtang Yao, Jiaxi Zeng, Yuwei Wu, Yunde Jia
- Abstract summary: We present a multi-label stereo matching method to simultaneously estimate the depth of transparent objects and the occluded background in transparent scenes. We also synthesize a dataset containing 10 scenes and 89 objects to validate the performance of transparent scene depth estimation.
- Score: 26.871684446738975
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present a multi-label stereo matching method to simultaneously estimate the depth of transparent objects and the occluded background in transparent scenes. Unlike previous methods that assume a unimodal distribution along the disparity dimension and formulate the matching as a single-label regression problem, we propose a multi-label regression formulation to estimate multiple depth values at the same pixel in transparent scenes. To solve the multi-label regression problem, we introduce a pixel-wise multivariate Gaussian representation, where the mean vector encodes multiple depth values at the same pixel, and the covariance matrix determines whether a multi-label representation is necessary for a given pixel. The representation is iteratively predicted within a GRU framework. In each iteration, we first predict the update step for the mean parameters and then use both the update step and the updated mean parameters to estimate the covariance matrix. We also synthesize a dataset containing 10 scenes and 89 objects to validate the performance of transparent scene depth estimation. The experiments show that our method greatly improves the performance on transparent surfaces while preserving the background information for scene reconstruction. Code is available at https://github.com/BFZD233/TranScene.
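To make the representation and the two-stage update concrete, below is a minimal PyTorch sketch of one plausible reading of the loop described in the abstract: a convolutional GRU carries the hidden state; a first head predicts the update step for the K per-pixel disparity means; a second head then estimates the covariance from both the update step and the updated means. The ConvGRU cell, the Cholesky parameterization of the covariance, and all module and channel choices are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn


class ConvGRU(nn.Module):
    """RAFT-style convolutional GRU cell (an assumed building block)."""

    def __init__(self, hidden_dim, input_dim):
        super().__init__()
        self.convz = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convr = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convq = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))  # update gate
        r = torch.sigmoid(self.convr(hx))  # reset gate
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))
        return (1 - z) * h + z * q


class MultiLabelGaussianHead(nn.Module):
    """Pixel-wise multivariate Gaussian head: `mu` holds K disparity
    hypotheses per pixel (e.g., transparent surface + occluded background
    for K = 2); the covariance, predicted as a lower-triangular Cholesky
    factor so it stays positive semi-definite, signals whether a pixel
    actually needs more than one label. All sizes are assumptions."""

    def __init__(self, feat_dim=128, hidden_dim=128, num_labels=2):
        super().__init__()
        self.gru = ConvGRU(hidden_dim, feat_dim + num_labels)
        # Step 1 of each iteration: the update step for the mean parameters.
        self.delta_head = nn.Conv2d(hidden_dim, num_labels, 3, padding=1)
        # Step 2: covariance estimated from BOTH the update step and the
        # updated means, mirroring the two-stage update in the abstract.
        n_tril = num_labels * (num_labels + 1) // 2
        self.cov_head = nn.Conv2d(hidden_dim + 2 * num_labels, n_tril, 3, padding=1)

    def forward(self, feat, h, mu, iters=4):
        preds = []
        for _ in range(iters):
            h = self.gru(h, torch.cat([feat, mu], dim=1))
            d_mu = self.delta_head(h)  # update step for the means
            mu = mu + d_mu             # updated multi-label means
            tril = self.cov_head(torch.cat([h, d_mu, mu], dim=1))
            preds.append((mu, tril))
        return preds  # mu: (B, K, H, W); tril: (B, K(K+1)/2, H, W)


# Toy usage at 1/8 resolution; feat would come from a stereo matching backbone.
feat = torch.randn(1, 128, 40, 64)
h = torch.zeros(1, 128, 40, 64)
mu = torch.zeros(1, 2, 40, 64)  # K = 2 disparity labels per pixel
mu_final, tril_final = MultiLabelGaussianHead()(feat, h, mu)[-1]
print(mu_final.shape, tril_final.shape)  # (1, 2, 40, 64) (1, 3, 40, 64)
```

One natural use of the predicted covariance, under these assumptions, is as a per-pixel gate: a near-degenerate covariance (or near-coincident means) suggests the pixel is effectively unimodal and a single disparity label suffices, while a well-separated bimodal fit keeps both the transparent surface and the occluded background.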
Related papers
- LlamaSeg: Image Segmentation via Autoregressive Mask Generation [46.17509085054758]
We present LlamaSeg, a visual autoregressive framework that unifies multiple image segmentation tasks via natural language instructions. We reformulate image segmentation as a visual generation problem, representing masks as "visual" tokens and employing a LLaMA-style Transformer to predict them directly from image inputs.
arXiv Detail & Related papers (2025-05-26T02:22:41Z)
- A Global Depth-Range-Free Multi-View Stereo Transformer Network with Pose Embedding [76.44979557843367]
We propose a novel multi-view stereo (MVS) framework that dispenses with the depth-range prior. We introduce a Multi-view Disparity Attention (MDA) module to aggregate long-range context information. We explicitly estimate the quality of the current pixel corresponding to sampled points on the epipolar line of the source image.
arXiv Detail & Related papers (2024-11-04T08:50:16Z)
- MultiDepth: Multi-Sample Priors for Refining Monocular Metric Depth Estimations in Indoor Scenes [0.0]
Existing models are sensitive to factors such as boundary frequency of objects in the scene and scene complexity.
We propose a solution by taking samples of the image along with the initial depth map prediction made by a pre-trained monocular metric depth estimation (MMDE) model.
Compared to existing iterative depth refinement techniques, MultiDepth does not employ normal map prediction as part of its architecture.
arXiv Detail & Related papers (2024-11-01T21:30:51Z)
- Pixel-Aligned Multi-View Generation with Depth Guided Decoder [86.1813201212539]
We propose a novel method for pixel-level image-to-multi-view generation.
Unlike prior work, we incorporate attention layers across multi-view images in the VAE decoder of a latent video diffusion model.
Our model enables better pixel alignment across multi-view images.
arXiv Detail & Related papers (2024-08-26T04:56:41Z)
- Semi-supervised Counting via Pixel-by-pixel Density Distribution Modelling [135.66138766927716]
This paper focuses on semi-supervised crowd counting, where only a small portion of the training data are labeled.
We formulate the pixel-wise density value to be regressed as a probability distribution rather than a single deterministic value.
Our method outperforms the competitors by a large margin under various labeled-ratio settings.
arXiv Detail & Related papers (2024-02-23T12:48:02Z)
- Video Frame Interpolation with Many-to-many Splatting and Spatial Selective Refinement [83.60486465697318]
We propose a fully differentiable Many-to-Many (M2M) splatting framework to interpolate frames efficiently.
For each input frame pair, M2M has a minuscule computational overhead when interpolating an arbitrary number of in-between frames.
We extend M2M to an M2M++ framework by introducing a flexible Spatial Selective Refinement component, which allows trading computational efficiency for quality and vice versa.
arXiv Detail & Related papers (2023-10-29T09:09:32Z)
- Multiscale Representation for Real-Time Anti-Aliasing Neural Rendering [84.37776381343662]
Mip-NeRF proposes a multiscale representation as a conical frustum to encode scale information.
We propose mip voxel grids (Mip-VoG), an explicit multiscale representation for real-time anti-aliasing rendering.
Our approach is the first to offer multiscale training and real-time anti-aliasing rendering simultaneously.
arXiv Detail & Related papers (2023-04-20T04:05:22Z)
- Multi-modal Large Language Model Enhanced Pseudo 3D Perception Framework for Visual Commonsense Reasoning [24.29849761674329]
Representative works first recognize objects in images and then associate them with keywords in texts.
An MLLM-enhanced pseudo-3D perception framework is designed for visual commonsense reasoning.
Experiments on the VCR dataset demonstrate the superiority of the proposed framework over state-of-the-art approaches.
arXiv Detail & Related papers (2023-01-30T23:43:28Z)
- VoGE: A Differentiable Volume Renderer using Gaussian Ellipsoids for Analysis-by-Synthesis [62.47221232706105]
We propose VoGE, which utilizes Gaussian reconstruction kernels as volumetric primitives.
To efficiently render via VoGE, we propose an approximate closed-form solution for the volume density aggregation and a coarse-to-fine rendering strategy.
VoGE outperforms the state of the art when applied to various vision tasks, e.g., object pose estimation, shape/texture fitting, and reasoning.
arXiv Detail & Related papers (2022-05-30T19:52:11Z)
- Adaptive Context-Aware Multi-Modal Network for Depth Completion [107.15344488719322]
We propose to adopt graph propagation to capture the observed spatial contexts.
We then apply an attention mechanism to the propagation, which encourages the network to model the contextual information adaptively.
Finally, we introduce a symmetric gated fusion strategy to exploit the extracted multi-modal features effectively.
Our model, named Adaptive Context-Aware Multi-Modal Network (ACMNet), achieves the state-of-the-art performance on two benchmarks.
arXiv Detail & Related papers (2020-08-25T06:00:06Z)
- Autoregressive Unsupervised Image Segmentation [8.894935073145252]
We propose a new unsupervised image segmentation approach based on mutual information between different views constructed from the inputs.
The proposed method outperforms the current state of the art on unsupervised image segmentation.
arXiv Detail & Related papers (2020-07-16T10:47:40Z)
- Multi-Granularity Canonical Appearance Pooling for Remote Sensing Scene Classification [0.34376560669160383]
We propose a novel Multi-Granularity Canonical Appearance Pooling (MG-CAP) to automatically capture the latent ontological structure of remote sensing datasets.
For each specific granularity, we discover the canonical appearance from a set of pre-defined transformations and learn the corresponding CNN features through a maxout-based Siamese style architecture.
We provide a stable solution for training the eigenvalue-decomposition function (EIG) on a GPU and demonstrate the corresponding back-propagation using matrix calculus.
arXiv Detail & Related papers (2020-04-09T11:24:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.