Stereo Video Reconstruction Without Explicit Depth Maps for Endoscopic Surgery
- URL: http://arxiv.org/abs/2109.08227v1
- Date: Thu, 16 Sep 2021 21:22:43 GMT
- Title: Stereo Video Reconstruction Without Explicit Depth Maps for Endoscopic Surgery
- Authors: Annika Brundyn, Jesse Swanson, Kyunghyun Cho, Doug Kondziolka, Eric Oermann
- Abstract summary: We introduce the task of stereo video reconstruction or, equivalently, 2D-to-3D video conversion for minimally invasive surgical video.
We design and implement a series of end-to-end U-Net-based solutions for this task.
We evaluate these solutions by surveying ten experts - surgeons who routinely perform endoscopic surgery.
- Score: 37.531587409884914
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce the task of stereo video reconstruction or, equivalently,
2D-to-3D video conversion for minimally invasive surgical video. We design and
implement a series of end-to-end U-Net-based solutions for this task by varying
the input (single frame vs. multiple consecutive frames), loss function (MSE,
MAE, or perceptual losses), and network architecture. We evaluate these
solutions by surveying ten experts - surgeons who routinely perform endoscopic
surgery. We run two separate reader studies: one evaluating individual frames
and the other evaluating fully reconstructed 3D video played on a VR headset.
In the first reader study, a variant of the U-Net that takes as input multiple
consecutive video frames and outputs the missing view performs best. We draw
two conclusions from this outcome. First, motion information coming from
multiple past frames is crucial in recreating stereo vision. Second, the
proposed U-Net variant can indeed exploit such motion information for solving
this task. The result from the second study further confirms the effectiveness
of the proposed U-Net variant. The surgeons reported that they could
successfully perceive depth from the reconstructed 3D video clips. They also
expressed a clear preference for the reconstructed 3D video over the original
2D video. These two reader studies strongly support the usefulness of the
proposed task of stereo reconstruction for minimally invasive surgical video
and indicate that deep learning is a promising approach to this task. Finally,
we identify two automatic metrics, LPIPS and DISTS, that are strongly
correlated with expert judgement and that could serve as proxies for the latter
in future studies.
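To make the winning configuration concrete, here is a minimal PyTorch sketch of a U-Net that takes several consecutive left-view frames and predicts the missing right view. This is an illustration only, not the authors' released architecture: the two-level depth, the channel widths, and the choice of stacking k RGB frames along the channel axis are assumptions made for brevity.

```python
# Hypothetical sketch of the multi-frame U-Net variant described in the
# abstract: k consecutive left-view frames in, the missing right view out.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU: the standard U-Net building block.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class MultiFrameUNet(nn.Module):
    def __init__(self, num_frames=4, base=32):
        super().__init__()
        in_ch = 3 * num_frames            # k RGB frames stacked channel-wise
        self.enc1 = conv_block(in_ch, base)
        self.enc2 = conv_block(base, base * 2)
        self.bottleneck = conv_block(base * 2, base * 4)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = conv_block(base * 4, base * 2)
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = conv_block(base * 2, base)
        self.out = nn.Conv2d(base, 3, 1)  # predicted right-view RGB frame

    def forward(self, frames):
        # frames: (B, k, 3, H, W) -> fold time into channels: (B, 3k, H, W)
        b, k, c, h, w = frames.shape
        x = frames.reshape(b, k * c, h, w)
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        bo = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(bo), e2], dim=1))  # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return torch.sigmoid(self.out(d1))

# Usage: four consecutive 256x256 left-view frames -> one right-view frame.
model = MultiFrameUNet(num_frames=4)
left = torch.rand(1, 4, 3, 256, 256)
right_pred = model(left)                  # (1, 3, 256, 256)
loss = nn.functional.l1_loss(right_pred, torch.rand_like(right_pred))  # MAE
```

Swapping `l1_loss` for `mse_loss` or a perceptual loss reproduces the loss-function axis the paper varies.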
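The closing observation, that LPIPS and DISTS track expert judgement, suggests a simple automatic evaluation loop. The sketch below assumes the third-party `lpips` and `piq` packages, which the abstract does not name; check their documentation for exact input conventions before relying on the numbers.

```python
# Hedged sketch: score a reconstructed right view against ground truth with
# LPIPS and DISTS, the two metrics the paper found to correlate with experts.
import torch
import lpips  # assumed: pip install lpips
import piq    # assumed: pip install piq

pred = torch.rand(1, 3, 256, 256)    # reconstructed right view, in [0, 1]
target = torch.rand(1, 3, 256, 256)  # ground-truth right view, in [0, 1]

# lpips expects inputs scaled to [-1, 1]; lower scores are better.
lpips_fn = lpips.LPIPS(net='alex')
lpips_score = lpips_fn(pred * 2 - 1, target * 2 - 1).item()

# piq's DISTS implementation takes inputs in [0, 1]; lower is also better.
dists_score = piq.DISTS(reduction='mean')(pred, target).item()

print(f"LPIPS: {lpips_score:.4f}  DISTS: {dists_score:.4f}")
```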
Related papers
- Learning-based Multi-View Stereo: A Survey [55.3096230732874]
Multi-View Stereo (MVS) algorithms synthesize a comprehensive 3D representation, enabling precise reconstruction in complex environments.
With the success of deep learning, many learning-based MVS methods have been proposed, achieving impressive performance against traditional methods.
arXiv Detail & Related papers (2024-08-27T17:53:18Z)
- EmbodiedSAM: Online Segment Any 3D Thing in Real Time [61.2321497708998]
Embodied tasks require the agent to fully understand 3D scenes simultaneously with its exploration.
An online, real-time, fine-grained, and highly generalized 3D perception model is urgently needed.
arXiv Detail & Related papers (2024-08-21T17:57:06Z)
- A Review of 3D Reconstruction Techniques for Deformable Tissues in Robotic Surgery [8.909938295090827]
NeRF-based techniques have recently garnered attention for their ability to reconstruct scenes implicitly.
3D-GS, by contrast, represents scenes explicitly using 3D Gaussians and projects them onto a 2D plane, replacing the complex volume rendering of NeRF.
This work explores and reviews state-of-the-art (SOTA) approaches, discussing their innovations and implementation principles.
arXiv Detail & Related papers (2024-08-08T12:51:23Z)
- Enhancing Temporal Consistency in Video Editing by Reconstructing Videos with 3D Gaussian Splatting [94.84688557937123]
Video-3DGS is a 3D Gaussian Splatting (3DGS)-based video refiner designed to enhance temporal consistency in zero-shot video editors.
Our approach uses a two-stage 3D Gaussian optimization process tailored for editing dynamic monocular videos.
It enhances video editing by ensuring temporal consistency across 58 dynamic monocular videos.
arXiv Detail & Related papers (2024-06-04T17:57:37Z)
- MV2MAE: Multi-View Video Masked Autoencoders [33.61642891911761]
We present a method for self-supervised learning from synchronized multi-view videos.
We use a cross-view reconstruction task to inject geometry information into the model.
Our approach is based on the masked autoencoder (MAE) framework.
arXiv Detail & Related papers (2024-01-29T05:58:23Z)
- Diffusion-Guided Reconstruction of Everyday Hand-Object Interaction Clips [38.02945794078731]
We tackle the task of reconstructing hand-object interactions from short video clips.
Our approach casts 3D inference as a per-video optimization and recovers a neural 3D representation of the object shape.
We empirically evaluate our approach on egocentric videos and observe significant improvements over prior single-view and multi-view methods.
arXiv Detail & Related papers (2023-09-11T17:58:30Z)
- Geometry-Aware Attenuation Learning for Sparse-View CBCT Reconstruction [53.93674177236367]
Cone Beam Computed Tomography (CBCT) plays a vital role in clinical imaging.
Traditional methods typically require hundreds of 2D X-ray projections to reconstruct a high-quality 3D CBCT image.
This has led to growing interest in sparse-view CBCT reconstruction to reduce radiation doses.
We introduce a novel geometry-aware encoder-decoder framework to solve this problem.
arXiv Detail & Related papers (2023-03-26T14:38:42Z)
- SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video Anomaly Detection [108.57862846523858]
We revisit the self-supervised multi-task learning framework, proposing several updates to the original method.
We modernize the 3D convolutional backbone by introducing multi-head self-attention modules.
In our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps.
arXiv Detail & Related papers (2022-07-16T19:25:41Z)
- Video Summarization through Reinforcement Learning with a 3D Spatio-Temporal U-Net [15.032516344808526]
We introduce the 3DST-UNet-RL framework for video summarization.
We show experimental evidence for the effectiveness of 3DST-UNet-RL on two commonly used general video summarization benchmarks.
The proposed video summarization has the potential to reduce the storage costs of ultrasound screening videos and to increase efficiency when browsing patient video data during retrospective analysis.
arXiv Detail & Related papers (2021-06-19T16:27:19Z)
- 3D Self-Supervised Methods for Medical Imaging [7.65168530693281]
We propose 3D versions of five different self-supervised methods, in the form of proxy tasks.
Our methods facilitate neural network feature learning from unlabeled 3D images, aiming to reduce the cost of expert annotation.
The developed algorithms are 3D Contrastive Predictive Coding, 3D Rotation Prediction, 3D Jigsaw Puzzles, Relative 3D Patch Location, and 3D Exemplar Networks.
arXiv Detail & Related papers (2020-06-06T09:56:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.