Stereo Video Reconstruction Without Explicit Depth Maps for Endoscopic
Surgery
- URL: http://arxiv.org/abs/2109.08227v1
- Date: Thu, 16 Sep 2021 21:22:43 GMT
- Title: Stereo Video Reconstruction Without Explicit Depth Maps for Endoscopic
Surgery
- Authors: Annika Brundyn, Jesse Swanson, Kyunghyun Cho, Doug Kondziolka, Eric
Oermann
- Abstract summary: We introduce the task of stereo video reconstruction or, equivalently, 2D-to-3D video conversion for minimally invasive surgical video.
We design and implement a series of end-to-end U-Net-based solutions for this task.
We evaluate these solutions by surveying ten experts - surgeons who routinely perform endoscopic surgery.
- Score: 37.531587409884914
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce the task of stereo video reconstruction or, equivalently,
2D-to-3D video conversion for minimally invasive surgical video. We design and
implement a series of end-to-end U-Net-based solutions for this task by varying
the input (single frame vs. multiple consecutive frames), loss function (MSE,
MAE, or perceptual losses), and network architecture. We evaluate these
solutions by surveying ten experts - surgeons who routinely perform endoscopic
surgery. We run two separate reader studies: one evaluating individual frames
and the other evaluating fully reconstructed 3D video played on a VR headset.
In the first reader study, a variant of the U-Net that takes as input multiple
consecutive video frames and outputs the missing view performs best. We draw
two conclusions from this outcome. First, motion information coming from
multiple past frames is crucial in recreating stereo vision. Second, the
proposed U-Net variant can indeed exploit such motion information for solving
this task. The result from the second study further confirms the effectiveness
of the proposed U-Net variant. The surgeons reported that they could
successfully perceive depth from the reconstructed 3D video clips. They also
expressed a clear preference for the reconstructed 3D video over the original
2D video. These two reader studies strongly support the usefulness of the
proposed task of stereo reconstruction for minimally invasive surgical video
and indicate that deep learning is a promising approach to this task. Finally,
we identify two automatic metrics, LPIPS and DISTS, that are strongly
correlated with expert judgement and that could serve as proxies for the latter
in future studies.
Related papers
- Enhancing Temporal Consistency in Video Editing by Reconstructing Videos with 3D Gaussian Splatting [94.84688557937123]
Video-3DGS is a 3D Gaussian Splatting (3DGS)-based video refiner designed to enhance temporal consistency in zero-shot video editors.
Our approach utilizes a two-stage 3D Gaussian optimizing process tailored for editing dynamic monocular videos.
It enhances video editing by ensuring temporal consistency across 58 dynamic monocular videos.
arXiv Detail & Related papers (2024-06-04T17:57:37Z) - Distinguish Any Fake Videos: Unleashing the Power of Large-scale Data and Motion Features [21.583246378475856]
We introduce an extensive video dataset designed specifically for AI-Generated Video Detection (GenVidDet)
We also present the Dual-Branch 3D Transformer (DuB3D), an innovative and effective method for distinguishing between real and generated videos.
DuB3D can distinguish between real and generated video content with 96.77% accuracy, and strong generalization capability even for unseen types.
arXiv Detail & Related papers (2024-05-24T08:26:04Z) - MV2MAE: Multi-View Video Masked Autoencoders [33.61642891911761]
We present a method for self-supervised learning from synchronized multi-view videos.
We use a cross-view reconstruction task to inject geometry information in the model.
Our approach is based on the masked autoencoder (MAE) framework.
arXiv Detail & Related papers (2024-01-29T05:58:23Z) - Diffusion-Guided Reconstruction of Everyday Hand-Object Interaction
Clips [38.02945794078731]
We tackle the task of reconstructing hand-object interactions from short video clips.
Our approach casts 3D inference as a per-video optimization and recovers a neural 3D representation of the object shape.
We empirically evaluate our approach on egocentric videos, and observe significant improvements over prior single-view and multi-view methods.
arXiv Detail & Related papers (2023-09-11T17:58:30Z) - Farm3D: Learning Articulated 3D Animals by Distilling 2D Diffusion [67.71624118802411]
We present Farm3D, a method for learning category-specific 3D reconstructors for articulated objects.
We propose a framework that uses an image generator, such as Stable Diffusion, to generate synthetic training data.
Our network can be used for analysis, including monocular reconstruction, or for synthesis, generating articulated assets for real-time applications such as video games.
arXiv Detail & Related papers (2023-04-20T17:59:34Z) - State of the Art in Dense Monocular Non-Rigid 3D Reconstruction [100.9586977875698]
3D reconstruction of deformable (or non-rigid) scenes from a set of monocular 2D image observations is a long-standing and actively researched area of computer vision and graphics.
This survey focuses on state-of-the-art methods for dense non-rigid 3D reconstruction of various deformable objects and composite scenes from monocular videos or sets of monocular views.
arXiv Detail & Related papers (2022-10-27T17:59:53Z) - SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video
Anomaly Detection [108.57862846523858]
We revisit the self-supervised multi-task learning framework, proposing several updates to the original method.
We modernize the 3D convolutional backbone by introducing multi-head self-attention modules.
In our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps.
arXiv Detail & Related papers (2022-07-16T19:25:41Z) - Video Summarization through Reinforcement Learning with a 3D
Spatio-Temporal U-Net [15.032516344808526]
We introduce 3DST-UNet-RL framework for video summarization.
We show experimental evidence for the effectiveness of 3DST-UNet-RL on two commonly used general video summarization benchmarks.
The proposed video summarization has the potential to save storage costs of ultrasound screening videos as well as to increase efficiency when browsing patient video data during retrospective analysis.
arXiv Detail & Related papers (2021-06-19T16:27:19Z) - Human Mesh Recovery from Multiple Shots [85.18244937708356]
We propose a framework for improved 3D reconstruction and mining of long sequences with pseudo ground truth 3D human mesh.
We show that the resulting data is beneficial in the training of various human mesh recovery models.
The tools we develop open the door to processing and analyzing in 3D content from a large library of edited media.
arXiv Detail & Related papers (2020-12-17T18:58:02Z) - 3D Self-Supervised Methods for Medical Imaging [7.65168530693281]
We propose 3D versions for five different self-supervised methods, in the form of proxy tasks.
Our methods facilitate neural network feature learning from unlabeled 3D images, aiming to reduce the required cost for expert annotation.
The developed algorithms are 3D Contrastive Predictive Coding, 3D Rotation prediction, 3D Jigsaw puzzles, Relative 3D patch location, and 3D Exemplar networks.
arXiv Detail & Related papers (2020-06-06T09:56:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.