Self-Supervised Video Desmoking for Laparoscopic Surgery
- URL: http://arxiv.org/abs/2403.11192v2
- Date: Thu, 15 Aug 2024 12:52:13 GMT
- Title: Self-Supervised Video Desmoking for Laparoscopic Surgery
- Authors: Renlong Wu, Zhilu Zhang, Shuohao Zhang, Longfei Gou, Haobin Chen, Lei Zhang, Hao Chen, Wangmeng Zuo
- Abstract summary: We introduce self-supervised surgery video desmoking (SelfSVD).
We observe that the frame captured before the activation of high-energy devices is generally clear (named the pre-smoke frame, PS frame).
We further feed the valuable information from the PS frame into models, where a masking strategy and a regularization term are presented to avoid trivial solutions.
- Score: 48.83900673665993
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the difficulty of collecting real paired data, most existing desmoking methods train the models by synthesizing smoke, generalizing poorly to real surgical scenarios. Although a few works have explored single-image real-world desmoking in unpaired learning manners, they still encounter challenges in handling dense smoke. In this work, we address these issues together by introducing the self-supervised surgery video desmoking (SelfSVD). On the one hand, we observe that the frame captured before the activation of high-energy devices is generally clear (named pre-smoke frame, PS frame), thus it can serve as supervision for other smoky frames, making real-world self-supervised video desmoking practically feasible. On the other hand, in order to enhance the desmoking performance, we further feed the valuable information from PS frame into models, where a masking strategy and a regularization term are presented to avoid trivial solutions. In addition, we construct a real surgery video dataset for desmoking, which covers a variety of smoky scenes. Extensive experiments on the dataset show that our SelfSVD can remove smoke more effectively and efficiently while recovering more photo-realistic details than the state-of-the-art methods. The dataset, codes, and pre-trained models are available at \url{https://github.com/ZcsrenlongZ/SelfSVD}.
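The abstract's training signal can be sketched in code: the clear PS frame supervises the desmoked output of later smoky frames, and a mask downweights unreliable regions so the model cannot trivially copy the PS frame. The following is a minimal NumPy illustration only; `selfsvd_style_loss`, the mask semantics, and the L1 choice are assumptions for illustration, not the paper's exact formulation (see the linked repository for the authors' code).

```python
import numpy as np

def selfsvd_style_loss(pred, ps_frame, mask):
    """Masked L1 loss between a desmoked prediction and the pre-smoke (PS) frame.

    pred, ps_frame, mask: float arrays of the same shape, values in [0, 1].
    mask zeroes out regions where the PS frame is an unreliable target
    (e.g. tissue that moved between the PS frame and the smoky frame),
    which is the role a masking strategy plays in avoiding trivial solutions.
    """
    weighted_error = mask * np.abs(pred - ps_frame)  # per-pixel masked L1
    return float(np.sum(weighted_error) / (np.sum(mask) + 1e-8))

# Toy usage: a perfect prediction has zero loss; a masked-out pixel is ignored.
ps = np.ones((2, 2))
print(selfsvd_style_loss(ps.copy(), ps, np.ones((2, 2))))          # 0.0
print(selfsvd_style_loss(np.array([[1., 0.], [1., 1.]]), ps,
                         np.array([[1., 0.], [1., 1.]])))          # 0.0
```

In practice this loss would be one term alongside the regularization term the abstract mentions; normalizing by the mask sum keeps the loss scale comparable across frames with different mask coverage.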
Related papers
- RoboMirror: Understand Before You Imitate for Video to Humanoid Locomotion [59.51253426975907]
State-of-the-art humanoid locomotion systems rely on either curated motion capture trajectories or text commands. We propose RoboMirror, the first video-to-locomotion framework embodying "understand before you imitate".
arXiv Detail & Related papers (2025-12-29T17:59:19Z) - Rethinking Surgical Smoke: A Smoke-Type-Aware Laparoscopic Video Desmoking Method and Dataset [21.493577935588732]
We propose the Smoke-type-Aware Laparoscopic Video Desmoking Network (STANet). We introduce two smoke types: Diffusion Smoke and Ambient Smoke. We also construct the first large-scale synthetic smoke dataset with smoke type annotations.
arXiv Detail & Related papers (2025-12-02T13:55:27Z) - SmokeSeer: 3D Gaussian Splatting for Smoke Removal and Scene Reconstruction [14.475461616365346]
Smoke in real-world scenes can severely degrade the quality of images and hamper visibility. We introduce SmokeSeer, a method for simultaneous 3D scene reconstruction and smoke removal from a video. Our method uses thermal and RGB images, leveraging the fact that reduced scattering in thermal images enables us to see through the smoke.
arXiv Detail & Related papers (2025-09-22T03:05:22Z) - SmokeBench: A Real-World Dataset for Surveillance Image Desmoking in Early-Stage Fire Scenes [8.183561852240851]
Smoke produced by combustion significantly reduces the visibility of surveillance systems. There is an urgent need to remove smoke from images to obtain clear scene information. We present a real-world surveillance image desmoking benchmark dataset named SmokeBench.
arXiv Detail & Related papers (2025-09-16T05:51:11Z) - WildSmoke: Ready-to-Use Dynamic 3D Smoke Assets from a Single Video in the Wild [15.941164647083696]
We propose a pipeline to extract and reconstruct dynamic 3D smoke assets from a single in-the-wild video. Our method outperforms previous reconstruction and generation methods with high-quality smoke reconstructions.
arXiv Detail & Related papers (2025-09-14T06:06:42Z) - SelfHVD: Self-Supervised Handheld Video Deblurring for Mobile Phones [54.427316707517406]
We propose a self-supervised method for handheld video deblurring, driven by sharp clues in the video. To train the deblurring model, we extract the sharp clues from the video and take them as misalignment labels of neighboring blurry frames. We construct a synthetic and a real-world handheld video dataset for deblurring.
arXiv Detail & Related papers (2025-08-12T03:38:14Z) - Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion [70.4360995984905]
We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs.
arXiv Detail & Related papers (2025-06-09T17:59:55Z) - Temporal-Consistent Video Restoration with Pre-trained Diffusion Models [51.47188802535954]
Video restoration (VR) aims to recover high-quality videos from degraded ones.
Recent zero-shot VR methods using pre-trained diffusion models (DMs) suffer from approximation errors during reverse diffusion and insufficient temporal consistency.
We present a novel Maximum a Posteriori (MAP) framework that directly parameterizes video frames in the seed space of DMs, eliminating approximation errors.
arXiv Detail & Related papers (2025-03-19T03:41:56Z) - TACO: Taming Diffusion for in-the-wild Video Amodal Completion [32.474824991167424]
This paper tackles the task of Video Amodal Completion (VAC), which aims to generate the complete object consistently throughout the video.
We propose a conditional diffusion model, TACO, that repurposes pre-trained video diffusion models.
We demonstrate TACO's versatility on a wide range of in-the-wild videos from the Internet, as well as on diverse, unseen datasets commonly used in autonomous driving.
arXiv Detail & Related papers (2025-03-15T08:47:45Z) - ReCamMaster: Camera-Controlled Generative Rendering from A Single Video [72.42376733537925]
ReCamMaster is a camera-controlled generative video re-rendering framework. It reproduces the dynamic scene of an input video at novel camera trajectories. Our method also finds promising applications in video stabilization, super-resolution, and outpainting.
arXiv Detail & Related papers (2025-03-14T17:59:31Z) - Long Context Tuning for Video Generation [63.060794860098795]
Long Context Tuning (LCT) is a training paradigm that expands the context window of pre-trained single-shot video diffusion models.
Our method expands full attention mechanisms from individual shots to encompass all shots within a scene.
Experiments demonstrate coherent multi-shot scenes and exhibit emerging capabilities, including compositional generation and interactive shot extension.
arXiv Detail & Related papers (2025-03-13T17:40:07Z) - LSD3K: A Benchmark for Smoke Removal from Laparoscopic Surgery Images [0.7138611948315257]
Smoke generated by surgical instruments during laparoscopic surgery can obscure the visual field, impairing surgeons' ability to perform operations accurately and safely.
Although laparoscopic image desmoking has attracted researchers' attention in recent years, the lack of publicly available high-quality benchmark datasets remains the main bottleneck hampering progress on this task.
We construct a new high-quality dataset for Laparoscopic Surgery image Desmoking, named LSD3K, consisting of 3,000 paired synthetic non-homogeneous smoke images.
arXiv Detail & Related papers (2024-07-18T03:42:16Z) - WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models [132.77237314239025]
Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos.
Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions.
We reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion.
Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach.
arXiv Detail & Related papers (2024-07-15T11:21:03Z) - V-LASIK: Consistent Glasses-Removal from Videos Using Synthetic Data [20.23001319056999]
Diffusion-based generative models have recently shown remarkable image and video editing capabilities.
We focus on consistent and identity-preserving removal of glasses in videos, using it as a case study for consistent local attribute removal in videos.
We show that despite data imperfection, our model is able to perform the desired edit consistently while preserving the original video content.
arXiv Detail & Related papers (2024-06-20T17:14:43Z) - VGMShield: Mitigating Misuse of Video Generative Models [7.963591895964269]
We introduce VGMShield: a set of three straightforward but pioneering mitigations through the lifecycle of fake video generation.
We first try to understand whether there is uniqueness in generated videos and whether we can differentiate them from real videos.
Then, we investigate the tracing problem, which maps a fake video back to the model that generated it.
arXiv Detail & Related papers (2024-02-20T16:39:23Z) - Vivim: a Video Vision Mamba for Medical Video Segmentation [52.11785024350253]
This paper presents a Video Vision Mamba-based framework, dubbed Vivim, for medical video segmentation tasks.
Our Vivim can effectively compress the long-term representation into sequences at varying scales.
Experiments on thyroid segmentation, breast lesion segmentation in ultrasound videos, and polyp segmentation in colonoscopy videos demonstrate the effectiveness and efficiency of our Vivim.
arXiv Detail & Related papers (2024-01-25T13:27:03Z) - ScatterNeRF: Seeing Through Fog with Physically-Based Inverse Neural Rendering [83.75284107397003]
We introduce ScatterNeRF, a neural rendering method which renders scenes and decomposes the fog-free background.
We propose a disentangled representation for the scattering volume and the scene objects, and learn the scene reconstruction with physics-inspired losses.
We validate our method by capturing multi-view In-the-Wild data and controlled captures in a large-scale fog chamber.
arXiv Detail & Related papers (2023-05-03T13:24:06Z) - DiffDreamer: Towards Consistent Unsupervised Single-view Scene Extrapolation with Conditional Diffusion Models [91.94566873400277]
DiffDreamer is an unsupervised framework capable of synthesizing novel views depicting a long camera trajectory.
We show that image-conditioned diffusion models can effectively perform long-range scene extrapolation while preserving consistency significantly better than prior GAN-based methods.
arXiv Detail & Related papers (2022-11-22T10:06:29Z) - Video-based Smoky Vehicle Detection with A Coarse-to-Fine Framework [20.74110691914317]
We introduce a real-world large-scale smoky vehicle dataset with 75,000 annotated smoky vehicle images.
We also build a smoky vehicle video dataset including 163 long videos with segment-level annotations.
We present a new Coarse-to-fine Deep Smoky vehicle detection framework for efficient smoky vehicle detection.
arXiv Detail & Related papers (2022-07-08T06:42:45Z) - VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling [88.30109041658618]
A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data.
We present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs.
arXiv Detail & Related papers (2021-11-24T18:31:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.