RT-GAN: Recurrent Temporal GAN for Adding Lightweight Temporal Consistency to Frame-Based Domain Translation Approaches
- URL: http://arxiv.org/abs/2310.00868v2
- Date: Tue, 13 May 2025 16:31:47 GMT
- Title: RT-GAN: Recurrent Temporal GAN for Adding Lightweight Temporal Consistency to Frame-Based Domain Translation Approaches
- Authors: Shawn Mathew, Saad Nadeem, Alvin C. Goh, Arie Kaufman
- Abstract summary: We present a lightweight solution with a tunable temporal parameter, RT-GAN, for adding temporal consistency to individual frame-based approaches. We demonstrate the effectiveness of our approach on two challenging use cases in colonoscopy.
- Score: 3.7873597471903944
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fourteen million colonoscopies are performed annually in the U.S. alone. However, the videos from these colonoscopies are not saved due to storage constraints (each video from a high-definition colonoscope camera can run to tens of gigabytes). Instead, a few relevant individual frames are saved for documentation/reporting purposes, and these are the frames on which most current colonoscopy AI models are trained. While developing new unsupervised domain translation methods for colonoscopy (e.g., to translate between real optical and virtual/CT colonoscopy), it is thus typical to start with approaches that initially work for individual frames without temporal consistency. Once an individual-frame model has been finalized, additional contiguous frames are added with a modified deep learning architecture to train a new model from scratch for temporal consistency. This transition to temporally consistent deep learning models, however, requires significantly more computational and memory resources for training. In this paper, we present a lightweight solution with a tunable temporal parameter, RT-GAN (Recurrent Temporal GAN), for adding temporal consistency to individual frame-based approaches that reduces training requirements by a factor of 5. We demonstrate the effectiveness of our approach on two challenging use cases in colonoscopy: haustral fold segmentation (indicative of missed surface) and realistic colonoscopy simulator video generation. We also release a first-of-its-kind temporal dataset for colonoscopy for the above use cases. The datasets, accompanying code, and pretrained models will be made available on our Computational Endoscopy Platform GitHub (https://github.com/nadeemlab/CEP). The supplementary video is available at https://youtu.be/UMVP-uIXwWk.
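A minimal sketch of the recipe the abstract outlines: a generator that conditions on its own previous output, trained with an adversarial term plus a tunable temporal-consistency weight. The module names, loss form, and `lambda_t` weighting below are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentTranslator(nn.Module):
    """Frame translator that also sees its own previous output."""
    def __init__(self, ch=32):
        super().__init__()
        # 3 channels for the current frame + 3 for the previous output
        self.net = nn.Sequential(
            nn.Conv2d(6, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, frame, prev_out):
        return self.net(torch.cat([frame, prev_out], dim=1))

def clip_loss(gen, disc, frames, lambda_t=10.0):
    """Loss over a short clip; lambda_t is the tunable temporal weight."""
    prev = torch.zeros_like(frames[:, 0])      # frames: (B, T, 3, H, W)
    adv = temp = 0.0
    for t in range(frames.shape[1]):
        out = gen(frames[:, t], prev)
        logits = disc(out)
        adv = adv + F.binary_cross_entropy_with_logits(
            logits, torch.ones_like(logits))
        if t > 0:                              # penalize flicker between frames
            temp = temp + F.l1_loss(out, prev)
        prev = out.detach()                    # recurrence without long backprop
    return adv + lambda_t * temp
```

Detaching the previous output keeps backpropagation per-frame rather than through the whole clip, which would be one way to keep training lightweight; whether RT-GAN does exactly this is not stated in the abstract.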
Related papers
- Adapting Vision Foundation Models for Real-time Ultrasound Image Segmentation [20.009670139005085]
Existing ultrasound segmentation methods often struggle with adaptability to new tasks.
We introduce an adaptive framework that leverages the vision foundation model Hiera to extract multi-scale features.
These enriched features are then decoded to produce precise and robust segmentation.
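The "enriched multi-scale features ... decoded" step can be pictured as projecting each backbone scale to a common width, upsampling, and fusing. A hypothetical sketch (the channel widths and fusion-by-summation are assumptions, not the paper's design):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleDecoder(nn.Module):
    """Project every backbone scale to one width, upsample, sum, predict."""
    def __init__(self, in_chs=(96, 192, 384, 768), mid=64, n_classes=2):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(c, mid, 1) for c in in_chs])
        self.head = nn.Conv2d(mid, n_classes, 1)

    def forward(self, feats):
        # feats: list of (B, C_i, H_i, W_i), finest scale first
        target = feats[0].shape[-2:]
        fused = sum(F.interpolate(p(f), size=target, mode="bilinear",
                                  align_corners=False)
                    for p, f in zip(self.proj, feats))
        return self.head(fused)

feats = [torch.randn(1, c, 64 // s, 64 // s)
         for c, s in zip((96, 192, 384, 768), (4, 8, 16, 32))]
mask = MultiScaleDecoder()(feats)   # (1, n_classes, 16, 16)
```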
arXiv Detail & Related papers (2025-03-31T17:47:42Z) - EndoMamba: An Efficient Foundation Model for Endoscopic Videos [2.747826950754128]
Endoscopic video-based tasks, such as visual navigation and surgical phase recognition, play a crucial role in minimally invasive surgeries by providing real-time assistance.
Recent video foundation models have shown promise, but their applications are hindered by computational inefficiencies and suboptimal performance caused by limited data for training in endoscopy.
To address these issues, we present EndoMamba, a foundation model designed for real-time inference while incorporating generalized representations.
arXiv Detail & Related papers (2025-02-26T12:36:16Z) - A Temporal Convolutional Network-Based Approach and a Benchmark Dataset for Colonoscopy Video Temporal Segmentation [3.146247125118741]
ColonTCN is a learning-based architecture that employs custom temporal convolutional blocks to efficiently capture temporal dependencies for the temporal segmentation of colonoscopy videos. ColonTCN achieves state-of-the-art classification accuracy while maintaining a low parameter count when evaluated on the proposed benchmark. We believe that the proposed open-access benchmark and the ColonTCN approach represent a significant advancement in the temporal segmentation of colonoscopy procedures.
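A custom temporal convolutional block of the kind the summary names is typically a dilated 1-D convolution with a residual connection over per-frame features. A generic sketch (kernel size, dilation schedule, and residual form are assumptions):

```python
import torch
import torch.nn as nn

class TemporalConvBlock(nn.Module):
    """Dilated 1-D convolution with a residual connection (details assumed)."""
    def __init__(self, ch, dilation):
        super().__init__()
        self.conv = nn.Conv1d(ch, ch, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.act = nn.ReLU()

    def forward(self, x):                    # x: (B, C, T) per-frame features
        return x + self.act(self.conv(x))    # residual keeps deep stacks stable

# Growing dilations widen the temporal receptive field cheaply.
tcn = nn.Sequential(*[TemporalConvBlock(64, d) for d in (1, 2, 4, 8)])
segments = tcn(torch.randn(1, 64, 300))      # 300 frames, 64-dim features
```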
arXiv Detail & Related papers (2025-02-05T18:21:56Z) - WinTSR: A Windowed Temporal Saliency Rescaling Method for Interpreting Time Series Deep Learning Models [0.51795041186793]
We introduce a novel interpretation method, Windowed Temporal Saliency Rescaling (WinTSR).
We benchmark WinTSR against 10 recent interpretation techniques with 5 state-of-the-art deep-learning models of different architectures.
Our comprehensive analysis shows that WinTSR significantly outperforms other local interpretation methods in overall performance.
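One way to picture windowed temporal saliency: perturb one time window at a time, measure the change in the model's output, and rescale the scores. This toy sketch is a generic windowed-perturbation saliency, not necessarily WinTSR's exact rescaling rule:

```python
import torch

def windowed_saliency(model, x, window=4):
    """Zero out one time window at a time, measure the output change, and
    assign the (rescaled) change to every step in that window."""
    base = model(x)                           # x: (B, T, F)
    scores = torch.zeros(x.shape[1])
    for s in range(0, x.shape[1], window):
        xp = x.clone()
        xp[:, s:s + window] = 0.0             # perturb one temporal window
        scores[s:s + window] = (model(xp) - base).abs().mean()
    return scores / scores.max().clamp_min(1e-8)   # rescale to [0, 1]
```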
arXiv Detail & Related papers (2024-12-05T17:15:07Z) - STLight: a Fully Convolutional Approach for Efficient Predictive Learning by Spatio-Temporal joint Processing [6.872340834265972]
We propose STLight, a novel method for spatio-temporal learning that relies solely on channel-wise and depth-wise convolutions as learnable layers.
STLight overcomes the limitations of traditional convolutional approaches by rearranging spatial and temporal dimensions together.
Our architecture achieves state-of-the-art performance on STL benchmarks across datasets and settings, while significantly improving computational efficiency in terms of parameters and computational FLOPs.
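The "rearranging spatial and temporal dimensions together" idea can be illustrated by folding time into the channel axis and then applying only depth-wise and point-wise convolutions. A sketch under assumed shapes (the block layout is hypothetical):

```python
import torch
import torch.nn as nn

class STLightBlock(nn.Module):
    """Fold time into channels, then mix with depth-wise + point-wise convs."""
    def __init__(self, ch, t):
        super().__init__()
        c = ch * t                                    # joint space-time width
        self.depthwise = nn.Conv2d(c, c, 3, padding=1, groups=c)
        self.pointwise = nn.Conv2d(c, c, 1)           # channel-wise mixing

    def forward(self, x):                             # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        y = x.reshape(b, t * c, h, w)                 # rearrange dims together
        y = self.pointwise(self.depthwise(y))
        return y.reshape(b, t, c, h, w)

out = STLightBlock(ch=16, t=4)(torch.randn(2, 4, 16, 32, 32))
```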
arXiv Detail & Related papers (2024-11-15T13:53:19Z) - Cross Space and Time: A Spatio-Temporal Unitized Model for Traffic Flow Forecasting [16.782154479264126]
Predicting spatio-temporal traffic flow presents challenges due to complex interactions between spatial and temporal factors.
Existing approaches address these dimensions in isolation, neglecting their critical interdependencies.
In this paper, we introduce the Adaptive Spatio-Temporal Unitized Cell (ASTUC), a unified framework designed to capture both spatial and temporal dependencies.
arXiv Detail & Related papers (2024-11-14T07:34:31Z) - Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Our method ranks 1st on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z) - SSTFB: Leveraging self-supervised pretext learning and temporal self-attention with feature branching for real-time video polyp segmentation [4.027361638728112]
We propose a video polyp segmentation method that combines self-supervised learning as an auxiliary task with a spatial-temporal self-attention mechanism for improved representation learning.
Our experimental results demonstrate an improvement with respect to several state-of-the-art (SOTA) methods.
Our ablation study confirms that the proposed joint end-to-end training improves network accuracy by over 3% and nearly 10% on the Dice similarity coefficient and intersection-over-union, respectively.
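A generic temporal self-attention layer over per-frame descriptors illustrates the attention ingredient; the feature-branching and pretext-task details are omitted, and this stand-in is not the paper's exact design:

```python
import torch
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    """Each frame's descriptor attends to every other frame's descriptor."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                 # x: (B, T, dim) per-frame features
        y, _ = self.attn(x, x, x)         # temporal mixing across the clip
        return self.norm(x + y)           # residual + norm, transformer-style

out = TemporalSelfAttention()(torch.randn(2, 8, 256))   # 8-frame clip
```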
arXiv Detail & Related papers (2024-06-14T17:33:11Z) - Self-STORM: Deep Unrolled Self-Supervised Learning for Super-Resolution Microscopy [55.2480439325792]
We introduce deep unrolled self-supervised learning, which alleviates the need for such data by training a sequence-specific, model-based autoencoder.
Our proposed method exceeds the performance of its supervised counterparts.
arXiv Detail & Related papers (2024-03-25T17:40:32Z) - MeVGAN: GAN-based Plugin Model for Video Generation with Applications in
Colonoscopy [12.515404169717451]
We propose Memory Efficient Video GAN (MeVGAN), a plugin-type Generative Adversarial Network (GAN).
We use a pre-trained 2D-image GAN and construct trajectories in its noise space, so that a trajectory passed through the GAN model produces a realistic video.
We show that MeVGAN can produce good quality synthetic colonoscopy videos, which can be potentially used in virtual simulators.
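The plugin idea can be sketched as a small trainable network that takes steps in the noise space of a frozen, pre-trained image generator, which renders each latent point as a frame. The stand-in generator and trajectory module below are toy assumptions:

```python
import torch
import torch.nn as nn

def make_video(frozen_g, traj_net, z0, n_frames=16):
    """Walk through the noise space; the frozen generator renders each step."""
    frames, z = [], z0
    for _ in range(n_frames):
        z = z + traj_net(z)                # small learned step in noise space
        frames.append(frozen_g(z))         # frozen per-image generator
    return torch.stack(frames, dim=1)      # (B, T, C, H, W)

# Toy stand-ins for a real pre-trained 2D GAN and a trajectory module:
g = nn.Sequential(nn.Linear(64, 3 * 8 * 8), nn.Unflatten(1, (3, 8, 8)))
for p in g.parameters():
    p.requires_grad_(False)                # only the trajectory net trains
video = make_video(g, nn.Linear(64, 64), torch.randn(2, 64))
```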
arXiv Detail & Related papers (2023-11-07T10:58:16Z) - Time-series Generation by Contrastive Imitation [87.51882102248395]
We study a generative framework that seeks to combine the strengths of both approaches: motivated by a moment-matching objective to mitigate compounding error, we optimize a local (but forward-looking) transition policy.
At inference, the learned policy serves as the generator for iterative sampling, and the learned energy serves as a trajectory-level measure for evaluating sample quality.
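The inference loop the summary describes, in sketch form: roll out several candidate trajectories with the learned transition policy, then keep the one the learned energy scores best. The `policy` and `energy` interfaces here are assumptions:

```python
import torch

def sample_trajectory(policy, energy, x0, steps=50, candidates=8):
    """Roll out candidates with the transition policy; keep the one the
    learned energy scores best (lower energy = better sample)."""
    trajs = []
    for _ in range(candidates):
        x, path = x0, [x0]
        for _ in range(steps):
            x = policy(x)                  # local, forward-looking transition
            path.append(x)
        trajs.append(torch.stack(path))    # one full trajectory
    scores = torch.tensor([float(energy(t)) for t in trajs])
    return trajs[scores.argmin()]
```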
arXiv Detail & Related papers (2023-11-02T16:45:25Z) - Disentangling Spatial and Temporal Learning for Efficient Image-to-Video
Transfer Learning [59.26623999209235]
We present DiST, which disentangles the learning of spatial and temporal aspects of videos.
The disentangled learning in DiST is highly efficient because it avoids the back-propagation of massive pre-trained parameters.
Extensive experiments on five benchmarks show that DiST delivers better performance than existing state-of-the-art methods by convincing gaps.
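Avoiding back-propagation through massive pre-trained parameters can be pictured as freezing the spatial encoder and training only a lightweight temporal head. A sketch with stand-in modules (the real DiST components are more elaborate):

```python
import torch
import torch.nn as nn

spatial = nn.Linear(512, 512)              # stand-in pre-trained encoder
for p in spatial.parameters():
    p.requires_grad_(False)                # heavy weights stay frozen
temporal = nn.GRU(512, 256, batch_first=True)   # lightweight, trainable

frames = torch.randn(2, 16, 512)           # (B, T, feature)
with torch.no_grad():                      # no backprop into the encoder
    spat = spatial(frames)
out, _ = temporal(spat)                    # gradients flow here only
```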
arXiv Detail & Related papers (2023-09-14T17:58:33Z) - Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation [92.55296042611886]
We propose a framework called "Reuse and Diffuse", dubbed VidRD, to produce more frames following those already generated by a latent diffusion model (LDM).
We also propose a set of strategies for composing video-text data that involve diverse content from multiple existing datasets.
arXiv Detail & Related papers (2023-09-07T08:12:58Z) - YONA: You Only Need One Adjacent Reference-frame for Accurate and Fast
Video Polyp Detection [80.68520401539979]
YONA (You Only Need one Adjacent Reference-frame) is an efficient end-to-end training framework for video polyp detection.
Our proposed YONA outperforms previous state-of-the-art competitors by a large margin in both accuracy and speed.
arXiv Detail & Related papers (2023-06-06T13:53:15Z) - Transform-Equivariant Consistency Learning for Temporal Sentence
Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
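The consistency idea can be illustrated as follows: if the input video is shifted in time, the predicted temporal boundary should shift by the same amount. A toy loss under an assumed model interface (boundaries returned as frame-index tensors):

```python
import torch
import torch.nn.functional as F

def equivariance_loss(model, video, query, shift):
    """If the clip is shifted by `shift` frames, the predicted boundary
    should shift by the same amount (model returns frame-index tensors)."""
    start, end = model(video, query)
    start_s, end_s = model(torch.roll(video, shift, dims=1), query)
    pred = torch.stack([start_s, end_s]).float()
    target = torch.stack([start, end]).float() + shift
    return F.l1_loss(pred, target)
```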
arXiv Detail & Related papers (2023-05-06T19:29:28Z) - Towards Smooth Video Composition [59.134911550142455]
Video generation requires consistent and persistent frames with dynamic content over time.
This work investigates modeling the temporal relations for composing videos of arbitrary length, from a few frames to even infinitely many, using generative adversarial networks (GANs).
We show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality.
arXiv Detail & Related papers (2022-12-14T18:54:13Z) - Fast Non-Rigid Radiance Fields from Monocularized Data [66.74229489512683]
This paper proposes a new method for full 360° inward-facing novel view synthesis of non-rigidly deforming scenes.
At the core of our method are 1) An efficient deformation module that decouples the processing of spatial and temporal information for accelerated training and inference; and 2) A static module representing the canonical scene as a fast hash-encoded neural radiance field.
In both cases, our method is significantly faster than previous methods, converging in less than 7 minutes and achieving real-time framerates at 1K resolution, while obtaining a higher visual accuracy for generated novel views.
arXiv Detail & Related papers (2022-12-02T18:51:10Z) - CLTS-GAN: Color-Lighting-Texture-Specular Reflection Augmentation for
Colonoscopy [5.298287413134345]
CLTS-GAN is a new deep learning model that gives fine control over color, lighting, texture, and specular reflection for optical colonoscopy (OC) video frames.
We show that adding colonoscopy-specific augmentations to the training data can improve state-of-the-art polyp detection/segmentation methods.
arXiv Detail & Related papers (2022-06-29T23:51:16Z) - Unsupervised Shot Boundary Detection for Temporal Segmentation of Long
Capsule Endoscopy Videos [0.0]
Physicians use Capsule Endoscopy (CE) as a non-invasive and non-surgical procedure to examine the entire gastrointestinal (GI) tract.
A single CE examination can last between 8 and 11 hours, generating up to 80,000 frames, which are compiled into a video.
arXiv Detail & Related papers (2021-10-18T07:22:46Z) - StyleVideoGAN: A Temporal Generative Model using a Pretrained StyleGAN [70.31913835035206]
We present a novel approach to the video synthesis problem that helps to greatly improve visual quality.
We make use of a pre-trained StyleGAN network, the latent space of which allows control over the appearance of the objects it was trained for.
Our temporal architecture is then trained not on sequences of RGB frames, but on sequences of StyleGAN latent codes.
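The core trick in sketch form: a temporal model (here a stand-in LSTM) produces a sequence of latent codes, and a frozen pre-trained generator renders them into frames. Names and shapes below are assumptions:

```python
import torch
import torch.nn as nn

temporal = nn.LSTM(input_size=512, hidden_size=512, batch_first=True)
w0 = torch.randn(1, 1, 512)                  # starting latent code
codes, _ = temporal(w0.repeat(1, 16, 1))     # 16 latent codes = 16 frames
# frames = [stylegan.synthesis(w) for w in codes.unbind(1)]  # hypothetical
# frozen StyleGAN call; the image generator itself is never fine-tuned.
```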
arXiv Detail & Related papers (2021-07-15T09:58:15Z) - Colonoscopy Polyp Detection: Domain Adaptation From Medical Report
Images to Real-time Videos [76.37907640271806]
We propose an Image-video-joint polyp detection network (Ivy-Net) to address the domain gap between colonoscopy images from historical medical reports and real-time videos.
Experiments on the collected dataset demonstrate that our Ivy-Net achieves the state-of-the-art result on colonoscopy video.
arXiv Detail & Related papers (2020-12-31T10:33:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.