Multi-frame Collaboration for Effective Endoscopic Video Polyp Detection
via Spatial-Temporal Feature Transformation
- URL: http://arxiv.org/abs/2107.03609v1
- Date: Thu, 8 Jul 2021 05:17:30 GMT
- Title: Multi-frame Collaboration for Effective Endoscopic Video Polyp Detection
via Spatial-Temporal Feature Transformation
- Authors: Lingyun Wu, Zhiqiang Hu, Yuanfeng Ji, Ping Luo, Shaoting Zhang
- Abstract summary: We present Spatial-Temporal Feature Transformation (STFT), a multi-frame collaborative framework to address these issues.
For example, STFT mitigates inter-frame variations in the camera-moving situation with feature alignment by proposal-guided deformable convolutions.
Empirical studies and superior results demonstrate the effectiveness and stability of our method.
- Score: 28.01363432141765
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Precise localization of polyps is crucial for early cancer screening in
gastrointestinal endoscopy. Videos captured during endoscopy bring richer
contextual information but also more challenges than still images. The
camera-moving situation, instead of the common camera-fixed-object-moving one,
leads to significant background variation between frames. Severe internal
artifacts (e.g. water flow in the human body, specular reflection by tissues)
can make the quality of adjacent frames vary considerably. These factors
hinder a video-based model from effectively aggregating features from neighboring
frames to give better predictions. In this paper, we present Spatial-Temporal
Feature Transformation (STFT), a multi-frame collaborative framework to address
these issues. Spatially, STFT mitigates inter-frame variations in the
camera-moving situation with feature alignment by proposal-guided deformable
convolutions. Temporally, STFT proposes a channel-aware attention module to
simultaneously estimate the quality and correlation of adjacent frames for
adaptive feature aggregation. Empirical studies and superior results
demonstrate the effectiveness and stability of our method. For example, STFT
improves the still image baseline FCOS by 10.6% and 20.6% on the comprehensive
F1-score of the polyp localization task in CVC-Clinic and ASUMayo datasets,
respectively, and outperforms the state-of-the-art video-based method by 3.6%
and 8.0%, respectively. Code is available at
https://github.com/lingyunwu14/STFT.
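Below is a minimal PyTorch-style sketch of the two mechanisms named in the abstract: proposal-guided deformable alignment of neighbouring frames and channel-aware attention for adaptive aggregation. It is an illustrative assumption, not the authors' implementation (see the repository above for that); module names such as ProposalGuidedAlign and ChannelAwareAggregation are invented here, and the box-proposal guidance of the offsets is simplified to a purely feature-driven offset predictor.

import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class ProposalGuidedAlign(nn.Module):
    # Aligns a neighbour frame's feature map to the reference frame.
    # In STFT the offsets are additionally guided by box proposals; here they
    # are predicted only from the concatenated reference/neighbour features.
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # two offsets (dx, dy) per kernel sampling location
        self.offset_pred = nn.Conv2d(2 * channels, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=pad)
        self.deform = DeformConv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, ref_feat, nbr_feat):
        offset = self.offset_pred(torch.cat([ref_feat, nbr_feat], dim=1))
        return self.deform(nbr_feat, offset)


class ChannelAwareAggregation(nn.Module):
    # Per-channel weights that rate how useful each aligned neighbour frame is
    # relative to the reference frame, fused by a softmax across frames.
    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(2 * channels, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, channels),
        )

    def forward(self, ref_feat, aligned_feats):
        ref_vec = ref_feat.mean(dim=(2, 3))                          # (N, C)
        weights = []
        for feat in aligned_feats:                                   # list of (N, C, H, W)
            nbr_vec = feat.mean(dim=(2, 3))
            weights.append(self.fc(torch.cat([ref_vec, nbr_vec], dim=1)))
        weights = torch.softmax(torch.stack(weights, dim=0), dim=0)  # (T, N, C)
        return sum(w.unsqueeze(-1).unsqueeze(-1) * f
                   for w, f in zip(weights, aligned_feats))

Given per-frame backbone features of shape (N, C, H, W), each neighbour frame would first be aligned to the reference frame and the aligned maps then fused with the channel-wise weights before a detection head runs on the fused map.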
Related papers
- Automatic Image Unfolding and Stitching Framework for Esophageal Lining Video Based on Density-Weighted Feature Matching [6.995909617361624]
This paper introduces a novel automatic image unfolding and stitching framework tailored for esophageal videos captured during endoscopy.
The framework achieves low Root Mean Square Error (RMSE) and high Structural Similarity Index (SSIM) across extensive video sequences.
arXiv Detail & Related papers (2024-10-02T00:53:48Z)
- SSTFB: Leveraging self-supervised pretext learning and temporal self-attention with feature branching for real-time video polyp segmentation [4.027361638728112]
We propose a video polyp segmentation method that performs self-supervised learning as an auxiliary task and uses a spatial-temporal self-attention mechanism for improved representation learning.
Our experimental results demonstrate an improvement with respect to several state-of-the-art (SOTA) methods.
Our ablation study confirms that the choice of the proposed joint end-to-end training improves network accuracy by over 3% and nearly 10% on both the Dice similarity coefficient and intersection-over-union.
arXiv Detail & Related papers (2024-06-14T17:33:11Z)
- Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images [68.42215385041114]
This paper introduces a novel lightweight multi-level adaptation and comparison framework to repurpose the CLIP model for medical anomaly detection.
Our approach integrates multiple residual adapters into the pre-trained visual encoder, enabling a stepwise enhancement of visual features across different levels.
Our experiments on medical anomaly detection benchmarks demonstrate that our method significantly surpasses current state-of-the-art models.
arXiv Detail & Related papers (2024-03-19T09:28:19Z)
- YONA: You Only Need One Adjacent Reference-frame for Accurate and Fast Video Polyp Detection [80.68520401539979]
YONA (You Only Need One Adjacent Reference-frame) is an efficient end-to-end training framework for video polyp detection.
Our proposed YONA outperforms previous state-of-the-art competitors by a large margin in both accuracy and speed.
arXiv Detail & Related papers (2023-06-06T13:53:15Z)
- Hierarchical Transformer for Survival Prediction Using Multimodality Whole Slide Images and Genomics [63.76637479503006]
Learning good representation of giga-pixel level whole slide pathology images (WSI) for downstream tasks is critical.
This paper proposes a hierarchical-based multimodal transformer framework that learns a hierarchical mapping between pathology images and corresponding genes.
Our architecture requires fewer GPU resources compared with benchmark methods while maintaining better WSI representation ability.
arXiv Detail & Related papers (2022-11-29T23:47:56Z)
- Affinity Feature Strengthening for Accurate, Complete and Robust Vessel Segmentation [48.638327652506284]
Vessel segmentation is crucial in many medical image applications, such as detecting coronary stenoses, retinal vessel diseases and brain aneurysms.
We present a novel approach, the affinity feature strengthening network (AFN), which jointly models geometry and refines pixel-wise segmentation features using a contrast-insensitive, multiscale affinity approach.
arXiv Detail & Related papers (2022-11-12T05:39:17Z)
- Video-TransUNet: Temporally Blended Vision Transformer for CT VFSS Instance Segmentation [11.575821326313607]
We propose Video-TransUNet, a deep architecture for segmentation in medical CT videos constructed by integrating temporal feature blending into the TransUNet deep learning framework.
In particular, our approach amalgamates strong frame representation via a ResNet CNN backbone, multi-frame feature blending via a Temporal Context Module, and reconstructive capabilities for multiple targets via a UNet-based convolutional-deconvolutional architecture with multiple heads.
arXiv Detail & Related papers (2022-08-17T14:28:58Z)
- Retrieval of surgical phase transitions using reinforcement learning [11.130363429095048]
We introduce a novel reinforcement learning formulation for offline phase transition retrieval.
By construction, our model does not produce spurious and noisy phase transitions, but contiguous phase blocks.
We compare our method against the recent top-performing frame-based approaches TeCNO and Trans-SVNet.
arXiv Detail & Related papers (2022-08-01T14:43:15Z)
- GPU-accelerated SIFT-aided source identification of stabilized videos [63.084540168532065]
We exploit the parallelization capabilities of Graphics Processing Units (GPUs) in the framework of stabilised frames inversion.
We propose to exploit SIFT features to estimate the camera momentum and to identify less stabilized temporal segments.
Experiments confirm the effectiveness of the proposed approach in reducing the required computational time and improving the source identification accuracy.
arXiv Detail & Related papers (2022-07-29T07:01:31Z)
- DisCoVQA: Temporal Distortion-Content Transformers for Video Quality Assessment [56.42140467085586]
Some temporal variations cause temporal distortions and lead to extra quality degradations.
The human visual system often pays different attention to frames with different contents.
We propose a novel and effective transformer-based VQA method to tackle these two issues.
arXiv Detail & Related papers (2022-06-20T15:31:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.