YONA: You Only Need One Adjacent Reference-frame for Accurate and Fast
Video Polyp Detection
- URL: http://arxiv.org/abs/2306.03686v2
- Date: Sun, 30 Jul 2023 14:14:38 GMT
- Title: YONA: You Only Need One Adjacent Reference-frame for Accurate and Fast
Video Polyp Detection
- Authors: Yuncheng Jiang, Zixun Zhang, Ruimao Zhang, Guanbin Li, Shuguang Cui,
Zhen Li
- Abstract summary: YONA (You Only Need one Adjacent Reference-frame) is an efficient end-to-end training framework for video polyp detection.
Our proposed YONA outperforms previous state-of-the-art competitors by a large margin in both accuracy and speed.
- Score: 80.68520401539979
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate polyp detection is essential for assisting clinical rectal cancer
diagnoses. Colonoscopy videos contain richer information than still images,
making them a valuable resource for deep learning methods. Great efforts have
been made to conduct video polyp detection through multi-frame temporal/spatial
aggregation. However, unlike common fixed-camera video, the camera-moving scene
in colonoscopy videos can cause rapid video jitters, leading to unstable
training for existing video detection models. Additionally, the concealed
nature of some polyps and the complex background environment further hinder the
performance of existing video detectors. In this paper, we propose the
\textbf{YONA} (\textbf{Y}ou \textbf{O}nly \textbf{N}eed one \textbf{A}djacent
Reference-frame) method, an efficient end-to-end training framework for video
polyp detection. YONA fully exploits the information of one previous adjacent
frame and conducts polyp detection on the current frame without multi-frame
collaborations. Specifically, for the foreground, YONA adaptively aligns the
current frame's channel activation patterns with its adjacent reference frames
according to their foreground similarity. For the background, YONA conducts
background dynamic alignment guided by inter-frame difference to eliminate the
invalid features produced by drastic spatial jitters. Moreover, YONA applies
cross-frame contrastive learning during training, leveraging the ground truth
bounding box to improve the model's perception of polyp and background.
Quantitative and qualitative experiments on three public challenging benchmarks
demonstrate that our proposed YONA outperforms previous state-of-the-art
competitors by a large margin in both accuracy and speed.
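The abstract names three mechanisms: foreground channel alignment against the adjacent reference frame, background alignment gated by inter-frame difference, and cross-frame contrastive learning supervised by ground-truth boxes. Below is a minimal, hypothetical PyTorch sketch of what such components could look like; the function names, shapes, gating choices, and loss construction are assumptions made for illustration only, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code) of the three ideas in the abstract.
import torch
import torch.nn.functional as F


def foreground_channel_alignment(cur_feat, ref_feat):
    """Reweight the current frame's channels by their similarity to the
    adjacent reference frame's channel activation patterns (assumed design)."""
    # cur_feat, ref_feat: (B, C, H, W) feature maps from a shared backbone.
    b, c, h, w = cur_feat.shape
    cur_vec = cur_feat.flatten(2)                       # (B, C, H*W)
    ref_vec = ref_feat.flatten(2)                       # (B, C, H*W)
    sim = F.cosine_similarity(cur_vec, ref_vec, dim=2)  # per-channel similarity, (B, C)
    weights = torch.sigmoid(sim).view(b, c, 1, 1)       # gate channels by similarity
    return cur_feat * weights


def background_dynamic_alignment(cur_feat, ref_feat):
    """Suppress features where the inter-frame difference is large, i.e.
    regions dominated by spatial jitter (assumed design)."""
    diff = (cur_feat - ref_feat).abs().mean(dim=1, keepdim=True)  # (B, 1, H, W)
    gate = torch.exp(-diff)       # weight 1 where frames agree, lower where they differ
    return cur_feat * gate


def cross_frame_contrastive_loss(cur_feat, ref_feat, cur_mask, ref_mask, tau=0.1):
    """Pull in-box (polyp) embeddings of the two frames together and push them
    away from out-of-box (background) embeddings, with GT boxes rasterised as masks."""
    def pooled(feat, mask):  # mask: (B, 1, H, W), binary
        fg = (feat * mask).sum(dim=(2, 3)) / mask.sum(dim=(2, 3)).clamp(min=1.0)
        bg = (feat * (1 - mask)).sum(dim=(2, 3)) / (1 - mask).sum(dim=(2, 3)).clamp(min=1.0)
        return F.normalize(fg, dim=1), F.normalize(bg, dim=1)

    cur_fg, cur_bg = pooled(cur_feat, cur_mask)
    ref_fg, ref_bg = pooled(ref_feat, ref_mask)
    logits = torch.stack(
        [(cur_fg * ref_fg).sum(1),   # positive: polyp <-> polyp across frames
         (cur_fg * cur_bg).sum(1),   # negative: polyp <-> own background
         (cur_fg * ref_bg).sum(1)],  # negative: polyp <-> reference background
        dim=1) / tau
    target = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, target)


# Example usage with random tensors standing in for backbone features.
cur = torch.randn(2, 256, 32, 32)
ref = torch.randn(2, 256, 32, 32)
box_mask = torch.zeros(2, 1, 32, 32)
box_mask[:, :, 8:20, 8:20] = 1.0
aligned = background_dynamic_alignment(foreground_channel_alignment(cur, ref), ref)
loss = cross_frame_contrastive_loss(cur, ref, box_mask, box_mask)
```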
Related papers
- SSTFB: Leveraging self-supervised pretext learning and temporal self-attention with feature branching for real-time video polyp segmentation [4.027361638728112]
We propose a video polyp segmentation method that performs self-supervised learning as an auxiliary task and uses a spatial-temporal self-attention mechanism for improved representation learning.
Our experimental results demonstrate an improvement with respect to several state-of-the-art (SOTA) methods.
Our ablation study confirms that the choice of the proposed joint end-to-end training improves network accuracy by over 3% and nearly 10% on both the Dice similarity coefficient and intersection-over-union.
arXiv Detail & Related papers (2024-06-14T17:33:11Z)
- RIGID: Recurrent GAN Inversion and Editing of Real Face Videos [73.97520691413006]
GAN inversion is indispensable for applying the powerful editability of GAN to real images.
Existing methods invert video frames individually, often leading to undesired inconsistent results over time.
We propose a unified recurrent framework, named Recurrent Video GAN Inversion and Editing (RIGID).
Our framework learns the inherent coherence between input frames in an end-to-end manner.
arXiv Detail & Related papers (2023-08-11T12:17:24Z)
- Accurate Real-time Polyp Detection in Videos from Concatenation of Latent Features Extracted from Consecutive Frames [5.2009074009536524]
Convolutional neural networks (CNNs) are vulnerable to small changes in the input image.
A CNN-based model may miss the same polyp appearing in a series of consecutive frames.
We propose an efficient feature concatenation method for a CNN-based encoder-decoder model.
arXiv Detail & Related papers (2023-03-10T11:51:22Z)
- Contrastive Transformer-based Multiple Instance Learning for Weakly Supervised Polyp Frame Detection [30.51410140271929]
Current polyp detection methods from colonoscopy videos use exclusively normal (i.e., healthy) training images.
We formulate polyp detection as a weakly-supervised anomaly detection task that uses video-level labelled training data to detect frame-level polyps.
arXiv Detail & Related papers (2022-03-23T01:30:48Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- Multi-frame Collaboration for Effective Endoscopic Video Polyp Detection via Spatial-Temporal Feature Transformation [28.01363432141765]
We present Spatial-Temporal Feature Transformation (STFT), a multi-frame collaborative framework to address these issues.
For example, STFT mitigates inter-frame variations in the camera-moving situation with feature alignment by proposal-guided deformable convolutions.
Empirical studies and superior results demonstrate the effectiveness and stability of our method.
arXiv Detail & Related papers (2021-07-08T05:17:30Z)
- ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos [91.29436920371003]
We propose a simple yet effective architecture named Spatial-Temporal HOI Detection (ST-HOI)
We use temporal information such as human and object trajectories, correctly-localized visual features, and spatial-temporal masking pose features.
We construct a new video HOI benchmark dubbed VidHOI where our proposed approach serves as a solid baseline.
arXiv Detail & Related papers (2021-05-25T07:54:35Z)
- Frame-rate Up-conversion Detection Based on Convolutional Neural Network for Learning Spatiotemporal Features [7.895528973776606]
This paper proposes a frame-rate conversion detection network (FCDNet) that learns forensic features caused by FRUC in an end-to-end fashion.
FCDNet uses a stack of consecutive frames as the input and effectively learns artifacts using network blocks to learn features.
arXiv Detail & Related papers (2021-03-25T08:47:46Z)
- Colonoscopy Polyp Detection: Domain Adaptation From Medical Report Images to Real-time Videos [76.37907640271806]
We propose an Image-video-joint polyp detection network (Ivy-Net) to address the domain gap between colonoscopy images from historical medical reports and real-time videos.
Experiments on the collected dataset demonstrate that our Ivy-Net achieves the state-of-the-art result on colonoscopy video.
arXiv Detail & Related papers (2020-12-31T10:33:09Z)
- Robust Unsupervised Video Anomaly Detection by Multi-Path Frame Prediction [61.17654438176999]
We propose a novel and robust unsupervised video anomaly detection method by frame prediction with proper design.
Our proposed method obtains the frame-level AUROC score of 88.3% on the CUHK Avenue dataset.
arXiv Detail & Related papers (2020-11-05T11:34:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.