Related papers: A Spatial-Temporal Deformable Attention based Framework for Breast Lesion Detection in Videos

URL: http://arxiv.org/abs/2309.04702v1
Date: Sat, 9 Sep 2023 07:00:10 GMT
Title: A Spatial-Temporal Deformable Attention based Framework for Breast Lesion Detection in Videos
Authors: Chao Qin and Jiale Cao and Huazhu Fu and Rao Muhammad Anwer and Fahad Shahbaz Khan
Abstract summary: We propose a spatial-temporal deformable attention based framework, named STNet. Our STNet introduces a spatial-temporal deformable attention module to perform local spatial-temporal feature fusion. Experiments on the public breast lesion ultrasound video dataset show that our STNet obtains a state-of-the-art detection performance.
Score: 107.96514633713034
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Detecting breast lesions in videos is crucial for computer-aided diagnosis. Existing video-based breast lesion detection approaches typically perform temporal feature aggregation of deep backbone features based on the self-attention operation. We argue that such a strategy struggles to effectively perform deep feature aggregation and ignores useful local information. To tackle these issues, we propose a spatial-temporal deformable attention based framework, named STNet. Our STNet introduces a spatial-temporal deformable attention module to perform local spatial-temporal feature fusion. The spatial-temporal deformable attention module enables deep feature aggregation in each stage of both the encoder and the decoder. To further accelerate detection, we introduce an encoder feature shuffle strategy for multi-frame prediction during inference. In our encoder feature shuffle strategy, we share the backbone and encoder features, and shuffle the encoder features for the decoder to generate predictions for multiple frames. Experiments on the public breast lesion ultrasound video dataset show that our STNet obtains state-of-the-art detection performance while running inference twice as fast. The code and model are available at https://github.com/AlfredQin/STNet.

Related papers

Deepfake Detection with Spatio-Temporal Consistency and Attention [46.1135899490656]
Deepfake videos are causing growing concerns among communities due to their ever-increasing realism. Current methods for detecting forged videos rely mainly on global frame features. We propose a neural Deepfake detector that focuses on the localized manipulative signatures of the forged videos.
arXiv Detail & Related papers (2025-02-12T08:51:33Z)
Skeleton-Guided Spatial-Temporal Feature Learning for Video-Based Visible-Infrared Person Re-Identification [2.623742123778503]
Video-based visible-infrared person re-identification (VVI-ReID) is challenging due to significant modality feature discrepancies. We propose a novel Skeleton-guided spatial-temporal feAture leaRning (STAR) method for VVI-ReID.
arXiv Detail & Related papers (2024-11-17T13:18:05Z)
TSdetector: Temporal-Spatial Self-correction Collaborative Learning for Colonoscopy Video Detection [19.00902297385955]
We propose a novel Temporal-Spatial self-correction detector (TSdetector), which integrates temporal-level consistency learning and spatial-level reliability learning to detect objects continuously. The experimental results on three publicly available polyp video datasets show that TSdetector achieves the highest polyp detection rate and outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-09-30T06:19:29Z)
Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts [57.01985221057047]
This paper introduces a novel method that learns spatio-temporal prompt embeddings for weakly supervised video anomaly detection and localization (WSVADL) based on pre-trained vision-language models (VLMs). Our method achieves state-of-the-art performance on three public benchmarks for the WSVADL task.
arXiv Detail & Related papers (2024-08-12T03:31:29Z)
Point Cloud Video Anomaly Detection Based on Point Spatio-Temporal Auto-Encoder [1.4340883856076097]
We propose Point Spatio-Temporal Auto-Encoder (PSTAE), an autoencoder framework that uses point cloud videos as input to detect anomalies in point cloud videos. Our method sets a new state-of-the-art (SOTA) on the TIMo dataset.
arXiv Detail & Related papers (2023-06-04T10:30:28Z)
You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding (TSG) aims to locate a target moment semantically according to a sentence query. Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames. We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
Pedestrian Spatio-Temporal Information Fusion For Video Anomaly Detection [1.5736899098702974]
An anomaly detection method is proposed to integrate the information of pedestrians. Anomaly detection is realized according to the difference between the output frame and the true value. The experimental results on the CUHK Avenue and ShanghaiTech datasets show that the proposed method is superior to the current mainstream video anomaly detection methods.
arXiv Detail & Related papers (2022-11-18T06:41:02Z)
Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection. A co-attention formulation is utilized to combine the low-level and high-level features. We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
Deep Video Inpainting Detection [95.36819088529622]
Video inpainting detection localizes an inpainted region in a video both spatially and temporally. VIDNet, Video Inpainting Detection Network, contains a two-stream encoder-decoder architecture with attention module.
arXiv Detail & Related papers (2021-01-26T20:53:49Z)
DS-Net: Dynamic Spatiotemporal Network for Video Salient Object Detection [78.04869214450963]
We propose a novel dynamic spatiotemporal network (DSNet) for more effective fusion of temporal and spatial information. We show that the proposed method outperforms state-of-the-art algorithms.
arXiv Detail & Related papers (2020-12-09T06:42:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.