REMOTE: Real-time Ego-motion Tracking for Various Endoscopes via Multimodal Visual Feature Learning
- URL: http://arxiv.org/abs/2501.18124v2
- Date: Sun, 02 Feb 2025 14:32:01 GMT
- Title: REMOTE: Real-time Ego-motion Tracking for Various Endoscopes via Multimodal Visual Feature Learning
- Authors: Liangjing Shao, Benshuang Chen, Shuting Zhao, Xinrong Chen,
- Abstract summary: A novel framework is proposed to perform real-time ego-motion tracking for endoscope.
A multi-modal visual feature learning network is proposed to perform relative pose prediction.
The absolute pose of endoscope is calculated based on relative poses.
- Score: 0.7499722271664147
- License:
- Abstract: Real-time ego-motion tracking for endoscope is a significant task for efficient navigation and robotic automation of endoscopy. In this paper, a novel framework is proposed to perform real-time ego-motion tracking for endoscope. Firstly, a multi-modal visual feature learning network is proposed to perform relative pose prediction, in which the motion feature from the optical flow, the scene features and the joint feature from two adjacent observations are all extracted for prediction. Due to more correlation information in the channel dimension of the concatenated image, a novel feature extractor is designed based on an attention mechanism to integrate multi-dimensional information from the concatenation of two continuous frames. To extract more complete feature representation from the fused features, a novel pose decoder is proposed to predict the pose transformation from the concatenated feature map at the end of the framework. At last, the absolute pose of endoscope is calculated based on relative poses. The experiment is conducted on three datasets of various endoscopic scenes and the results demonstrate that the proposed method outperforms state-of-the-art methods. Besides, the inference speed of the proposed method is over 30 frames per second, which meets the real-time requirement. The project page is here: remote-bmxs.netlify.app
Related papers
- H-Net: A Multitask Architecture for Simultaneous 3D Force Estimation and Stereo Semantic Segmentation in Intracardiac Catheters [0.0]
Vision-based deep learning models can deliver both tactile and visual information in a sensor-free manner.
There is a lack of a comprehensive architecture capable of simultaneously segmenting the catheter from two different angles.
This work proposes a novel, lightweight, multi-input, multi-output encoder-decoder-based architecture.
arXiv Detail & Related papers (2024-12-31T15:55:13Z) - LACOSTE: Exploiting stereo and temporal contexts for surgical instrument segmentation [14.152207010509763]
We propose a novel LACOSTE model that exploits Location-Agnostic COntexts in Stereo and TEmporal images for improved surgical instrument segmentation.
We extensively validate our approach on three public surgical video datasets.
arXiv Detail & Related papers (2024-09-14T08:17:56Z) - DD-VNB: A Depth-based Dual-Loop Framework for Real-time Visually Navigated Bronchoscopy [5.8722774441994074]
We propose a Depth-based Dual-Loop framework for real-time Visually Navigated Bronchoscopy (DD-VNB)
The DD-VNB framework integrates two key modules: depth estimation and dual-loop localization.
Experiments on phantom and in-vivo data from patients demonstrate the effectiveness of our framework.
arXiv Detail & Related papers (2024-03-04T02:29:02Z) - Neuromorphic Synergy for Video Binarization [54.195375576583864]
Bimodal objects serve as a visual form to embed information that can be easily recognized by vision systems.
Neuromorphic cameras offer new capabilities for alleviating motion blur, but it is non-trivial to first de-blur and then binarize the images in a real-time manner.
We propose an event-based binary reconstruction method that leverages the prior knowledge of the bimodal target's properties to perform inference independently in both event space and image space.
We also develop an efficient integration method to propagate this binary image to high frame rate binary video.
arXiv Detail & Related papers (2024-02-20T01:43:51Z) - Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
arXiv Detail & Related papers (2023-03-14T02:58:27Z) - BEVerse: Unified Perception and Prediction in Birds-Eye-View for
Vision-Centric Autonomous Driving [92.05963633802979]
We present BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems.
We show that the multi-task BEVerse outperforms single-task methods on 3D object detection, semantic map construction, and motion prediction.
arXiv Detail & Related papers (2022-05-19T17:55:35Z) - A Spatio-Temporal Multilayer Perceptron for Gesture Recognition [70.34489104710366]
We propose a multilayer state-weighted perceptron for gesture recognition in the context of autonomous vehicles.
An evaluation of TCG and Drive&Act datasets is provided to showcase the promising performance of our approach.
We deploy our model to our autonomous vehicle to show its real-time capability and stable execution.
arXiv Detail & Related papers (2022-04-25T08:42:47Z) - Spatio-Temporal Self-Attention Network for Video Saliency Prediction [13.873682190242365]
3D convolutional neural networks have achieved promising results for video tasks in computer vision.
We propose a novel Spatio-Temporal Self-Temporal Self-Attention 3 Network (STSANet) for video saliency prediction.
arXiv Detail & Related papers (2021-08-24T12:52:47Z) - Spatial-Temporal Correlation and Topology Learning for Person
Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z) - Attention-Driven Body Pose Encoding for Human Activity Recognition [0.0]
This article proposes a novel attention-based body pose encoding for human activity recognition.
The enriched data complements the 3D body joint position data and improves model performance.
arXiv Detail & Related papers (2020-09-29T22:17:17Z) - Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot object segmentation.
An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder.
In this way, the encoder becomes deeply internative, allowing for closely hierarchical interactions between object motion and appearance.
arXiv Detail & Related papers (2020-03-09T16:58:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.