Related papers: TransForSeg: A Multitask Stereo ViT for Joint Stereo Segmentation and 3D Force Estimation in Catheterization

URL: http://arxiv.org/abs/2509.01605v1
Date: Mon, 01 Sep 2025 16:36:23 GMT
Title: TransForSeg: A Multitask Stereo ViT for Joint Stereo Segmentation and 3D Force Estimation in Catheterization
Authors: Pedram Fekri, Mehrdad Zadeh, Javad Dargahi
Abstract summary: We propose a novel encoder-decoder Vision Transformer model that processes two input X-ray images as separate sequences. The proposed model is a stereo Vision Transformer capable of simultaneously segmenting the catheter from two angles while estimating the generated forces at its tip in 3D.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, the emergence of multitask deep learning models has enhanced catheterization procedures by providing tactile and visual perception data through an end-to-end architecture. This information is derived from a segmentation and force estimation head, which localizes the catheter in X-ray images and estimates the applied pressure based on its deflection within the image. These stereo vision architectures incorporate a CNN-based encoder-decoder that captures the dependencies between X-ray images from two viewpoints, enabling simultaneous 3D force estimation and stereo segmentation of the catheter. With these tasks in mind, this work approaches the problem from a new perspective. We propose a novel encoder-decoder Vision Transformer model that processes two input X-ray images as separate sequences. Given sequences of X-ray patches from two perspectives, the transformer captures long-range dependencies without the need to gradually expand the receptive field for either image. The embeddings generated by both the encoder and decoder are fed into two shared segmentation heads, while a regression head employs the fused information from the decoder for 3D force estimation. The proposed model is a stereo Vision Transformer capable of simultaneously segmenting the catheter from two angles while estimating the generated forces at its tip in 3D. This model has undergone extensive experiments on synthetic X-ray images with various noise levels and has been compared against state-of-the-art pure segmentation models, vision-based catheter force estimation methods, and a multitask catheter segmentation and force estimation approach. It outperforms existing models, setting a new state-of-the-art in both catheter segmentation and force estimation.
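Illustration (not from the paper): the abstract's idea of treating two X-ray views as separate patch sequences, with one sequence attending to the other, can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. The image size, the patch size, the choice of which view supplies the queries, and the helper names `patchify` and `cross_attention` are all hypothetical, and the learned projections, positional embeddings, segmentation heads, and regression head are omitted.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W) image into a sequence of flattened patch x patch tokens."""
    h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    # Axes after reshape: (row block, row in patch, column block, column in patch).
    grid = image.reshape(h // patch, patch, w // patch, patch)
    # Bring the two block axes together, then flatten each patch into one token.
    return grid.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention with no learned projections."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d))
    return weights @ keys_values, weights

# Two synthetic stand-ins for X-ray views (hypothetical 32x32 size).
rng = np.random.default_rng(0)
view_a = rng.random((32, 32))
view_b = rng.random((32, 32))

tokens_a = patchify(view_a, patch=8)  # one sequence per view
tokens_b = patchify(view_b, patch=8)

# Assumption: tokens of view B query the tokens of view A.
fused, weights = cross_attention(tokens_b, tokens_a)
```

Every token of one view can weight every token of the other in a single step, which is the sense in which a transformer captures long-range dependencies without gradually expanding a receptive field.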

Related papers

Structured Spectral Graph Representation Learning for Multi-label Abnormality Analysis from 3D CT Scans [0.0]
Multi-label classification of 3D Chest CT scans remains a critical yet challenging problem. Existing methods based on 3D convolutional neural networks struggle to capture long-range dependencies. We propose a new graph-based framework that represents 3D CT volumes as structured graphs.
arXiv Detail & Related papers (2025-10-12T19:49:51Z)
Accelerating 3D Photoacoustic Computed Tomography with End-to-End Physics-Aware Neural Operators [74.65171736966131]
Photoacoustic computed tomography (PACT) combines optical contrast with ultrasonic resolution, achieving deep-tissue imaging beyond the optical diffusion limit. Current implementations require dense transducer arrays and prolonged acquisition times, limiting clinical translation. We introduce Pano, an end-to-end physics-aware model that directly learns the inverse acoustic mapping from sensor measurements to volumetric reconstructions.
arXiv Detail & Related papers (2025-09-11T23:12:55Z)
DVG-Diffusion: Dual-View Guided Diffusion Model for CT Reconstruction from X-Rays [32.55527512602604]
We facilitate complex 2D X-ray image to 3D CT mapping by incorporating new view synthesis, and reduce the learning difficulty through view-guided feature alignment. Specifically, we propose a dual-view guided diffusion model (DVG-Diffusion), which couples a real input X-ray view and a synthesized new X-ray view to jointly guide CT reconstruction.
arXiv Detail & Related papers (2025-03-22T16:03:18Z)
CeViT: Copula-Enhanced Vision Transformer in multi-task learning and bi-group image covariates with an application to myopia screening [9.928208927136874]
We present a Vision Transformer-based bi-channel architecture, named CeViT, where the common features of a pair of eyes are extracted via a shared Transformer encoder. We demonstrate that CeViT enhances the baseline model in both accuracy of classifying high-myopia and prediction of AL on both eyes.
arXiv Detail & Related papers (2025-01-11T13:23:56Z)
Swin-X2S: Reconstructing 3D Shape from 2D Biplanar X-ray with Swin Transformers [8.357602965532923]
Swin-X2S is an end-to-end deep learning method for reconstructing 3D segmentation and labeling from 2D X-ray images. A dimension-expanding module is introduced to bridge the encoder and decoder, ensuring a smooth conversion from 2D pixels to 3D voxels.
arXiv Detail & Related papers (2025-01-10T13:41:10Z)
H-Net: A Multitask Architecture for Simultaneous 3D Force Estimation and Stereo Semantic Segmentation in Intracardiac Catheters [0.0]
Vision-based deep learning models can deliver both tactile and visual information in a sensor-free manner. There is a lack of a comprehensive architecture capable of simultaneously segmenting the catheter from two different angles. This work proposes a novel, lightweight, multi-input, multi-output encoder-decoder-based architecture.
arXiv Detail & Related papers (2024-12-31T15:55:13Z)
Intraoperative Registration by Cross-Modal Inverse Neural Rendering [61.687068931599846]
We present a novel approach for 3D/2D intraoperative registration during neurosurgery via cross-modal inverse neural rendering. Our approach separates implicit neural representation into two components, handling anatomical structure preoperatively and appearance intraoperatively. We tested our method on retrospective patients' data from clinical cases, showing that our method outperforms state-of-the-art while meeting current clinical standards for registration.
arXiv Detail & Related papers (2024-09-18T13:40:59Z)
Geometry-Aware Attenuation Learning for Sparse-View CBCT Reconstruction [53.93674177236367]
Cone Beam Computed Tomography (CBCT) plays a vital role in clinical imaging. Traditional methods typically require hundreds of 2D X-ray projections to reconstruct a high-quality 3D CBCT image. This has led to a growing interest in sparse-view CBCT reconstruction to reduce radiation doses. We introduce a novel geometry-aware encoder-decoder framework to solve this problem.
arXiv Detail & Related papers (2023-03-26T14:38:42Z)
Self-Supervised Generative-Contrastive Learning of Multi-Modal Euclidean Input for 3D Shape Latent Representations: A Dynamic Switching Approach [53.376029341079054]
We propose a combined generative and contrastive neural architecture for learning latent representations of 3D shapes. The architecture uses two encoder branches for voxel grids and multi-view images from the same underlying shape.
arXiv Detail & Related papers (2023-01-11T18:14:24Z)
View-Disentangled Transformer for Brain Lesion Detection [50.4918615815066]
We propose a novel view-disentangled transformer to enhance the extraction of MRI features for more accurate tumour detection. First, the proposed transformer harvests long-range correlation among different positions in a 3D brain scan. Second, the transformer models a stack of slice features as multiple 2D views and enhances these features view-by-view. Third, we deploy the proposed transformer module in a transformer backbone, which can effectively detect the 2D regions surrounding brain lesions.
arXiv Detail & Related papers (2022-09-20T11:58:23Z)
Focused Decoding Enables 3D Anatomical Detection by Transformers [64.36530874341666]
We propose a novel Detection Transformer for 3D anatomical structure detection, dubbed Focused Decoder. Focused Decoder leverages information from an anatomical region atlas to simultaneously deploy query anchors and restrict the cross-attention's field of view. We evaluate our proposed approach on two publicly available CT datasets and demonstrate that Focused Decoder not only provides strong detection results and thus alleviates the need for a vast amount of annotated data but also exhibits exceptional and highly intuitive explainability of results via attention weights.
arXiv Detail & Related papers (2022-07-21T22:17:21Z)
CyTran: A Cycle-Consistent Transformer with Multi-Level Consistency for Non-Contrast to Contrast CT Translation [56.622832383316215]
We propose a novel approach to translate unpaired contrast computed tomography (CT) scans to non-contrast CT scans. Our approach is based on cycle-consistent generative adversarial convolutional transformers, for short, CyTran. Our empirical results show that CyTran outperforms all competing methods.
arXiv Detail & Related papers (2021-10-12T23:25:03Z)
The entire network structure of Crossmodal Transformer [4.605531191013731]
The proposed approach first deep learns skeletal features from 2D X-ray and 3D CT images. As a result, the well-trained network can directly predict the spatial correspondence between arbitrary 2D X-ray and 3D CT.
arXiv Detail & Related papers (2021-04-29T11:47:31Z)

This list is automatically generated from the titles and abstracts of the papers in this site.