DualTrack: Sensorless 3D Ultrasound needs Local and Global Context
- URL: http://arxiv.org/abs/2509.09530v1
- Date: Thu, 11 Sep 2025 15:13:02 GMT
- Title: DualTrack: Sensorless 3D Ultrasound needs Local and Global Context
- Authors: Paul F. R. Wilson, Matteo Ronchetti, Rüdiger Göbl, Viktoria Markova, Sebastian Rosenzweig, Raphael Prevost, Parvin Mousavi, Oliver Zettinig,
- Abstract summary: Three-dimensional ultrasound (US) offers many clinical advantages over conventional 2D imaging, yet its widespread adoption is limited by the cost and complexity of traditional 3D systems. Sensorless 3D US, which uses deep learning to estimate a 3D probe trajectory from a sequence of 2D US images, is a promising alternative. We propose DualTrack, a novel dual-encoder architecture that leverages decoupled local and global encoders specialized for their respective scales of feature extraction.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Three-dimensional ultrasound (US) offers many clinical advantages over conventional 2D imaging, yet its widespread adoption is limited by the cost and complexity of traditional 3D systems. Sensorless 3D US, which uses deep learning to estimate a 3D probe trajectory from a sequence of 2D US images, is a promising alternative. Local features, such as speckle patterns, can help predict frame-to-frame motion, while global features, such as coarse shapes and anatomical structures, can situate the scan relative to anatomy and help predict its general shape. In prior approaches, global features are either ignored or tightly coupled with local feature extraction, restricting the ability to robustly model these two complementary aspects. We propose DualTrack, a novel dual-encoder architecture that leverages decoupled local and global encoders specialized for their respective scales of feature extraction. The local encoder uses dense spatiotemporal convolutions to capture fine-grained features, while the global encoder utilizes an image backbone (e.g., a 2D CNN or foundation model) and temporal attention layers to embed high-level anatomical features and long-range dependencies. A lightweight fusion module then combines these features to estimate the trajectory. Experimental results on a large public benchmark show that DualTrack achieves state-of-the-art accuracy and globally consistent 3D reconstructions, outperforming previous methods and yielding an average reconstruction error below 5 mm.
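The architecture described in the abstract (a local encoder built from dense spatiotemporal convolutions, a global encoder built from a per-frame 2D backbone plus temporal attention, and a lightweight fusion module that regresses the trajectory) can be sketched in PyTorch. This is a minimal illustrative sketch only: the layer sizes, the toy 2D backbone, and the 6-DoF per-frame motion output are assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class DualTrackSketch(nn.Module):
    """Illustrative dual-encoder trajectory estimator (not the paper's exact model)."""

    def __init__(self, d=32):
        super().__init__()
        # Local encoder: dense spatiotemporal (3D) convolutions over the frame stack,
        # intended to pick up fine-grained cues such as speckle patterns.
        self.local = nn.Sequential(
            nn.Conv3d(1, d, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool spatial dims, keep time axis
        )
        # Global encoder: a per-frame 2D backbone (stand-in for a CNN or foundation
        # model) followed by temporal self-attention for long-range dependencies.
        self.backbone = nn.Sequential(
            nn.Conv2d(1, d, kernel_size=7, stride=4),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.temporal = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        # Lightweight fusion head -> assumed 6-DoF inter-frame motion
        # (3 translations + 3 rotations) per frame.
        self.head = nn.Linear(2 * d, 6)

    def forward(self, frames):  # frames: (B, T, H, W) grayscale US sequence
        B, T, H, W = frames.shape
        loc = self.local(frames.unsqueeze(1))                  # (B, d, T, 1, 1)
        loc = loc.squeeze(-1).squeeze(-1).transpose(1, 2)      # (B, T, d)
        glob = self.backbone(frames.reshape(B * T, 1, H, W))   # (B*T, d, 1, 1)
        glob = glob.reshape(B, T, -1)                          # (B, T, d)
        glob, _ = self.temporal(glob, glob, glob)              # temporal context
        return self.head(torch.cat([loc, glob], dim=-1))       # (B, T, 6)
```

Keeping the two encoders decoupled, as the abstract argues, lets the convolutional branch stay sensitive to fine local statistics while the attention-based branch models coarse anatomy and long-range structure independently before fusion.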
Related papers
- Training-Free Zero-Shot Anomaly Detection in 3D Brain MRI with 2D Foundation Models
We introduce a fully training-free framework for ZSAD in 3D brain MRI. The framework constructs localized volumetric tokens by aggregating multi-axis slices processed by 2D foundation models. These 3D patch tokens restore cubic spatial context and integrate directly with distance-based, batch-level anomaly detection pipelines.
arXiv Detail & Related papers (2026-02-17T02:46:45Z)
- Joint Semantic and Rendering Enhancements in 3D Gaussian Modeling with Anisotropic Local Encoding
We propose a joint enhancement framework for 3D semantic Gaussian modeling that synergizes both semantic and rendering branches. Unlike conventional point cloud shape encoding, we introduce an anisotropic 3D Gaussian Chebyshev descriptor to capture fine-grained 3D shape details. We employ a cross-scene knowledge transfer module to continuously update learned shape patterns, enabling faster convergence and robust representations.
arXiv Detail & Related papers (2026-01-05T18:33:50Z)
- Where, Not What: Compelling Video LLMs to Learn Geometric Causality for 3D-Grounding
We propose a novel training framework called What-Where Representation Re-Forming (W2R2) to tackle this issue. Our approach fundamentally reshapes the model's internal space by designating 2D features as semantic beacons for "What" identification and 3D features as spatial anchors for "Where" localization. Experiments conducted on ScanRefer and ScanQA demonstrate the effectiveness of W2R2, with significant gains in localization accuracy and robustness.
arXiv Detail & Related papers (2025-10-19T22:40:18Z)
- Global-Aware Monocular Semantic Scene Completion with State Space Models
Monocular Semantic Scene Completion (MonoSSC) reconstructs and interprets 3D environments from a single image. Existing methods are often constrained by the local receptive field of Convolutional Networks (CNNs). We introduce GA-MonoSSC, a hybrid architecture for MonoSSC that effectively captures global context in both the 2D image domain and 3D space.
arXiv Detail & Related papers (2025-03-09T11:55:40Z)
- Enhancing Free-hand 3D Photoacoustic and Ultrasound Reconstruction using Deep Learning
This study introduces a motion-based learning network with a global-local self-attention module (MoGLo-Net) to enhance 3D reconstruction in handheld photoacoustic and ultrasound (PAUS) imaging. MoGLo-Net exploits critical regions, such as fully-developed speckle areas or high-echogenic tissue areas within successive ultrasound images, to accurately estimate motion parameters.
arXiv Detail & Related papers (2025-02-05T11:59:23Z)
- GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency
Existing 3D affordance learning methods struggle with generalization and robustness due to limited annotated data. We propose GEAL, a novel framework designed to enhance the generalization and robustness of 3D affordance learning by leveraging large-scale pre-trained 2D models. GEAL consistently outperforms existing methods across seen and novel object categories, as well as corrupted data.
arXiv Detail & Related papers (2024-12-12T17:59:03Z)
- Graph and Skipped Transformer: Exploiting Spatial and Temporal Modeling Capacities for Efficient 3D Human Pose Estimation
We take a global approach to exploiting spatio-temporal information with a concise Graph and Skipped Transformer architecture.
Specifically, in the 3D pose stage, coarse-grained body parts are deployed to construct a fully data-driven adaptive model.
Experiments are conducted on the Human3.6M, MPI-INF-3DHP, and HumanEva benchmarks.
arXiv Detail & Related papers (2024-07-03T10:42:09Z) - ALSTER: A Local Spatio-Temporal Expert for Online 3D Semantic
Reconstruction [62.599588577671796]
We propose an online 3D semantic segmentation method that incrementally reconstructs a 3D semantic map from a stream of RGB-D frames.
Unlike offline methods, ours is directly applicable to scenarios with real-time constraints, such as robotics or mixed reality.
arXiv Detail & Related papers (2023-11-29T20:30:18Z) - 3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features [70.50665869806188]
3DiffTection is a state-of-the-art method for 3D object detection from single images.
We fine-tune a diffusion model to perform novel view synthesis conditioned on a single image.
We further train the model on target data with detection supervision.
arXiv Detail & Related papers (2023-11-07T23:46:41Z) - NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized
Device Coordinates Space [77.6067460464962]
Monocular 3D Semantic Scene Completion (SSC) has garnered significant attention in recent years due to its potential to predict complex semantics and geometry shapes from a single image, requiring no 3D inputs.
We identify several critical issues in current state-of-the-art methods, including the Feature Ambiguity of projected 2D features in the ray to the 3D space, the Pose Ambiguity of the 3D convolution, and the Imbalance in the 3D convolution across different depth levels.
We devise a novel Normalized Device Coordinates scene completion network (NDC-Scene) that directly extends the 2
arXiv Detail & Related papers (2023-09-26T02:09:52Z) - Unleash the Potential of Image Branch for Cross-modal 3D Object
Detection [67.94357336206136]
We present a new cross-modal 3D object detector, namely UPIDet, which aims to unleash the potential of the image branch from two aspects.
First, UPIDet introduces a new 2D auxiliary task called normalized local coordinate map estimation.
Second, we discover that the representational capability of the point cloud backbone can be enhanced through the gradients backpropagated from the training objectives of the image branch.
arXiv Detail & Related papers (2023-01-22T08:26:58Z) - Homography Loss for Monocular 3D Object Detection [54.04870007473932]
A differentiable loss function, termed Homography Loss, is proposed to achieve this goal, exploiting both 2D and 3D information.
Our method outperforms other state-of-the-art methods by a large margin on the KITTI 3D datasets.
arXiv Detail & Related papers (2022-04-02T03:48:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.