Speedy MASt3R
- URL: http://arxiv.org/abs/2503.10017v1
- Date: Thu, 13 Mar 2025 03:56:22 GMT
- Title: Speedy MASt3R
- Authors: Jingxing Li, Yongjae Lee, Abhay Kumar Yadav, Cheng Peng, Rama Chellappa, Deliang Fan,
- Abstract summary: MASt3R redefines image matching as a 3D task by leveraging DUSt3R and introducing a fast reciprocal matching scheme.<n>Fast MASt3R achieves a 54% reduction in inference time (198 ms to 91 ms per image pair) without sacrificing accuracy.<n>This advancement enables real-time 3D understanding, benefiting applications like mixed reality navigation and large-scale 3D scene reconstruction.
- Score: 68.47052557089631
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Image matching is a key component of modern 3D vision algorithms, essential for accurate scene reconstruction and localization. MASt3R redefines image matching as a 3D task by leveraging DUSt3R and introducing a fast reciprocal matching scheme that accelerates matching by orders of magnitude while preserving theoretical guarantees. This approach has gained strong traction, with DUSt3R and MASt3R collectively cited over 250 times in a short span, underscoring their impact. However, despite its accuracy, MASt3R's inference speed remains a bottleneck. On an A40 GPU, latency per image pair is 198.16 ms, mainly due to computational overhead from the ViT encoder-decoder and Fast Reciprocal Nearest Neighbor (FastNN) matching. To address this, we introduce Speedy MASt3R, a post-training optimization framework that enhances inference efficiency while maintaining accuracy. It integrates multiple optimization techniques, including FlashMatch-an approach leveraging FlashAttention v2 with tiling strategies for improved efficiency, computation graph optimization via layer and tensor fusion having kernel auto-tuning with TensorRT (GraphFusion), and a streamlined FastNN pipeline that reduces memory access time from quadratic to linear while accelerating block-wise correlation scoring through vectorized computation (FastNN-Lite). Additionally, it employs mixed-precision inference with FP16/FP32 hybrid computations (HybridCast), achieving speedup while preserving numerical precision. Evaluated on Aachen Day-Night, InLoc, 7-Scenes, ScanNet1500, and MegaDepth1500, Speedy MASt3R achieves a 54% reduction in inference time (198 ms to 91 ms per image pair) without sacrificing accuracy. This advancement enables real-time 3D understanding, benefiting applications like mixed reality navigation and large-scale 3D scene reconstruction.
Related papers
- Speedy-Splat: Fast 3D Gaussian Splatting with Sparse Pixels and Sparse Primitives [60.217580865237835]
3D Gaussian Splatting (3D-GS) is a recent 3D scene reconstruction technique that enables real-time rendering of novel views by modeling scenes as parametric point clouds of differentiable 3D Gaussians.<n>We identify and address two key inefficiencies in 3D-GS, achieving substantial improvements in rendering speed, model size, and training time.<n>Our Speedy-Splat approach combines these techniques to accelerate average rendering speed by a drastic $6.71times$ across scenes from the Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets with $10.6times$ fewer primitives than 3
arXiv Detail & Related papers (2024-11-30T20:25:56Z) - 3DGS-LM: Faster Gaussian-Splatting Optimization with Levenberg-Marquardt [65.25603275491544]
We present 3DGS-LM, a new method that accelerates the reconstruction of 3D Gaussian Splatting (3DGS)
Our method is 30% faster than the original 3DGS while obtaining the same reconstruction quality optimization.
arXiv Detail & Related papers (2024-09-19T16:31:44Z) - Grounding Image Matching in 3D with MASt3R [8.14650201701567]
We propose to cast matching as a 3D task with DUSt3R, a powerful 3D reconstruction framework based on Transformers.
We propose to augment the DUSt3R network with a new head that outputs dense local features, trained with an additional matching loss.
Our approach, coined MASt3R, significantly outperforms the state of the art on multiple matching tasks.
arXiv Detail & Related papers (2024-06-14T06:46:30Z) - Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like
Speed [42.861344584752]
Previous detector-free matcher LoFTR has shown remarkable matching capability in handling large-viewpoint change and texture-poor scenarios.
We revisit its design choices and derive multiple improvements for both efficiency and accuracy.
Our method can achieve higher accuracy compared with competitive semi-dense matchers.
arXiv Detail & Related papers (2024-03-07T18:58:40Z) - GPU optimization of the 3D Scale-invariant Feature Transform Algorithm
and a Novel BRIEF-inspired 3D Fast Descriptor [5.1537294207900715]
This work details a highly efficient implementation of the 3D scale-invariant feature transform (SIFT) algorithm, for the purpose of machine learning from large sets of medical image data.
The primary operations of the 3D SIFT code are implemented on a graphics processing unit (GPU), including convolution, sub-sampling, and 4D peak detection from scale-space pyramids.
The performance improvements are quantified in keypoint detection and image-to-image matching experiments, using 3D MRI human brain volumes of different people.
arXiv Detail & Related papers (2021-12-19T20:56:40Z) - Displacement-Invariant Cost Computation for Efficient Stereo Matching [122.94051630000934]
Deep learning methods have dominated stereo matching leaderboards by yielding unprecedented disparity accuracy.
But their inference time is typically slow, on the order of seconds for a pair of 540p images.
We propose a emphdisplacement-invariant cost module to compute the matching costs without needing a 4D feature volume.
arXiv Detail & Related papers (2020-12-01T23:58:16Z) - A Real-time Action Representation with Temporal Encoding and Deep
Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while obtaining a high process speed.
Our method achieves clear improvements on UCF101 action recognition benchmark against state-of-the-art real-time methods by 5.4% in terms of accuracy and 2 times faster in terms of inference speed with a less than 5MB storage model.
arXiv Detail & Related papers (2020-06-17T06:30:43Z) - Real-Time High-Performance Semantic Image Segmentation of Urban Street
Scenes [98.65457534223539]
We propose a real-time high-performance DCNN-based method for robust semantic segmentation of urban street scenes.
The proposed method achieves the accuracy of 73.6% and 68.0% mean Intersection over Union (mIoU) with the inference speed of 51.0 fps and 39.3 fps.
arXiv Detail & Related papers (2020-03-11T08:45:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.