Related papers: 3AM: 3egment Anything with Geometric Consistency in Videos

URL: http://arxiv.org/abs/2601.08831v2
Date: Sun, 18 Jan 2026 08:08:27 GMT
Title: 3AM: 3egment Anything with Geometric Consistency in Videos
Authors: Yang-Che Sun, Cheng Sun, Chin-Yang Lin, Fu-En Yang, Min-Hung Chen, Yen-Yu Lin, Yu-Lun Liu
Abstract summary: 3AM is a training-time enhancement that integrates 3D-aware features from MUSt3R into SAM2. Our method requires only RGB input at inference, with no camera poses or preprocessing. On challenging datasets with wide-baseline motion (ScanNet++, Replica), 3AM substantially outperforms SAM2 and extensions.
Score: 32.069894075133305
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video object segmentation methods like SAM2 achieve strong performance through memory-based architectures but struggle under large viewpoint changes due to reliance on appearance features. Traditional 3D instance segmentation methods address viewpoint consistency but require camera poses, depth maps, and expensive preprocessing. We introduce 3AM, a training-time enhancement that integrates 3D-aware features from MUSt3R into SAM2. Our lightweight Feature Merger fuses multi-level MUSt3R features that encode implicit geometric correspondence. Combined with SAM2's appearance features, the model achieves geometry-consistent recognition grounded in both spatial position and visual similarity. We propose a field-of-view aware sampling strategy ensuring frames observe spatially consistent object regions for reliable 3D correspondence learning. Critically, our method requires only RGB input at inference, with no camera poses or preprocessing. On challenging datasets with wide-baseline motion (ScanNet++, Replica), 3AM substantially outperforms SAM2 and extensions, achieving 90.6% IoU and 71.7% Positive IoU on ScanNet++'s Selected Subset, improving over state-of-the-art VOS methods by +15.9 and +30.4 points. Project page: https://jayisaking.github.io/3AM-Page/
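For readers unfamiliar with the headline metric, the following is a minimal sketch of how intersection over union (IoU) between a predicted mask and a ground-truth mask is commonly computed. It is illustrative only and is not taken from the 3AM code: the function name `mask_iou`, the nested-list mask format, and the convention of scoring two empty masks as 1.0 are all assumptions made here. The paper's "Positive IoU" variant is not defined in the abstract and is not reproduced.

```python
def mask_iou(pred, target):
    """IoU of two binary masks given as equally sized nested lists of 0/1.

    Hypothetical helper for illustration; not part of any 3AM or SAM2 API.
    """
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1  # pixel is foreground in both masks
            if p or t:
                union += 1  # pixel is foreground in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(mask_iou(pred, target))  # 2 shared pixels out of 4 covered -> 0.5
```

A benchmark score such as the reported 90.6% would typically be an average of per-frame or per-object values like this one, though the exact averaging protocol is not stated in the abstract.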

Related papers

MV-SAM: Multi-view Promptable Segmentation using Pointmap Guidance [79.57732829495843]
We introduce MV-SAM, a framework for multi-view segmentation that achieves 3D consistency using pointmaps. MV-SAM lifts images and prompts into 3D space, eliminating the need for explicit 3D networks or annotated 3D data.
arXiv Detail & Related papers (2026-01-25T15:00:37Z)
SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models [13.88629412035865]
Large vision-language models (VLMs) show strong multimodal understanding but still struggle with 3D spatial reasoning. We propose SpaceMind, a multimodal large language model explicitly designed for spatial reasoning solely from RGB inputs.
arXiv Detail & Related papers (2025-11-28T11:04:21Z)
SegMASt3R: Geometry Grounded Segment Matching [23.257530861472656]
We leverage the spatial understanding of 3D foundation models to tackle wide-baseline segment matching. We propose an architecture that uses the inductive bias of these 3D foundation models to match segments across image pairs with up to 180 degrees of viewpoint rotation.
arXiv Detail & Related papers (2025-10-06T17:31:32Z)
ViSTA-SLAM: Visual SLAM with Symmetric Two-view Association [52.34293412010292]
ViSTA-SLAM is a real-time monocular visual SLAM system that operates without requiring camera intrinsics. Our approach achieves superior performance in both camera tracking and dense 3D reconstruction quality.
arXiv Detail & Related papers (2025-09-01T16:12:23Z)
GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation [81.0871900167463]
We introduce GeoSAM2, a prompt-controllable framework for 3D part segmentation. Given a textureless object, we render normal and point maps from predefined viewpoints. We accept simple 2D prompts (clicks or boxes) to guide part selection. The predicted masks are back-projected to the object and aggregated across views.
arXiv Detail & Related papers (2025-08-19T17:58:51Z)
SAI3D: Segment Any Instance in 3D Scenes [68.57002591841034]
We introduce SAI3D, a novel zero-shot 3D instance segmentation approach. Our method partitions a 3D scene into geometric primitives, which are then progressively merged into 3D instance segmentations. Empirical evaluations on ScanNet, Matterport3D and the more challenging ScanNet++ datasets demonstrate the superiority of our approach.
arXiv Detail & Related papers (2023-12-17T09:05:47Z)
BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown Objects [89.2314092102403]
We present a near real-time method for 6-DoF tracking of an unknown object from a monocular RGBD video sequence. Our method works for arbitrary rigid objects, even when visual texture is largely absent.
arXiv Detail & Related papers (2023-03-24T17:13:49Z)
LFM-3D: Learnable Feature Matching Across Wide Baselines Using 3D Signals [9.201550006194994]
Learnable matchers often underperform when there exist only small regions of co-visibility between image pairs. We propose LFM-3D, a Learnable Feature Matching framework that uses models based on graph neural networks. We show that the resulting improved correspondences lead to much higher relative posing accuracy for in-the-wild image pairs.
arXiv Detail & Related papers (2023-03-22T17:46:27Z)
DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection [83.18142309597984]
Lidars and cameras are critical sensors that provide complementary information for 3D detection in autonomous driving. We develop a family of generic multi-modal 3D detection models named DeepFusion, which is more accurate than previous methods.
arXiv Detail & Related papers (2022-03-15T18:46:06Z)
Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras. We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera viewpoints. Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2D detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.