MVTOP: Multi-View Transformer-based Object Pose-Estimation
- URL: http://arxiv.org/abs/2508.03243v1
- Date: Tue, 05 Aug 2025 09:21:14 GMT
- Title: MVTOP: Multi-View Transformer-based Object Pose-Estimation
- Authors: Lukas Ranftl, Felix Brendel, Bertram Drost, Carsten Steger
- Abstract summary: We present MVTOP, a novel transformer-based method for multi-view rigid object pose estimation. Our method can resolve pose ambiguities that would be impossible to solve with a single view or with a post-processing of single-view poses.
- Score: 4.485458895311131
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present MVTOP, a novel transformer-based method for multi-view rigid object pose estimation. Through an early fusion of the view-specific features, our method can resolve pose ambiguities that would be impossible to solve with a single view or with post-processing of single-view poses. MVTOP models the multi-view geometry via lines of sight that emanate from the respective camera centers. While the method assumes that the interior and relative orientations of the cameras are known for a particular scene, they can vary for each inference, which makes the method versatile. The lines of sight enable MVTOP to predict the correct pose from the merged multi-view information. To demonstrate the model's capabilities, we provide a synthetic dataset that can only be solved with such a holistic multi-view approach, since its poses cannot be determined from any single view. Our method outperforms single-view and all existing multi-view approaches on our dataset and achieves competitive results on the YCB-V dataset. To the best of our knowledge, no other holistic multi-view method exists that can resolve such pose ambiguities reliably. Our model is end-to-end trainable and does not require any additional data, e.g., depth.
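As an illustration of the line-of-sight idea described in the abstract, the following is a minimal, hypothetical PyTorch sketch (not the authors' implementation): each feature-map pixel is back-projected to a viewing ray defined by the camera center and a unit direction, a learned embedding of that ray is added to the per-view features, and the tokens of all views are fused in a single transformer encoder before one pose is regressed. All names (`lines_of_sight`, `EarlyFusionPoseHead`), dimensions, and the quaternion-plus-translation output are assumptions made for the example.

```python
import torch
import torch.nn as nn

def lines_of_sight(K, R, t, H, W):
    """Per-pixel viewing rays of one camera in world coordinates.

    K: (3, 3) intrinsics (scaled to the H x W feature map),
    R: (3, 3), t: (3,) world-to-camera extrinsics.
    Returns the camera center (3,) and unit ray directions (H*W, 3).
    """
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).reshape(-1, 3)
    dirs_cam = pix @ torch.linalg.inv(K).T           # back-project pixel coordinates
    dirs_world = dirs_cam @ R                        # row form of R^T @ d
    dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)
    center = -R.T @ t                                # camera center in world frame
    return center, dirs_world

class EarlyFusionPoseHead(nn.Module):
    """Fuse ray-augmented tokens of all views in one transformer encoder and
    regress a single rigid pose (hypothetical architecture, not MVTOP's)."""

    def __init__(self, feat_dim=256, ray_dim=6):
        super().__init__()
        self.ray_proj = nn.Linear(ray_dim, feat_dim)
        layer = nn.TransformerEncoderLayer(feat_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.pose_head = nn.Linear(feat_dim, 7)      # e.g. quaternion + translation

    def forward(self, feats, centers, dirs):
        # feats: list of (H*W, feat_dim) per-view features; centers/dirs: ray data.
        tokens = []
        for f, c, d in zip(feats, centers, dirs):
            rays = torch.cat([c.expand_as(d), d], dim=-1)    # (H*W, 6) line of sight
            tokens.append(f + self.ray_proj(rays))           # early per-token fusion
        tokens = torch.cat(tokens, dim=0).unsqueeze(0)       # one joint token set
        fused = self.encoder(tokens)
        return self.pose_head(fused.mean(dim=1))             # single object pose
```

Because the ray embeddings carry each camera's interior and relative orientation, the same weights can in principle be applied to scenes whose camera configuration changes between inferences, which is the flexibility the abstract emphasizes.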
Related papers
- One2Any: One-Reference 6D Pose Estimation for Any Object [98.50085481362808]
6D object pose estimation remains challenging for many applications due to dependencies on complete 3D models, multi-view images, or training limited to specific object categories. We propose a novel method, One2Any, that estimates the relative 6-degrees-of-freedom (DOF) object pose using only a single reference and a single query RGB-D image. Experiments on multiple benchmark datasets demonstrate that our model generalizes well to novel objects, achieving state-of-the-art accuracy and even rivaling methods that require multi-view or CAD inputs, at a fraction of the compute.
arXiv Detail & Related papers (2025-05-07T03:54:59Z)
- A Global Depth-Range-Free Multi-View Stereo Transformer Network with Pose Embedding [76.44979557843367]
We propose a novel multi-view stereo (MVS) framework that does not require a depth-range prior. We introduce a Multi-view Disparity Attention (MDA) module to aggregate long-range context information. We explicitly estimate the quality of the current pixel corresponding to sampled points on the epipolar line of the source image.
arXiv Detail & Related papers (2024-11-04T08:50:16Z)
- Human Mesh Recovery from Arbitrary Multi-view Images [57.969696744428475]
We propose a divide and conquer framework for Unified Human Mesh Recovery (U-HMR) from arbitrary multi-view images.
In particular, U-HMR has a decoupled structure with three main components: camera and body decoupling (CBD), camera pose estimation (CPE), and arbitrary view fusion (AVF).
We conduct extensive experiments on three public datasets: Human3.6M, MPI-INF-3DHP, and TotalCapture.
arXiv Detail & Related papers (2024-03-19T04:47:56Z)
- MVMO: A Multi-Object Dataset for Wide Baseline Multi-View Semantic Segmentation [34.88648947680952]
We present MVMO (Multi-View, Multi-Object dataset): a synthetic dataset of 116,000 scenes containing randomly placed objects of 10 distinct classes.
MVMO comprises photorealistic, path-traced image renders, together with semantic segmentation ground truth for every view.
arXiv Detail & Related papers (2022-05-30T22:37:43Z)
- VoRTX: Volumetric 3D Reconstruction With Transformers for Voxelwise View Selection and Fusion [68.68537312256144]
VoRTX is an end-to-end volumetric 3D reconstruction network using transformers for wide-baseline, multi-view feature fusion.
We train our model on ScanNet and show that it produces better reconstructions than state-of-the-art methods.
arXiv Detail & Related papers (2021-12-01T02:18:11Z)
- Multiview Detection with Shadow Transformer (and View-Coherent Data Augmentation) [25.598840284457548]
We propose a novel multiview detector, MVDeTr, that adopts a shadow transformer to aggregate multiview information.
Unlike convolutions, shadow transformer attends differently at different positions and cameras to deal with various shadow-like distortions.
We report new state-of-the-art accuracy with the proposed system.
arXiv Detail & Related papers (2021-08-12T17:59:02Z)
- Learning Implicit 3D Representations of Dressed Humans from Sparse Views [31.584157304372425]
We propose an end-to-end approach that learns an implicit 3D representation of dressed humans from sparse camera views.
In the experiments, we show the proposed approach outperforms the state of the art on standard data both quantitatively and qualitatively.
arXiv Detail & Related papers (2021-04-16T10:20:26Z)
- Wide-Area Crowd Counting: Multi-View Fusion Networks for Counting in Large Scenes [50.744452135300115]
We propose a deep neural network framework for multi-view crowd counting.
Our methods achieve state-of-the-art results compared to other multi-view counting baselines.
arXiv Detail & Related papers (2020-12-02T03:20:30Z)
- Multiview Detection with Feature Perspective Transformation [59.34619548026885]
We propose a novel multiview detection system, MVDet.
We take an anchor-free approach to aggregate multiview information by projecting feature maps onto the ground plane.
Our entire model is end-to-end learnable and achieves 88.2% MODA on the standard Wildtrack dataset.
arXiv Detail & Related papers (2020-07-14T17:58:30Z)
- Multi-view Low-rank Preserving Embedding: A Novel Method for Multi-view Representation [11.91574721055601]
This paper proposes a novel multi-view learning method, named Multi-view Low-rank Preserving Embedding (MvLPE).
It integrates different views into one centroid view by minimizing a disagreement term based on the distance or similarity matrix among instances.
Experiments on six benchmark datasets demonstrate that the proposed method outperforms its counterparts.
arXiv Detail & Related papers (2020-06-14T12:47:25Z)
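Several of the related papers above (notably MVDet and the multi-view crowd-counting work) aggregate views by projecting per-camera feature maps onto a common ground plane. The snippet below is a minimal, hypothetical sketch of that projection, not code from either paper; it assumes PyTorch, a pinhole camera whose intrinsics K are already scaled to the feature-map resolution, world-to-camera extrinsics (R, t), and an arbitrary bird's-eye-view grid size.

```python
import torch
import torch.nn.functional as F

def project_to_ground_plane(feat, K, R, t, grid_hw=(120, 360), cell_size=0.025):
    """Resample one view's image features onto a bird's-eye-view grid on z = 0.

    feat: (1, C, Hf, Wf) per-view feature map
    K:    (3, 3) intrinsics (scaled to Hf x Wf)
    R, t: (3, 3), (3,) world-to-camera rotation and translation
    Returns a (1, C, Hg, Wg) ground-plane feature map for this view.
    """
    Hg, Wg = grid_hw
    _, _, Hf, Wf = feat.shape

    # World coordinates of every ground-grid cell (z = 0).
    ys, xs = torch.meshgrid(
        torch.arange(Hg, dtype=torch.float32) * cell_size,
        torch.arange(Wg, dtype=torch.float32) * cell_size,
        indexing="ij",
    )
    pts_world = torch.stack([xs, ys, torch.zeros_like(xs)], dim=-1).reshape(-1, 3)

    # Project each ground point into the image: x_cam = R x_world + t, then K.
    pts_cam = pts_world @ R.T + t
    uvw = pts_cam @ K.T
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)    # perspective divide

    # Normalize pixel coordinates to [-1, 1] and sample the feature map.
    grid = torch.stack(
        [uv[:, 0] / (Wf - 1) * 2 - 1, uv[:, 1] / (Hf - 1) * 2 - 1], dim=-1
    ).reshape(1, Hg, Wg, 2)
    # Cells projecting outside the image (or far behind the camera) receive zeros.
    return F.grid_sample(feat, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)
```

Stacking such projected maps from all cameras and running a detector or transformer on the stacked bird's-eye-view features is, roughly, the aggregation pattern those entries describe.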
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all listed content) and is not responsible for any consequences arising from its use.