MOLTR: Multiple Object Localisation, Tracking, and Reconstruction from
Monocular RGB Videos
- URL: http://arxiv.org/abs/2012.05360v2
- Date: Mon, 15 Feb 2021 03:12:24 GMT
- Title: MOLTR: Multiple Object Localisation, Tracking, and Reconstruction from
Monocular RGB Videos
- Authors: Kejie Li, Hamid Rezatofighi, Ian Reid
- Abstract summary: MOLTR is a solution to object-centric mapping using only monocular image sequences and camera poses.
It is able to localise, track, and reconstruct multiple objects in an online fashion as an RGB camera captures a video of its surroundings.
We evaluate localisation, tracking, and reconstruction on benchmark datasets for indoor and outdoor scenes.
- Score: 30.541606989348377
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semantic-aware reconstruction is more advantageous than geometric-only
reconstruction for future robotic and AR/VR applications because it represents
not only where things are, but also what things are. Object-centric mapping is
the task of building an object-level reconstruction in which objects are separate,
meaningful entities that convey both geometry and semantic information. In this
paper, we present MOLTR, a solution to object-centric mapping using only
monocular image sequences and camera poses. It is able to localise, track, and
reconstruct multiple objects in an online fashion as an RGB camera captures a
video of its surroundings. Given a new RGB frame, MOLTR first applies a
monocular 3D detector to localise objects of interest and extract their shape
codes, which represent the object shapes in a learned embedding space. Detections
are then merged into existing objects in the map after data association. The motion
state (i.e. kinematics and motion status) of each object is tracked by a
multiple-model Bayesian filter, and the object shape is progressively refined by
fusing multiple shape codes. We evaluate localisation, tracking, and
reconstruction on benchmark datasets for indoor and outdoor scenes, and show
superior performance over previous approaches.
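To make the pipeline concrete, the per-frame update described above can be sketched roughly as follows. This is a minimal illustration only: the Detection and MapObject containers, the associate and filter_update callables, and the shape fusion by averaging are assumptions made for this sketch, not MOLTR's actual interfaces.

from dataclasses import dataclass, field
import numpy as np


@dataclass
class Detection:
    box_3d: np.ndarray       # hypothetical 9-DoF box: centre (3), size (3), rotation (3)
    shape_code: np.ndarray   # latent code in the learned shape-embedding space


@dataclass
class MapObject:
    state: np.ndarray        # kinematic state, e.g. 3D position and velocity
    covariance: np.ndarray
    shape_codes: list = field(default_factory=list)

    def fused_shape_code(self) -> np.ndarray:
        # Progressive shape refinement by averaging all codes observed so far;
        # one simple fusion rule, not necessarily the one used in the paper.
        return np.mean(np.stack(self.shape_codes), axis=0)


def update_map(objects, detections, associate, filter_update):
    """One online step: associate detections with mapped objects, update each
    object's motion state, and fuse the new shape code. filter_update stands in
    for the multiple-model Bayesian filter (e.g. mixing static and moving models)."""
    matches, unmatched = associate(objects, detections)
    for obj, det in matches:
        obj.state, obj.covariance = filter_update(obj.state, obj.covariance, det.box_3d[:3])
        obj.shape_codes.append(det.shape_code)
    for det in unmatched:  # detections with no match spawn new objects in the map
        objects.append(MapObject(state=np.concatenate([det.box_3d[:3], np.zeros(3)]),
                                 covariance=np.eye(6),
                                 shape_codes=[det.shape_code]))
    return objects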
Related papers
- Real2Code: Reconstruct Articulated Objects via Code Generation [22.833809817357395]
Real2Code is a novel approach to reconstructing articulated objects via code generation.
We first reconstruct the object's part geometry using an image segmentation model and a shape completion model.
We represent the object parts with oriented bounding boxes, which are input to a fine-tuned large language model to predict joint articulation as code.
arXiv Detail & Related papers (2024-06-12T17:57:06Z)
- Reconstructing Hand-Held Objects in 3D [53.277402172488735]
We present a paradigm for handheld object reconstruction that builds on recent breakthroughs in large language/vision models and 3D object datasets.
We use GPT-4(V) to retrieve a 3D object model that matches the object in the image and rigidly align the model to the network-inferred geometry.
Experiments demonstrate that the resulting model, MCC-HO, achieves state-of-the-art performance on lab and Internet datasets.
arXiv Detail & Related papers (2024-04-09T17:55:41Z)
- Anything-3D: Towards Single-view Anything Reconstruction in the Wild [61.090129285205805]
We introduce Anything-3D, a methodical framework that ingeniously combines a series of visual-language models and the Segment-Anything object segmentation model.
Our approach employs a BLIP model to generate textual descriptions, utilizes the Segment-Anything model for the effective extraction of objects of interest, and leverages a text-to-image diffusion model to lift the object into a neural radiance field.
arXiv Detail & Related papers (2023-04-19T16:39:51Z)
- Single-view 3D Mesh Reconstruction for Seen and Unseen Categories [69.29406107513621]
Single-view 3D Mesh Reconstruction is a fundamental computer vision task that aims at recovering 3D shapes from single-view RGB images.
This paper tackles Single-view 3D Mesh Reconstruction to study model generalization to unseen categories.
We propose an end-to-end two-stage network, GenMesh, to break the category boundaries in reconstruction.
arXiv Detail & Related papers (2022-08-04T14:13:35Z)
- TSDF++: A Multi-Object Formulation for Dynamic Object Tracking and Reconstruction [57.1209039399599]
We propose a map representation that allows maintaining a single volume for the entire scene and all the objects therein.
In a multiple dynamic object tracking and reconstruction scenario, our representation allows maintaining accurate reconstruction of surfaces even while they become temporarily occluded by other objects moving in their proximity.
We evaluate the proposed TSDF++ formulation on a public synthetic dataset and demonstrate its ability to preserve reconstructions of occluded surfaces when compared to the standard TSDF map representation.
arXiv Detail & Related papers (2021-05-16T16:15:05Z)
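For context, the standard per-voxel TSDF fusion that TSDF++ compares against can be sketched as a weighted running average of truncated signed distances. The multi-object bookkeeping TSDF++ adds on top is not reproduced here, and the array layout is an assumption of this sketch.

import numpy as np


def fuse_tsdf(tsdf, weights, new_sdf, new_weight=1.0, trunc=0.05):
    """Update a TSDF volume with one frame of observations.
    tsdf, weights: (X, Y, Z) arrays; new_sdf: signed distances for the same
    voxels observed in the current frame, with np.nan where unobserved."""
    observed = ~np.isnan(new_sdf)
    d = np.clip(new_sdf[observed], -trunc, trunc)        # truncate the signed distance
    w_old = weights[observed]
    tsdf[observed] = (w_old * tsdf[observed] + new_weight * d) / (w_old + new_weight)
    weights[observed] = w_old + new_weight               # accumulate observation confidence
    return tsdf, weights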
- Unsupervised Learning of 3D Object Categories from Videos in the Wild [75.09720013151247]
We focus on learning a model from multiple views of a large collection of object instances.
We propose a new neural network design, called warp-conditioned ray embedding (WCR), which significantly improves reconstruction.
Our evaluation demonstrates performance improvements over several deep monocular reconstruction baselines on existing benchmarks.
arXiv Detail & Related papers (2021-03-30T17:57:01Z)
- From Points to Multi-Object 3D Reconstruction [71.17445805257196]
We propose a method to detect and reconstruct multiple 3D objects from a single RGB image.
A keypoint detector localizes objects as center points and directly predicts all object properties, including 9-DoF bounding boxes and 3D shapes.
The presented approach performs lightweight reconstruction in a single stage; it is real-time capable, fully differentiable, and end-to-end trainable.
arXiv Detail & Related papers (2020-12-21T18:52:21Z)
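A rough sketch of the centre-point decoding idea summarised above: objects are read off as local maxima of a centre heatmap, and per-object properties (a 9-DoF box and a shape code here) are gathered at each peak. The map names, shapes, and threshold are illustrative assumptions, not the paper's interface.

import numpy as np
from scipy.ndimage import maximum_filter


def decode_centres(heatmap, box_map, shape_map, score_thresh=0.3):
    """heatmap: (H, W) centre scores; box_map: (H, W, 9) box parameters;
    shape_map: (H, W, D) latent shape codes. Returns one prediction per peak."""
    # Keep only local maxima above the threshold (a cheap non-maximum suppression).
    peaks = (heatmap == maximum_filter(heatmap, size=3)) & (heatmap > score_thresh)
    ys, xs = np.nonzero(peaks)
    return [{
        "score": float(heatmap[y, x]),
        "box_9dof": box_map[y, x],       # centre, size, and rotation parameters
        "shape_code": shape_map[y, x],   # to be decoded into a 3D shape downstream
    } for y, x in zip(ys, xs)]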
- FroDO: From Detections to 3D Objects [29.10716046157072]
FroDO is a method for accurate 3D reconstruction of object instances from RGB video.
It infers object location, pose and shape in a coarse-to-fine manner.
We evaluate on real-world datasets, including Pix3D, Redwood-OS, and ScanNet.
arXiv Detail & Related papers (2020-05-11T14:08:29Z)
- CoReNet: Coherent 3D scene reconstruction from a single RGB image [43.74240268086773]
We build on advances in deep learning to reconstruct the shape of a single object given only one RGB image as input.
We propose three extensions: (1) ray-traced skip connections that propagate local 2D information to the output 3D volume in a physically correct manner; (2) a hybrid 3D volume representation that enables building translation equivariant models; and (3) a reconstruction loss tailored to capture overall object geometry.
We reconstruct all objects jointly in one pass, producing a coherent reconstruction, where all objects live in a single consistent 3D coordinate frame relative to the camera and they do not intersect in 3D space.
arXiv Detail & Related papers (2020-04-27T17:53:07Z)
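As a generic illustration of the kind of volumetric reconstruction objective such single-image methods optimise (not CoReNet's exact formulation), a weighted voxel-occupancy cross-entropy can be written as below; the up-weighting of occupied voxels is an assumption of this sketch.

import numpy as np


def occupancy_loss(pred_prob, target_occ, surface_weight=5.0, eps=1e-7):
    """pred_prob, target_occ: (X, Y, Z) arrays with values in [0, 1]."""
    pred = np.clip(pred_prob, eps, 1.0 - eps)
    bce = -(target_occ * np.log(pred) + (1.0 - target_occ) * np.log(1.0 - pred))
    # Up-weight occupied voxels so thin structures are not drowned out by the
    # empty space that dominates the grid.
    weights = np.where(target_occ > 0.5, surface_weight, 1.0)
    return float(np.mean(weights * bce))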
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.