Related papers: Neural Assembler: Learning to Generate Fine-Grained Robotic Assembly Instructions from Multi-View Images

Neural Assembler: Learning to Generate Fine-Grained Robotic Assembly Instructions from Multi-View Images

URL: http://arxiv.org/abs/2404.16423v1
Date: Thu, 25 Apr 2024 08:53:23 GMT
Title: Neural Assembler: Learning to Generate Fine-Grained Robotic Assembly Instructions from Multi-View Images
Authors: Hongyu Yan, Yadong Mu,
Abstract summary: This paper introduces a novel task: translating multi-view images of a structural 3D model into a detailed sequence of assembly instructions. We propose an end-to-end model known as the Neural Assembler.
Score: 24.10809783713574
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Image-guided object assembly represents a burgeoning research topic in computer vision. This paper introduces a novel task: translating multi-view images of a structural 3D model (for example, one constructed with building blocks drawn from a 3D-object library) into a detailed sequence of assembly instructions executable by a robotic arm. Fed with multi-view images of the target 3D model for replication, the model designed for this task must address several sub-tasks, including recognizing individual components used in constructing the 3D model, estimating the geometric pose of each component, and deducing a feasible assembly order adhering to physical rules. Establishing accurate 2D-3D correspondence between multi-view images and 3D objects is technically challenging. To tackle this, we propose an end-to-end model known as the Neural Assembler. This model learns an object graph where each vertex represents recognized components from the images, and the edges specify the topology of the 3D model, enabling the derivation of an assembly plan. We establish benchmarks for this task and conduct comprehensive empirical evaluations of Neural Assembler and alternative solutions. Our experiments clearly demonstrate the superiority of Neural Assembler.

Related papers

Assembler: Scalable 3D Part Assembly via Anchor Point Diffusion [39.08891847512135]
We present Assembler, a scalable and generalizable framework for 3D part assembly.<n>It handles diverse, in-the-wild objects with varying part counts, geometries, and structures.<n>It achieves state-of-the-art performance on PartNet and is the first to demonstrate high-quality assembly for complex, real-world objects.
arXiv Detail & Related papers (2025-06-20T15:25:20Z)
Object-X: Learning to Reconstruct Multi-Modal 3D Object Representations [112.29763628638112]
Object-X is a versatile multi-modal 3D representation framework.<n>It can encoding rich object embeddings and decoding them back into geometric and visual reconstructions.<n>It supports a range of downstream tasks, including scene alignment, single-image 3D object reconstruction, and localization.
arXiv Detail & Related papers (2025-06-05T09:14:42Z)
PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models [63.1432721793683]
We introduce PartGen, a novel approach that generates 3D objects composed of meaningful parts starting from text, an image, or an unstructured 3D object. We evaluate our method on generated and real 3D assets and show that it outperforms segmentation and part-extraction baselines by a large margin.
arXiv Detail & Related papers (2024-12-24T18:59:43Z)
SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR. SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds. We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z)
ComboVerse: Compositional 3D Assets Creation Using Spatially-Aware Diffusion Guidance [76.7746870349809]
We present ComboVerse, a 3D generation framework that produces high-quality 3D assets with complex compositions by learning to combine multiple models. Our proposed framework emphasizes spatial alignment of objects, compared with standard score distillation sampling.
arXiv Detail & Related papers (2024-03-19T03:39:43Z)
Anything-3D: Towards Single-view Anything Reconstruction in the Wild [61.090129285205805]
We introduce Anything-3D, a methodical framework that ingeniously combines a series of visual-language models and the Segment-Anything object segmentation model. Our approach employs a BLIP model to generate textural descriptions, utilize the Segment-Anything model for the effective extraction of objects of interest, and leverages a text-to-image diffusion model to lift object into a neural radiance field.
arXiv Detail & Related papers (2023-04-19T16:39:51Z)
A Simple Baseline for Multi-Camera 3D Object Detection [94.63944826540491]
3D object detection with surrounding cameras has been a promising direction for autonomous driving. We present SimMOD, a Simple baseline for Multi-camera Object Detection. We conduct extensive experiments on the 3D object detection benchmark of nuScenes to demonstrate the effectiveness of SimMOD.
arXiv Detail & Related papers (2022-08-22T03:38:01Z)
Translating a Visual LEGO Manual to a Machine-Executable Plan [26.0127179598152]
We study the problem of translating an image-based, step-by-step assembly manual created by human designers into machine-interpretable instructions. We present a novel learning-based framework, the Manual-to-Executable-Plan Network (MEPNet), which reconstructs the assembly steps from a sequence of manual images.
arXiv Detail & Related papers (2022-07-25T23:35:46Z)
Mask2CAD: 3D Shape Prediction by Learning to Segment and Retrieve [54.054575408582565]
We propose to leverage existing large-scale datasets of 3D models to understand the underlying 3D structure of objects seen in an image. We present Mask2CAD, which jointly detects objects in real-world images and for each detected object, optimize for the most similar CAD model and its pose. This produces a clean, lightweight representation of the objects in an image.
arXiv Detail & Related papers (2020-07-26T00:08:37Z)
Generative 3D Part Assembly via Dynamic Graph Learning [34.108515032411695]
Part assembly is a challenging yet crucial task in 3D computer vision and robotics. We propose an assembly-oriented dynamic graph learning framework that leverages an iterative graph neural network as a backbone.
arXiv Detail & Related papers (2020-06-14T04:26:42Z)
Learning Unsupervised Hierarchical Part Decomposition of 3D Objects from a Single RGB Image [102.44347847154867]
We propose a novel formulation that allows to jointly recover the geometry of a 3D object as a set of primitives. Our model recovers the higher level structural decomposition of various objects in the form of a binary tree of primitives. Our experiments on the ShapeNet and D-FAUST datasets demonstrate that considering the organization of parts indeed facilitates reasoning about 3D geometry.
arXiv Detail & Related papers (2020-04-02T17:58:05Z)
Learning 3D Part Assembly from a Single Image [20.175502864488493]
We introduce a novel problem, single-image-guided 3D part assembly, along with a learningbased solution. We study this problem in the setting of furniture assembly from a given complete set of parts and a single image depicting the entire assembled object.
arXiv Detail & Related papers (2020-03-21T21:19:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.