MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes
- URL: http://arxiv.org/abs/2203.05203v1
- Date: Thu, 10 Mar 2022 07:26:15 GMT
- Title: MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes
- Authors: Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang
- Abstract summary: Existing methods only treat such relations as by-products of object feature learning in graphs without specifically encoding them.
We propose MORE, a Multi-Order RElation mining model, to support generating more descriptive and comprehensive captions.
Our MORE encodes object relations in a progressive manner since complex relations can be deduced from a limited number of basic ones.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D dense captioning is a recently proposed task, where point clouds
contain more geometric information than their 2D counterparts. However, it is also
more challenging due to the higher complexity and wider variety of inter-object
relations. Existing methods only treat such relations as by-products of object
feature learning in graphs without specifically encoding them, which leads to
sub-optimal results. In this paper, aiming at improving 3D dense captioning via
capturing and utilizing the complex relations in the 3D scene, we propose MORE,
a Multi-Order RElation mining model, to support generating more descriptive and
comprehensive captions. Technically, our MORE encodes object relations in a
progressive manner since complex relations can be deduced from a limited number
of basic ones. We first devise a novel Spatial Layout Graph Convolution (SLGC),
which semantically encodes several first-order relations as edges of a graph
constructed over 3D object proposals. Next, from the resulting graph, we
further extract multiple triplets which encapsulate basic first-order relations
as the basic unit and construct several Object-centric Triplet Attention Graphs
(OTAG) to infer multi-order relations for every target object. The updated node
features from OTAG are aggregated and fed into the caption decoder to provide
abundant relational cues so that captions including diverse relations with
context objects can be generated. Extensive experiments on the Scan2Cap dataset
prove the effectiveness of our proposed MORE and its components, and we also
outperform the current state-of-the-art method.
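To make the progressive encoding concrete, the sketch below illustrates the kind of first-order spatial relations that SLGC attaches as semantic edges of a graph over 3D object proposals. The relation labels, thresholds, and box representation are illustrative assumptions, not the paper's actual definitions:

```python
# Hypothetical sketch of first-order spatial relations attached as graph edges,
# in the spirit of MORE's Spatial Layout Graph Convolution (SLGC).
# Labels, thresholds, and the center-only box summary are assumptions.
from dataclasses import dataclass

@dataclass
class Proposal:
    """A 3D object proposal summarized by its box center (z is up)."""
    cx: float
    cy: float
    cz: float

def first_order_relation(a: Proposal, b: Proposal, near_thresh: float = 1.0) -> str:
    """Label the directed edge a -> b with one basic spatial relation."""
    dx, dy, dz = b.cx - a.cx, b.cy - a.cy, b.cz - a.cz
    if abs(dz) > max(abs(dx), abs(dy)):      # mostly vertical offset
        return "above" if dz > 0 else "below"
    if (dx * dx + dy * dy) ** 0.5 < near_thresh:  # small planar distance
        return "near"
    return "right of" if dx > 0 else "left of"

def build_edges(proposals):
    """Construct the labeled edges of a spatial layout graph over all proposal pairs."""
    return {
        (i, j): first_order_relation(pi, pj)
        for i, pi in enumerate(proposals)
        for j, pj in enumerate(proposals)
        if i != j
    }
```

In the paper, such first-order edges are then composed: triplets extracted from this graph serve as the basic units from which the Object-centric Triplet Attention Graphs infer higher-order relations.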
Related papers
- Beyond Bare Queries: Open-Vocabulary Object Grounding with 3D Scene Graph
We propose a modular approach called BBQ that constructs 3D scene graph representation with metric and semantic edges.
BBQ employs robust DINO-powered associations to construct a 3D object-centric map.
We show that BBQ takes a leading place in open-vocabulary 3D semantic segmentation compared to other zero-shot methods.
arXiv Detail & Related papers (2024-06-11T09:57:04Z)
- TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes
We present a novel framework, dubbed TeMO, to parse multi-object 3D scenes and edit their styles.
Our method can synthesize high-quality stylized content and outperform the existing methods over a wide range of multi-object 3D meshes.
arXiv Detail & Related papers (2023-12-07T12:10:05Z)
- GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs
We propose a novel framework to generate compositional 3D scenes from scene graphs.
By exploiting node and edge information in scene graphs, our method makes better use of the pretrained text-to-image diffusion model.
We conduct both qualitative and quantitative experiments to validate the effectiveness of GraphDreamer.
arXiv Detail & Related papers (2023-11-30T18:59:58Z)
- Enhancing Scene Graph Generation with Hierarchical Relationships and Commonsense Knowledge
This work introduces an enhanced approach to scene graph generation that exploits both a relationship hierarchy and commonsense knowledge.
We implement a robust commonsense validation pipeline that harnesses foundation models to critique the results from the scene graph prediction system.
Experiments on Visual Genome and OpenImage V6 datasets demonstrate that the proposed modules can be seamlessly integrated as plug-and-play enhancements to existing scene graph generation algorithms.
arXiv Detail & Related papers (2023-11-21T06:03:20Z)
- Explore Contextual Information for 3D Scene Graph Generation
3D scene graph generation (SGG) has been of high interest in computer vision.
We propose a framework fully exploring contextual information for the 3D SGG task.
Our approach achieves superior or competitive performance over previous methods on the 3DSSG dataset.
arXiv Detail & Related papers (2022-10-12T14:26:17Z)
- PolarMOT: How Far Can Geometric Relations Take Us in 3D Multi-Object Tracking?
We encode 3D detections as nodes in a graph, where spatial and temporal pairwise relations among objects are encoded via localized polar coordinates on graph edges.
This allows our graph neural network to learn to effectively encode temporal and spatial interactions.
We establish a new state-of-the-art on nuScenes dataset and, more importantly, show that our method, PolarMOT, generalizes remarkably well across different locations.
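The core idea of encoding pairwise relations in localized polar coordinates can be sketched as follows. This is a minimal illustration, not PolarMOT's actual feature set (the paper also handles temporal offsets and a full graph neural network on top):

```python
# Minimal sketch of a PolarMOT-style edge feature: express one detection's
# position in another detection's local polar frame (range, bearing).
# The pose format (x, y, yaw) and two-feature edge are simplifying assumptions.
import math

def polar_edge_feature(src, dst):
    """Return (range, bearing) of dst relative to src's position and heading.

    src, dst: (x, y, yaw) ground-plane poses. The bearing is wrapped to
    [-pi, pi), so it is invariant to the global frame and depends only on
    the pair's relative geometry -- the property that helps generalization.
    """
    sx, sy, syaw = src
    dx, dy = dst[0] - sx, dst[1] - sy
    rng = math.hypot(dx, dy)                       # Euclidean distance
    bearing = math.atan2(dy, dx) - syaw            # angle relative to src heading
    bearing = (bearing + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi)
    return rng, bearing
```

Because the feature is expressed relative to each node's own frame rather than in global coordinates, the same learned edge function applies regardless of where in the world the scene was recorded.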
arXiv Detail & Related papers (2022-08-03T10:06:56Z)
- Semantic Compositional Learning for Low-shot Scene Graph Generation
Many scene graph generation (SGG) models solely use the limited annotated relation triples for training.
We propose a novel semantic compositional learning strategy that makes it possible to construct additional, realistic relation triples.
For three recent SGG models, adding our strategy improves their performance by close to 50%, and all of them substantially exceed the current state-of-the-art.
arXiv Detail & Related papers (2021-08-19T10:13:55Z)
- Free-form Description Guided 3D Visual Graph Network for Object Grounding in Point Cloud
3D object grounding aims to locate the most relevant target object in a raw point cloud scene based on a free-form language description.
We propose a language scene graph module to capture the rich structure and long-distance phrase correlations.
Secondly, we introduce a multi-level 3D proposal relation graph module to extract the object-object and object-scene co-occurrence relationships.
arXiv Detail & Related papers (2021-03-30T14:22:36Z)
- Structure-Augmented Text Representation Learning for Efficient Knowledge Graph Completion
Human-curated knowledge graphs provide critical supportive information to various natural language processing tasks.
These graphs are usually incomplete, motivating their automatic completion.
Graph embedding approaches, e.g., TransE, learn structured knowledge by representing graph elements as dense embeddings.
Textual encoding approaches, e.g., KG-BERT, resort to the text of graph triples and triple-level contextualized representations.
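For reference, the TransE scoring function mentioned above models a relation as a translation in embedding space: a triple (head, relation, tail) is plausible when tail ≈ head + relation. The embeddings below are toy vectors for illustration only:

```python
# TransE plausibility score: distance ||h + r - t|| under the L1 or L2 norm.
# Lower distance means a more plausible triple. Toy embeddings for illustration.
import numpy as np

def transe_score(h: np.ndarray, r: np.ndarray, t: np.ndarray, norm: int = 1) -> float:
    """Score a knowledge-graph triple; lower is more plausible."""
    return float(np.linalg.norm(h + r - t, ord=norm))

# If t is close to h + r, the triple scores near zero.
h = np.array([0.2, 0.5])
r = np.array([0.3, -0.1])
t = np.array([0.5, 0.4])
```

During training, embeddings are learned so that observed triples score lower than corrupted (negative) ones, which is what lets the model rank candidate completions for the missing entries of the graph.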
arXiv Detail & Related papers (2020-04-30T13:50:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.