Contextual Modeling for 3D Dense Captioning on Point Clouds
- URL: http://arxiv.org/abs/2210.03925v1
- Date: Sat, 8 Oct 2022 05:33:00 GMT
- Title: Contextual Modeling for 3D Dense Captioning on Point Clouds
- Authors: Yufeng Zhong, Long Xu, Jiebo Luo, Lin Ma
- Abstract summary: 3D dense captioning, as an emerging vision-language task, aims to identify and locate each object from a set of point clouds and describe each located object with a natural language sentence.
We propose two separate modules, namely Global Context Modeling (GCM) and Local Context Modeling (LCM), which operate in a coarse-to-fine manner.
Our proposed model can effectively characterize the object representations and contextual information.
- Score: 85.68339840274857
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D dense captioning, as an emerging vision-language task, aims to identify
and locate each object from a set of point clouds and generate a distinctive
natural language sentence for describing each located object. However, the
existing methods mainly focus on mining inter-object relationships, while
ignoring contextual information, especially the non-object details and
background environment within the point clouds, thus leading to low-quality
descriptions, such as inaccurate relative position information. In this paper,
we make the first attempt to utilize point cloud clustering features as
contextual information to supply the non-object details and background
environment of the point clouds, and incorporate them into the 3D dense
captioning task. We propose two separate modules, namely Global Context
Modeling (GCM) and Local Context Modeling (LCM), which perform the contextual
modeling of the point clouds in a coarse-to-fine manner. Specifically, the GCM
module captures the inter-object relationships among all objects with global
contextual information to obtain more complete scene information of the whole
point clouds. The LCM module exploits the influence of the neighboring objects
of the target object and local contextual information to enrich the object
representations. With such global and local contextual modeling strategies, our
proposed model can effectively characterize the object representations and
contextual information and thereby generate comprehensive and detailed
descriptions of the located objects. Extensive experiments on the ScanRefer and
Nr3D datasets demonstrate that our proposed method sets a new record on the 3D
dense captioning task, and verify the effectiveness of our proposed contextual
modeling of point clouds.
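The abstract stops short of implementation detail, so below is a minimal sketch of how the two modules could be realized, assuming object proposal features, clustered context features, and box centers as inputs; the class names, shapes, and design choices (cross-attention for GCM, k-nearest-neighbor averaging for LCM) are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the GCM/LCM idea; shapes, names, and design
# choices are assumptions for illustration, not the authors' code.
import torch
import torch.nn as nn

class GlobalContextModeling(nn.Module):
    """Cross-attend from object proposals to all clustered context features."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, objects, context):
        # objects: (B, N_obj, d) proposal features
        # context: (B, N_ctx, d) point cloud clustering features
        attended, _ = self.attn(objects, context, context)
        return self.norm(objects + attended)

class LocalContextModeling(nn.Module):
    """Enrich each object with features of its k nearest neighboring objects."""
    def __init__(self, d_model=256, k=4):
        super().__init__()
        self.k = k
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, objects, centers):
        # objects: (B, N_obj, d) features; centers: (B, N_obj, 3) box centers
        dist = torch.cdist(centers, centers)                         # (B, N, N)
        idx = dist.topk(self.k + 1, largest=False).indices[..., 1:]  # drop self
        b = torch.arange(objects.size(0), device=objects.device).view(-1, 1, 1)
        neighbors = objects[b, idx]                                  # (B, N, k, d)
        return self.fuse(torch.cat([objects, neighbors.mean(dim=2)], dim=-1))
```

Under these assumptions a forward pass would chain the two modules, e.g. `refined = lcm(gcm(obj_feats, ctx_feats), centers)`, with the refined features feeding the caption decoder.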
Related papers
- Chat-3D v2: Bridging 3D Scene and Large Language Models with Object Identifiers [62.232809030044116]
We introduce the use of object identifiers to freely reference objects during a conversation.
We propose a two-stage alignment method, which involves learning an attribute-aware token and a relation-aware token for each object.
Experiments conducted on traditional datasets like ScanQA, ScanRefer, and Nr3D/Sr3D showcase the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
- Multi3DRefer: Grounding Text Description to Multiple 3D Objects [15.54885309441946]
We introduce the task of localizing a flexible number of objects in real-world 3D scenes using natural language descriptions.
Our dataset contains 61,926 descriptions of 11,609 objects, where each description references zero, one, or multiple target objects.
We develop a stronger baseline that leverages 2D features from CLIP by rendering proposals online and training with contrastive learning; it outperforms the state of the art on the ScanRefer benchmark.
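The summary names the ingredients (online-rendered proposals, 2D CLIP features, contrastive learning) but not the objective; a hedged sketch of a CLIP-style contrastive step follows, assuming each proposal rendering has already been encoded by a CLIP image encoder and is paired row-wise with its description embedding. The function name and pairing scheme are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def proposal_text_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """InfoNCE-style loss between rendered-proposal features and description
    features; row i of each (N, d) tensor is assumed to describe the same
    object, so the diagonal of the similarity matrix holds the positives."""
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    logits = img @ txt.t() / temperature              # (N, N) similarities
    targets = torch.arange(len(logits), device=logits.device)
    # Symmetric cross-entropy: proposals-to-texts and texts-to-proposals.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```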
arXiv Detail & Related papers (2023-09-11T06:03:39Z)
- PointLLM: Empowering Large Language Models to Understand Point Clouds [67.1783384610417]
PointLLM understands colored object point clouds with human instructions.
It generates contextually appropriate responses, illustrating its grasp of point clouds and common sense.
arXiv Detail & Related papers (2023-08-31T17:59:46Z)
- DesCo: Learning Object Recognition with Rich Language Descriptions [93.8177229428617]
Recent developments in vision-language approaches have instigated a paradigm shift in learning visual recognition models from language supervision.
We propose a new description-conditioned (DesCo) paradigm of learning object recognition models with rich language descriptions.
arXiv Detail & Related papers (2023-06-24T21:05:02Z)
- Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) a Position-Aware Module (PAM), which provides position information of all objects related to the natural language description, and 2) a Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
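The summary does not say how PAM injects position information; one plausible minimal reading, sketched below purely as an assumption, encodes each object's normalized box geometry with a small MLP and adds it to the object features. All names here are hypothetical.

```python
import torch
import torch.nn as nn

class PositionEncoder(nn.Module):
    """Hypothetical sketch in the spirit of PCAN's PAM: fuse explicit
    position cues into object features. The actual module is more involved."""
    def __init__(self, d_model=256):
        super().__init__()
        # Encode (x, y, w, h) of each object box, normalized to [0, 1].
        self.pos_mlp = nn.Sequential(
            nn.Linear(4, d_model), nn.ReLU(), nn.Linear(d_model, d_model))

    def forward(self, obj_feats, boxes):
        # obj_feats: (B, N, d) object features; boxes: (B, N, 4)
        return obj_feats + self.pos_mlp(boxes)
```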
arXiv Detail & Related papers (2022-12-27T09:13:19Z)
- Free-form Description Guided 3D Visual Graph Network for Object Grounding in Point Cloud [39.055928838826226]
3D object grounding aims to locate the most relevant target object in a raw point cloud scene based on a free-form language description.
First, we propose a language scene graph module to capture the rich structure and long-distance phrase correlations.
Second, we introduce a multi-level 3D proposal relation graph module to extract the object-object and object-scene co-occurrence relationships.
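As a rough illustration of what a proposal relation graph computes, here is one round of message passing over a fully connected graph of 3D proposals; the paper's multi-level design, learned edge construction, and object-scene edges are omitted, and every name below is an assumption.

```python
import torch
import torch.nn as nn

class ProposalRelationGraph(nn.Module):
    """One message-passing round over proposal features (sketch only)."""
    def __init__(self, d_model=256):
        super().__init__()
        self.edge = nn.Linear(2 * d_model, d_model)   # message function
        self.update = nn.GRUCell(d_model, d_model)    # node update

    def forward(self, props):
        # props: (N, d) proposal features; fully connected for simplicity
        n, d = props.shape
        src = props.unsqueeze(1).expand(n, n, d)      # sender features
        dst = props.unsqueeze(0).expand(n, n, d)      # receiver features
        messages = torch.relu(self.edge(torch.cat([src, dst], dim=-1)))
        agg = messages.mean(dim=0)                    # mean over senders
        return self.update(agg, props)
```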
arXiv Detail & Related papers (2021-03-30T14:22:36Z)
- MLCVNet: Multi-Level Context VoteNet for 3D Object Detection [51.45832752942529]
We propose Multi-Level Context VoteNet (MLCVNet) to recognize 3D objects correlatively, building on the state-of-the-art VoteNet.
We introduce three context modules into the voting and classifying stages of VoteNet to encode contextual information at different levels.
Our method effectively improves detection accuracy, achieving new state-of-the-art performance on challenging 3D object detection datasets.
arXiv Detail & Related papers (2020-04-12T19:10:24Z)