Contextual Modeling for 3D Dense Captioning on Point Clouds
- URL: http://arxiv.org/abs/2210.03925v1
- Date: Sat, 8 Oct 2022 05:33:00 GMT
- Title: Contextual Modeling for 3D Dense Captioning on Point Clouds
- Authors: Yufeng Zhong, Long Xu, Jiebo Luo, Lin Ma
- Abstract summary: 3D dense captioning, as an emerging vision-language task, aims to identify and locate each object from a set of point clouds and describe each located object with a natural language sentence.
We propose two separate modules, namely Global Context Modeling (GCM) and Local Context Modeling (LCM), which operate in a coarse-to-fine manner.
Our proposed model can effectively characterize the object representations and contextual information.
- Score: 85.68339840274857
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D dense captioning, as an emerging vision-language task, aims to identify
and locate each object from a set of point clouds and generate a distinctive
natural language sentence for describing each located object. However, the
existing methods mainly focus on mining inter-object relationships, while
ignoring contextual information, especially the non-object details and
background environment within the point clouds, thus leading to low-quality
descriptions, such as inaccurate relative position information. In this paper,
we make the first attempt to utilize point cloud clustering features as
contextual information to supply the non-object details and background
environment of the point clouds, and incorporate them into the 3D dense
captioning task. We propose two separate modules, namely Global Context
Modeling (GCM) and Local Context Modeling (LCM), which perform the contextual
modeling of the point clouds in a coarse-to-fine manner. Specifically, the GCM
module captures the inter-object relationships among all objects with global
contextual information to obtain more complete scene information of the whole
point clouds. The LCM module exploits the influence of the neighboring objects
of the target object and local contextual information to enrich the object
representations. With such global and local contextual modeling strategies, our
proposed model can effectively characterize the object representations and
contextual information and thereby generate comprehensive and detailed
descriptions of the located objects. Extensive experiments on the ScanRefer and
Nr3D datasets demonstrate that our proposed method sets a new record on the 3D
dense captioning task, and verify the effectiveness of our proposed contextual
modeling of point clouds.
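The abstract stops short of implementation detail, so below is a minimal sketch of how the two modules could be realized, assuming object proposal features, clustered context features, and box centers as inputs; the class names, shapes, and design choices (cross-attention for GCM, k-nearest-neighbor averaging for LCM) are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the GCM/LCM idea; shapes, names, and design
# choices are assumptions for illustration, not the authors' code.
import torch
import torch.nn as nn

class GlobalContextModeling(nn.Module):
    """Cross-attend from object proposals to all clustered context features."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, objects, context):
        # objects: (B, N_obj, d) proposal features
        # context: (B, N_ctx, d) point cloud clustering features
        attended, _ = self.attn(objects, context, context)
        return self.norm(objects + attended)

class LocalContextModeling(nn.Module):
    """Enrich each object with features of its k nearest neighboring objects."""
    def __init__(self, d_model=256, k=4):
        super().__init__()
        self.k = k
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, objects, centers):
        # objects: (B, N_obj, d) features; centers: (B, N_obj, 3) box centers
        dist = torch.cdist(centers, centers)                         # (B, N, N)
        idx = dist.topk(self.k + 1, largest=False).indices[..., 1:]  # drop self
        b = torch.arange(objects.size(0), device=objects.device).view(-1, 1, 1)
        neighbors = objects[b, idx]                                  # (B, N, k, d)
        return self.fuse(torch.cat([objects, neighbors.mean(dim=2)], dim=-1))
```

Under these assumptions a forward pass would chain the two modules, e.g. `refined = lcm(gcm(obj_feats, ctx_feats), centers)`, with the refined features feeding the caption decoder.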
Related papers
- Chat-3D v2: Bridging 3D Scene and Large Language Models with Object Identifiers [62.232809030044116]
We introduce the use of object identifiers to freely reference objects during a conversation.
We propose a two-stage alignment method, which involves learning an attribute-aware token and a relation-aware token for each object.
Experiments conducted on traditional datasets like ScanQA, ScanRefer, and Nr3D/Sr3D showcase the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
- Multi3DRefer: Grounding Text Description to Multiple 3D Objects [15.54885309441946]
We introduce the task of localizing a flexible number of objects in real-world 3D scenes using natural language descriptions.
Our dataset contains 61,926 descriptions of 11,609 objects, where each description references zero, one, or multiple target objects.
We develop a stronger baseline that leverages 2D features from CLIP by rendering proposals online and training with contrastive learning; it outperforms the state of the art on the ScanRefer benchmark.
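The summary names the ingredients (online-rendered proposals, 2D CLIP features, contrastive learning) but not the objective; a hedged sketch of a CLIP-style contrastive step follows, assuming each proposal rendering has already been encoded by a CLIP image encoder and is paired row-wise with its description embedding. The function name and pairing scheme are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def proposal_text_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """InfoNCE-style loss between rendered-proposal features and description
    features; row i of each (N, d) tensor is assumed to describe the same
    object, so the diagonal of the similarity matrix holds the positives."""
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    logits = img @ txt.t() / temperature              # (N, N) similarities
    targets = torch.arange(len(logits), device=logits.device)
    # Symmetric cross-entropy: proposals-to-texts and texts-to-proposals.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```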
arXiv Detail & Related papers (2023-09-11T06:03:39Z)
- PointLLM: Empowering Large Language Models to Understand Point Clouds [67.1783384610417]
PointLLM understands colored object point clouds with human instructions.
It generates contextually appropriate responses, illustrating its grasp of point clouds and common sense.
arXiv Detail & Related papers (2023-08-31T17:59:46Z)
- DesCo: Learning Object Recognition with Rich Language Descriptions [93.8177229428617]
Recent developments in vision-language approaches have instigated a paradigm shift in learning visual recognition models from language supervision.
We propose a new description-conditioned (DesCo) paradigm of learning object recognition models with rich language descriptions.
arXiv Detail & Related papers (2023-06-24T21:05:02Z)
- Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) a Position-Aware Module (PAM), which provides position information of all objects related to the natural language description, and 2) a Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
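The summary does not say how PAM injects position information; one plausible minimal reading, sketched below purely as an assumption, encodes each object's normalized box geometry with a small MLP and adds it to the object features. All names here are hypothetical.

```python
import torch
import torch.nn as nn

class PositionEncoder(nn.Module):
    """Hypothetical sketch in the spirit of PCAN's PAM: fuse explicit
    position cues into object features. The actual module is more involved."""
    def __init__(self, d_model=256):
        super().__init__()
        # Encode (x, y, w, h) of each object box, normalized to [0, 1].
        self.pos_mlp = nn.Sequential(
            nn.Linear(4, d_model), nn.ReLU(), nn.Linear(d_model, d_model))

    def forward(self, obj_feats, boxes):
        # obj_feats: (B, N, d) object features; boxes: (B, N, 4)
        return obj_feats + self.pos_mlp(boxes)
```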
arXiv Detail & Related papers (2022-12-27T09:13:19Z)
- Free-form Description Guided 3D Visual Graph Network for Object Grounding in Point Cloud [39.055928838826226]
3D object grounding aims to locate the most relevant target object in a raw point cloud scene based on a free-form language description.
First, we propose a language scene graph module to capture the rich structure and long-distance phrase correlations.
Second, we introduce a multi-level 3D proposal relation graph module to extract the object-object and object-scene co-occurrence relationships.
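As a rough illustration of what a proposal relation graph computes, here is one round of message passing over a fully connected graph of 3D proposals; the paper's multi-level design, learned edge construction, and object-scene edges are omitted, and every name below is an assumption.

```python
import torch
import torch.nn as nn

class ProposalRelationGraph(nn.Module):
    """One message-passing round over proposal features (sketch only)."""
    def __init__(self, d_model=256):
        super().__init__()
        self.edge = nn.Linear(2 * d_model, d_model)   # message function
        self.update = nn.GRUCell(d_model, d_model)    # node update

    def forward(self, props):
        # props: (N, d) proposal features; fully connected for simplicity
        n, d = props.shape
        src = props.unsqueeze(1).expand(n, n, d)      # sender features
        dst = props.unsqueeze(0).expand(n, n, d)      # receiver features
        messages = torch.relu(self.edge(torch.cat([src, dst], dim=-1)))
        agg = messages.mean(dim=0)                    # mean over senders
        return self.update(agg, props)
```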
arXiv Detail & Related papers (2021-03-30T14:22:36Z)
- MLCVNet: Multi-Level Context VoteNet for 3D Object Detection [51.45832752942529]
We propose Multi-Level Context VoteNet (MLCVNet) to recognize 3D objects correlatively, building on the state-of-the-art VoteNet.
We introduce three context modules into the voting and classifying stages of VoteNet to encode contextual information at different levels.
Our method effectively improves detection accuracy, achieving new state-of-the-art performance on challenging 3D object detection datasets.
arXiv Detail & Related papers (2020-04-12T19:10:24Z)