Parts2Words: Learning Joint Embedding of Point Clouds and Texts by
Bidirectional Matching between Parts and Words
- URL: http://arxiv.org/abs/2107.01872v2
- Date: Sun, 2 Apr 2023 02:12:38 GMT
- Title: Parts2Words: Learning Joint Embedding of Point Clouds and Texts by
Bidirectional Matching between Parts and Words
- Authors: Chuan Tang, Xi Yang, Bojian Wu, Zhizhong Han, Yi Chang
- Abstract summary: We propose to learn joint embedding of point clouds and texts by bidirectional matching between parts from shapes and words from texts.
Specifically, we first segment the point clouds into parts, and then leverage an optimal transport method to match parts and words in an optimized feature space.
Experiments demonstrate that our method achieves a significant improvement in accuracy over state-of-the-art methods on multi-modal retrieval tasks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Shape-Text matching is an important task of high-level shape understanding.
Current methods mainly represent a 3D shape as multiple 2D rendered views,
which cannot be understood well due to the structural ambiguity caused by
self-occlusion in a limited number of views. To resolve this issue,
we directly represent 3D shapes as point clouds, and propose to learn joint
embedding of point clouds and texts by bidirectional matching between parts
from shapes and words from texts. Specifically, we first segment the point
clouds into parts, and then leverage an optimal transport method to match parts
and words in an optimized feature space, where each part is represented by
aggregating features of all points within it and each word is abstracted by its
contextual information. We optimize the feature space to enlarge the
similarities between paired training samples while simultaneously
maximizing the margin between unpaired ones. Experiments demonstrate that
our method achieves a significant improvement in accuracy over state-of-the-art
methods on multi-modal retrieval tasks on the Text2Shape dataset. Code is available
at https://github.com/JLUtangchuan/Parts2Words.
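The pipeline in the abstract has two core ingredients: an optimal transport step that bidirectionally matches part features to word features, and a margin objective that pulls paired shape-text samples together while pushing unpaired ones apart. The following is a minimal PyTorch sketch of those two ingredients, with assumed shapes and hyperparameters; it is an illustration, not the released implementation (see the repository above for that).
```python
# Minimal sketch (not the authors' released code) of part-word matching via
# entropic optimal transport plus a margin objective over pair similarities.
# All shapes and hyperparameters (eps, n_iters, margin) are assumptions.
import torch
import torch.nn.functional as F

def sinkhorn(cost, n_iters=50, eps=0.05):
    """Entropic OT with uniform marginals; cost is a (P, W) matrix."""
    P, W = cost.shape
    K = torch.exp(-cost / eps)                          # Gibbs kernel
    a = torch.full((P,), 1.0 / P, device=cost.device)   # uniform part mass
    b = torch.full((W,), 1.0 / W, device=cost.device)   # uniform word mass
    u, v = torch.ones_like(a), torch.ones_like(b)
    for _ in range(n_iters):                            # alternating scaling
        u = a / (K @ v)
        v = b / (K.t() @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)          # transport plan

def shape_text_similarity(part_feats, word_feats):
    """Score one shape-text pair: parts (P, D) vs. words (W, D)."""
    part_feats = F.normalize(part_feats, dim=-1)
    word_feats = F.normalize(word_feats, dim=-1)
    sim = part_feats @ word_feats.t()                   # cosine similarities
    plan = sinkhorn(1.0 - sim)                          # cost = 1 - similarity
    return (plan * sim).sum()                           # OT-weighted similarity

def margin_loss(pos, neg_shape, neg_text, margin=0.2):
    """Paired similarity should beat the hardest unpaired similarity by
    `margin` in both shape-to-text and text-to-shape directions."""
    return F.relu(margin - pos + neg_shape) + F.relu(margin - pos + neg_text)

# Example with hypothetical sizes: 8 parts, 12 words, 256-dim features.
score = shape_text_similarity(torch.randn(8, 256), torch.randn(12, 256))
```
Uniform part and word masses are the simplest choice of marginals here; the paper's actual formulation may weight them differently.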
Related papers
- LESS: Label-Efficient and Single-Stage Referring 3D Segmentation
Referring 3D segmentation is a vision-language task that segments all points of the object specified by a query sentence from a 3D point cloud.
We propose a novel referring 3D pipeline, Label-Efficient and Single-Stage, dubbed LESS, which requires only efficient binary mask supervision.
We achieve state-of-the-art performance on the ScanRefer dataset, surpassing previous methods by about 3.7% mIoU using only binary labels.
arXiv Detail & Related papers (2024-10-17T07:47:41Z)
- GHOST: Grounded Human Motion Generation with Open Vocabulary Scene-and-Text Contexts
We propose a method that integrates an open vocabulary scene encoder into the architecture, establishing a robust connection between text and scene.
Our methodology achieves up to a 30% reduction in the goal object distance metric compared to the prior state-of-the-art baseline model.
arXiv Detail & Related papers (2024-04-08T18:24:12Z)
- Looking at words and points with attention: a benchmark for text-to-shape coherence
The evaluation of coherence between generated 3D shapes and input textual descriptions lacks a clear benchmark.
We employ large language models to automatically refine descriptions associated with shapes.
To validate our approach, we conduct a user study and compare quantitatively our metric with existing ones.
The refined dataset, the new metric and a set of text-shape pairs validated by the user study comprise a novel, fine-grained benchmark.
arXiv Detail & Related papers (2023-09-14T17:59:48Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of the classification, segmentation, and recognition branches, resulting in deeper feature sharing (a minimal sketch of such shared-backbone multi-task modeling follows this entry).
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
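To make the TextFormer entry's claim of mutual multi-branch training concrete, here is a generic shared-backbone sketch in PyTorch. It is a hypothetical illustration of joint feature sharing across classification, segmentation, and recognition heads, not TextFormer's actual design; all module names and sizes are assumptions.
```python
# Hypothetical shared-backbone skeleton: one encoder, three task heads, so
# gradients from every branch update the same features. NOT TextFormer's code.
import torch
import torch.nn as nn

class MultiTaskSpotter(nn.Module):
    def __init__(self, dim=256, num_classes=2, vocab_size=97):
        super().__init__()
        self.encoder = nn.Sequential(                   # shared image features
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU())
        self.cls_head = nn.Conv2d(dim, num_classes, 1)  # text / non-text
        self.seg_head = nn.Conv2d(dim, 1, 1)            # mask logits
        self.rec_head = nn.Conv2d(dim, vocab_size, 1)   # per-location chars

    def forward(self, images):
        feats = self.encoder(images)        # every branch reads, and
        return (self.cls_head(feats),       # back-propagates into, the
                self.seg_head(feats),       # same shared features
                self.rec_head(feats))
```
Summing the three branch losses trains all heads jointly through the shared encoder, which is the essence of "deeper feature sharing".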
- EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding
3D visual grounding aims to find, within point clouds, the object mentioned by free-form natural language descriptions with rich semantic cues.
We present EDA that Explicitly Decouples the textual attributes in a sentence.
We further introduce a new visual grounding task, locating objects without object names, which can thoroughly evaluate the model's dense alignment capacity.
arXiv Detail & Related papers (2022-09-29T17:00:22Z)
- ISS: Image as Stepping Stone for Text-Guided 3D Shape Generation
This paper presents a new framework called Image as Stepping Stone (ISS) for the task, introducing the 2D image as a stepping stone to connect the two modalities.
Our key contribution is a two-stage feature-space-alignment approach that maps CLIP features to shapes.
We formulate a text-guided shape stylization module to dress up the output shapes with novel textures.
arXiv Detail & Related papers (2022-09-09T06:54:21Z)
- Towards Implicit Text-Guided 3D Shape Generation
This work explores the challenging task of generating 3D shapes from text.
We propose a new approach for text-guided 3D shape generation, capable of producing high-fidelity shapes with colors that match the given text description.
arXiv Detail & Related papers (2022-03-28T10:20:03Z)
- ABCNet v2: Adaptive Bezier-Curve Network for Real-time End-to-end Text Spotting
End-to-end text-spotting aims to integrate detection and recognition in a unified framework.
Here, we tackle end-to-end text spotting by presenting the Adaptive Bezier Curve Network v2 (ABCNet v2).
Our main contributions are four-fold: 1) For the first time, we adaptively fit arbitrarily-shaped text with a parameterized Bezier curve, which, compared with segmentation-based methods, provides not only structured output but also a controllable representation (see the Bezier sketch after this entry).
Comprehensive experiments conducted on various bilingual (English and Chinese) benchmark datasets demonstrate that ABCNet v2 can achieve state-of-the-art performance.
arXiv Detail & Related papers (2021-05-08T07:46:55Z)
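The ABCNet v2 entry above rests on one concrete idea: representing an arbitrarily-shaped text boundary by a parameterized Bezier curve. As a reference point, a cubic Bezier curve with control points p0..p3 evaluates to B(t) = (1-t)^3 p0 + 3(1-t)^2 t p1 + 3(1-t) t^2 p2 + t^3 p3 for t in [0, 1]; the sketch below (hypothetical, not ABCNet's code) samples points on such a curve.
```python
# Sample points on a cubic Bezier curve from four control points.
# Illustrative only; control points here are hypothetical.
import torch

def cubic_bezier(ctrl, n_samples=20):
    """ctrl: (4, 2) control points; returns (n_samples, 2) curve points."""
    t = torch.linspace(0.0, 1.0, n_samples).unsqueeze(1)   # (n, 1)
    basis = torch.cat([(1 - t) ** 3,                       # Bernstein basis
                       3 * (1 - t) ** 2 * t,               # for degree 3
                       3 * (1 - t) * t ** 2,
                       t ** 3], dim=1)                     # (n, 4)
    return basis @ ctrl                                    # (n, 2)

# Example: four hypothetical control points tracing a curved baseline.
pts = cubic_bezier(torch.tensor([[0., 0.], [1., 1.], [2., -1.], [3., 0.]]))
```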
- PuzzleNet: Scene Text Detection by Segment Context Graph Learning
We propose a novel decomposition-based method, termed Puzzle Networks (PuzzleNet), to address the challenging scene text detection task.
By building segments as context graphs, MSGCN effectively employs segment context to predict combinations of segments.
Our method achieves performance better than or comparable to current state-of-the-art methods, benefiting from the exploitation of the segment context graph.
arXiv Detail & Related papers (2020-02-26T09:21:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.