CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation
- URL: http://arxiv.org/abs/2110.02624v1
- Date: Wed, 6 Oct 2021 09:55:19 GMT
- Title: CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation
- Authors: Aditya Sanghi and Hang Chu and Joseph G. Lambourne and Ye Wang and
Chin-Yi Cheng and Marco Fumero
- Abstract summary: We present a simple yet effective method for zero-shot text-to-shape generation based on a two-stage training process.
Our method not only demonstrates promising zero-shot generalization, but also avoids expensive inference time optimization.
- Score: 16.59461081771521
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: While recent progress has been made in text-to-image generation,
text-to-shape generation remains a challenging problem due to the
unavailability of paired text and shape data at a large scale. We present a
simple yet effective method for zero-shot text-to-shape generation based on a
two-stage training process, which only depends on an unlabelled shape dataset
and a pre-trained image-text network such as CLIP. Our method not only
demonstrates promising zero-shot generalization, but also avoids expensive
inference time optimization and can generate multiple shapes for a given text.
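The two-stage recipe in the abstract can be pictured in code. Below is a minimal PyTorch sketch, not the authors' implementation: module names, dimensions, and the conditional Gaussian prior are illustrative assumptions (the paper itself trains a conditional normalizing flow over shape latents). The key point is that training only ever touches CLIP *image* embeddings of rendered shapes, while inference substitutes the CLIP *text* embedding.

```python
import torch
import torch.nn as nn

LATENT_DIM = 128   # shape latent size (illustrative)
CLIP_DIM = 512     # CLIP ViT-B/32 embedding size
VOXELS = 32 ** 3   # flattened 32^3 occupancy grid (illustrative)

class ShapeAutoencoder(nn.Module):
    """Stage 1: learn shape latents from unlabelled shapes alone."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(VOXELS, LATENT_DIM)
        self.dec = nn.Sequential(nn.Linear(LATENT_DIM, VOXELS), nn.Sigmoid())

    def forward(self, vox):
        z = self.enc(vox)
        return self.dec(z), z

class ConditionalPrior(nn.Module):
    """Stage 2: map a CLIP embedding to a distribution over shape latents.
    The paper uses a conditional normalizing flow; a conditional Gaussian
    stands in here to keep the sketch short."""
    def __init__(self):
        super().__init__()
        self.mu = nn.Linear(CLIP_DIM, LATENT_DIM)
        self.log_sigma = nn.Linear(CLIP_DIM, LATENT_DIM)

    def sample(self, clip_emb, n_samples=1):
        mu, sigma = self.mu(clip_emb), self.log_sigma(clip_emb).exp()
        # Sampling several latents yields multiple shapes for one prompt.
        return mu + sigma * torch.randn(n_samples, LATENT_DIM)

# Training: encode *rendered images* of each shape with CLIP, so no text
# labels are needed. Inference: feed the CLIP *text* embedding of a prompt
# to the prior and decode the sampled latents with the stage-1 decoder.
ae, prior = ShapeAutoencoder(), ConditionalPrior()
text_emb = torch.randn(1, CLIP_DIM)            # stand-in for clip.encode_text(...)
shapes = ae.dec(prior.sample(text_emb, n_samples=4))  # 4 candidate shapes
print(shapes.shape)                            # torch.Size([4, 32768])
```

Because the decoder and the prior are decoupled, generating extra shapes for the same text is just extra sampling, with no per-prompt optimization.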
Related papers
- EXIM: A Hybrid Explicit-Implicit Representation for Text-Guided 3D Shape
Generation [124.27302003578903]
This paper presents a new text-guided technique for generating 3D shapes.
We leverage a hybrid 3D representation, namely EXIM, combining the strengths of explicit and implicit representations.
We demonstrate the applicability of our approach to generate indoor scenes with consistent styles using text-induced 3D shapes.
arXiv Detail & Related papers (2023-11-03T05:01:51Z)
- LRANet: Towards Accurate and Efficient Scene Text Detection with
Low-Rank Approximation Network [63.554061288184165]
We propose a novel parameterized text shape method based on low-rank approximation.
By exploring the shape correlation among different text contours, our method achieves consistency, compactness, simplicity, and robustness in shape representation.
We implement an accurate and efficient arbitrary-shaped text detector named LRANet.
arXiv Detail & Related papers (2023-06-27T02:03:46Z)
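The low-rank parameterization behind LRANet can be illustrated with plain SVD over aligned contour points: stack many text contours as vectors, keep the top singular vectors as a basis, and describe any contour by a handful of coefficients. A minimal NumPy sketch with made-up data standing in for real contour annotations (the paper's exact basis construction and regression head are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in training data: 500 text contours, each 14 (x, y) points,
# flattened to 28-dim vectors. Real data would come from annotations.
contours = rng.normal(size=(500, 28))

# Low-rank basis: top-k right singular vectors of the centered data.
mean = contours.mean(axis=0)
_, _, vt = np.linalg.svd(contours - mean, full_matrices=False)
k = 6                      # rank: a few coefficients per contour
basis = vt[:k]             # (k, 28)

# Each contour is now k coefficients instead of 28 raw numbers; a detector
# only needs to regress these k values per text instance. On real contour
# data (unlike this random stand-in) a small k captures most variance.
coeffs = (contours[0] - mean) @ basis.T        # encode
reconstruction = mean + coeffs @ basis         # decode
print(np.abs(reconstruction - contours[0]).max())
```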
- ZeroForge: Feedforward Text-to-Shape Without 3D Supervision [24.558721379714694]
We present ZeroForge, an approach for zero-shot text-to-shape generation that avoids both the need for 3D supervision and expensive inference-time optimization.
To achieve open-vocabulary shape generation, we require careful architectural adaptation of existing feed-forward approaches.
arXiv Detail & Related papers (2023-06-14T00:38:14Z)
- Variational Distribution Learning for Unsupervised Text-to-Image
Generation [42.3246826401366]
We propose a text-to-image generation algorithm based on deep neural networks when text captions for images are unavailable during training.
We employ a pretrained CLIP model, which is capable of properly aligning embeddings of images and corresponding texts in a joint space.
We optimize a text-to-image generation model by maximizing the data log-likelihood conditioned on pairs of image-text CLIP embeddings.
arXiv Detail & Related papers (2023-03-28T16:18:56Z)
- ISS: Image as Stepping Stone for Text-Guided 3D Shape Generation [91.37036638939622]
This paper presents a new framework called Image as Stepping Stone (ISS) for the task, introducing 2D images as a stepping stone to connect the two modalities.
Our key contribution is a two-stage feature-space-alignment approach that maps CLIP features to shapes.
We formulate a text-guided shape stylization module to dress up the output shapes with novel textures.
arXiv Detail & Related papers (2022-09-09T06:54:21Z)
- Towards Implicit Text-Guided 3D Shape Generation [81.22491096132507]
This work explores the challenging task of generating 3D shapes from text.
We propose a new approach for text-guided 3D shape generation, capable of producing high-fidelity shapes with colors that match the given text description.
arXiv Detail & Related papers (2022-03-28T10:20:03Z)
- LAFITE: Towards Language-Free Training for Text-to-Image Generation [83.2935513540494]
We propose the first method for training text-to-image generation models without any text data.
Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model.
We obtain state-of-the-art results in the standard text-to-image generation tasks.
arXiv Detail & Related papers (2021-11-27T01:54:45Z)
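LAFITE's language-free trick relies on CLIP's joint space: because matching images and captions embed close together, a perturbed image embedding can stand in for the missing text embedding during training. A rough sketch of that substitution follows; the noise scheme is simplified from the paper's fixed and adaptive variants, and `eps` is an illustrative hyperparameter:

```python
import torch
import torch.nn.functional as F

def pseudo_text_feature(img_emb: torch.Tensor, eps: float = 0.1) -> torch.Tensor:
    """Fabricate a 'text' feature from an image feature in CLIP space.

    Because CLIP aligns the two modalities, a small random perturbation of
    the normalized image embedding behaves like a plausible caption
    embedding, letting the generator train with no captions at all.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    noise = F.normalize(torch.randn_like(img_emb), dim=-1)
    return F.normalize(img_emb + eps * noise, dim=-1)

# During training the generator is conditioned on pseudo_text_feature(h_img);
# at inference it is conditioned on the real CLIP text embedding instead.
h_img = torch.randn(8, 512)            # stand-in for CLIP image features
h_txt = pseudo_text_feature(h_img)
print(F.cosine_similarity(F.normalize(h_img, dim=-1), h_txt).mean())
```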
- POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training [93.79766670391618]
We present POINTER, a novel insertion-based approach for hard-constrained text generation.
The proposed method operates by progressively inserting new tokens between existing tokens in a parallel manner.
The resulting coarse-to-fine hierarchy makes the generation process intuitive and interpretable.
arXiv Detail & Related papers (2020-05-01T18:11:54Z)
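The progressive insertion loop is easy to picture with a toy stand-in for the trained model. The sketch below only illustrates the coarse-to-fine control flow: `propose` is a hypothetical stub where POINTER uses a pretrained insertion transformer, and real decoding stops when the model predicts a special no-insertion token at every gap.

```python
import random
from typing import List, Optional

random.seed(0)

def propose(left: str, right: str) -> Optional[str]:
    """Stand-in for POINTER's insertion model: given two neighboring
    tokens, predict a token to insert between them, or None to leave
    the gap closed."""
    return random.choice([None, None, "the", "of", "new"])

def generate(keywords: List[str], max_rounds: int = 4) -> List[str]:
    tokens = list(keywords)        # coarse stage: just the hard constraints
    for _ in range(max_rounds):    # each round refines the whole sequence
        out, inserted = [tokens[0]], False
        for left, right in zip(tokens, tokens[1:]):
            tok = propose(left, right)   # all gaps are scored in parallel
            if tok is not None:
                out.append(tok)
                inserted = True
            out.append(right)
        tokens = out
        if not inserted:           # fixed point: no gap wants an insertion
            break
    return tokens

# The constraint keywords survive every round, since tokens are only
# ever inserted, never deleted, which is what makes the constraints hard.
print(generate(["sources", "sequence", "tokens"]))
```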
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.