C3Net: Compound Conditioned ControlNet for Multimodal Content Generation
- URL: http://arxiv.org/abs/2311.17951v1
- Date: Wed, 29 Nov 2023 07:11:56 GMT
- Title: C3Net: Compound Conditioned ControlNet for Multimodal Content Generation
- Authors: Juntao Zhang, Yuehuai Liu, Yu-Wing Tai, Chi-Keung Tang
- Abstract summary: Compound Conditioned ControlNet, C3Net, is a novel generative neural architecture taking conditions from multiple modalities simultaneously.
C3Net adapts the ControlNet architecture to jointly train and run inference on a production-ready diffusion model and its trainable copies.
- Score: 67.5090755991599
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Compound Conditioned ControlNet, C3Net, a novel generative neural
architecture that takes conditions from multiple modalities and synthesizes
multimodal content simultaneously (e.g., image, text, audio). C3Net adapts the
ControlNet architecture to jointly train and run inference on a
production-ready diffusion model and its trainable copies. Specifically, C3Net
first aligns the conditions from multiple modalities to the same semantic latent
space using modality-specific encoders based on contrastive training. Then, it
generates multimodal outputs from the aligned latent space, whose semantic
information is combined using a ControlNet-like architecture called Control
C3-UNet. With this system design, our model offers an improved solution for
joint-modality generation by learning and explaining multimodal conditions
instead of simply taking linear interpolations in the latent space. Meanwhile,
because conditions are aligned to a unified latent space, C3Net requires only
one trainable Control C3-UNet to process multimodal semantic information.
Furthermore, our model employs unimodal pretraining in the condition alignment
stage, outperforming non-pretrained alignment even on relatively scarce training
data and thus demonstrating high-quality compound condition generation. We
contribute the first high-quality tri-modal validation set to validate
quantitatively that C3Net outperforms or is on par with the first and
contemporary state-of-the-art multimodal generation methods. Our code and
tri-modal dataset will be released.
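To make the pipeline in the abstract concrete, below is a minimal, illustrative sketch of the two stages it describes: contrastive alignment of modality-specific encoders into a shared latent space, and a trainable Control C3-UNet-style module that fuses the aligned conditions into zero-initialized residuals for a frozen base diffusion model. All class names, dimensions, and the exact injection mechanism are assumptions for illustration, not the authors' released code.

```python
# Illustrative sketch only: names, sizes, and the way control signals are
# injected are assumptions, not the C3Net authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Projects one modality's features (image, text, or audio) into the shared latent space."""
    def __init__(self, in_dim: int, latent_dim: int = 512):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, latent_dim), nn.GELU(),
                                  nn.Linear(latent_dim, latent_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(x), dim=-1)  # unit norm for contrastive training

def contrastive_alignment_loss(z_a: torch.Tensor, z_b: torch.Tensor, tau: float = 0.07):
    """Symmetric InfoNCE loss pulling paired embeddings of two modalities together."""
    logits = z_a @ z_b.t() / tau
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

class ControlC3UNetSketch(nn.Module):
    """Stand-in for the trainable Control C3-UNet: fuses aligned condition
    embeddings and emits zero-initialized residuals that a frozen base
    diffusion UNet would add at each level (ControlNet-style)."""
    def __init__(self, latent_dim: int = 512, n_levels: int = 4):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.GELU())
        self.zero_layers = nn.ModuleList(
            [nn.Linear(latent_dim, latent_dim) for _ in range(n_levels)])
        for layer in self.zero_layers:    # zero init: the control branch contributes
            nn.init.zeros_(layer.weight)  # nothing at the start of training, so the
            nn.init.zeros_(layer.bias)    # frozen base model's behaviour is preserved

    def forward(self, cond_embeddings: list[torch.Tensor]) -> list[torch.Tensor]:
        fused = self.fuse(torch.stack(cond_embeddings).mean(dim=0))
        return [zl(fused) for zl in self.zero_layers]
```

The zero-initialized output layers mirror ControlNet's zero convolutions: at the start of training the frozen base model is unchanged, and the control branch gradually learns to steer it from the fused multimodal conditions.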
Related papers
- Multimodality Helps Few-Shot 3D Point Cloud Semantic Segmentation [61.91492500828508]
Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal support samples.
We introduce a cost-free multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality.
We propose a simple yet effective Test-time Adaptive Cross-modal Seg (TACC) technique to mitigate training bias.
arXiv Detail & Related papers (2024-10-29T19:28:41Z) - YOLOO: You Only Learn from Others Once [43.46068978805732]
We propose YOLOO, a novel multi-modal 3D MOT paradigm: You Only Learn from Others Once.
YOLOO empowers the point cloud encoder to learn a unified tri-modal representation (UTR) from point clouds and other modalities, such as images and textual cues, all at once.
Specifically, YOLOO includes two core components: a unified tri-modal encoder (UTEnc) and a flexible geometric constraint (F-GC) module.
arXiv Detail & Related papers (2024-09-01T05:09:32Z) - S3: A Simple Strong Sample-effective Multimodal Dialog System [61.31055673156622]
We present a conceptually simple yet powerful baseline for the multimodal dialog task, an S3 model, that achieves near state-of-the-art results.
The system is based on a pre-trained large language model, pre-trained modality encoders for image and audio, and a trainable modality projector (a sketch of such a projector appears after this list).
arXiv Detail & Related papers (2024-06-26T12:45:43Z) - Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning [115.50132185963139]
CM3Leon is a decoder-only multi-modal language model capable of generating and infilling both text and images.
It is the first multi-modal model trained with a recipe adapted from text-only language models.
CM3Leon achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods.
arXiv Detail & Related papers (2023-09-05T21:27:27Z) - Any-to-Any Generation via Composable Diffusion [111.94094932032205]
Composable Diffusion (CoDi) is a novel generative model capable of generating any combination of output modalities.
CoDi can generate multiple modalities in parallel and its input is not limited to a subset of modalities like text or image.
Highly customizable and flexible, CoDi achieves strong joint-modality generation quality.
arXiv Detail & Related papers (2023-05-19T17:38:32Z) - Neural Attentive Circuits [93.95502541529115]
We introduce a general-purpose yet modular neural architecture called Neural Attentive Circuits (NACs).
NACs learn the parameterization and a sparse connectivity of neural modules without using domain knowledge.
NACs achieve an 8x speedup at inference time while losing less than 3% performance.
arXiv Detail & Related papers (2022-10-14T18:00:07Z) - Knowledge Perceived Multi-modal Pretraining in E-commerce [12.012793707741562]
Current multi-modal pretraining methods for image and text modalities lack robustness when modalities are missing or noisy.
We propose K3M, which introduces a knowledge modality into multi-modal pretraining to correct noisy and supplement missing image and text modalities.
arXiv Detail & Related papers (2021-08-20T08:01:28Z) - SrvfNet: A Generative Network for Unsupervised Multiple Diffeomorphic Shape Alignment [6.404122934568859]
SrvfNet is a generative deep learning framework for the joint multiple alignment of large collections of functional data.
Our proposed framework is fully unsupervised and is capable of aligning to a predefined template as well as jointly predicting an optimal template from data.
We demonstrate the strength of our framework by validating it on synthetic data as well as diffusion profiles from magnetic resonance imaging (MRI) data.
arXiv Detail & Related papers (2021-04-27T19:49:46Z) - Densely connected multidilated convolutional networks for dense prediction tasks [25.75557472306157]
We propose a novel CNN architecture called densely connected multidilated DenseNet (D3Net).
D3Net involves a novel multidilated convolution that has different dilation factors in a single layer to model different resolutions simultaneously (see the sketch after this list).
Experiments on the image semantic segmentation task using Cityscapes and the audio source separation task using MUSDB18 show that the proposed method has superior performance over state-of-the-art methods.
arXiv Detail & Related papers (2020-11-21T05:15:12Z) - D3Net: Densely connected multidilated DenseNet for music source separation [25.75557472306157]
Music source separation requires a large input field to model long-term dependencies in an audio signal.
Previous convolutional neural network (CNN)-based approaches address large-input-field modeling by sequentially down- and up-sampling feature maps or by dilated convolution.
We propose a novel CNN architecture called densely connected multidilated DenseNet (D3Net).
D3Net achieves state-of-the-art performance with an average signal-to-distortion ratio (SDR) of 6.01 dB.
arXiv Detail & Related papers (2020-10-05T01:03:08Z)
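For the S3 entry above, which combines a frozen large language model, frozen image/audio encoders, and a trainable modality projector, the following is a minimal sketch of what such a projector can look like. The dimensions, the number of pseudo-tokens, and the name ModalityProjector are illustrative assumptions, not the S3 authors' implementation.

```python
# Hedged sketch of a trainable modality projector bridging a frozen encoder
# and a frozen LLM; all sizes and names are assumptions for illustration.
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Maps frozen modality-encoder features into the LLM's token-embedding space."""
    def __init__(self, enc_dim: int, llm_dim: int, n_tokens: int = 4):
        super().__init__()
        self.n_tokens = n_tokens
        self.llm_dim = llm_dim
        self.proj = nn.Linear(enc_dim, n_tokens * llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, enc_dim) pooled features from a frozen image/audio encoder
        tokens = self.proj(feats).view(feats.size(0), self.n_tokens, self.llm_dim)
        return tokens  # prepended to the text token embeddings fed to the frozen LLM

# toy usage: a 1024-d image feature mapped to 4 pseudo-tokens in a 4096-d LLM space
img_feat = torch.randn(2, 1024)
pseudo_tokens = ModalityProjector(1024, 4096)(img_feat)
print(pseudo_tokens.shape)  # torch.Size([2, 4, 4096])
```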
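For the two D3Net entries above, the sketch below illustrates the multidilated-convolution idea: a single layer whose parallel branches use different dilation factors, so several receptive-field sizes (resolutions) are modeled at once. This is a simplified multi-branch variant for illustration; the channel grouping and dilation set are assumptions rather than the paper's exact design.

```python
# Simplified multi-branch take on a "multidilated" convolution; the D3Net
# paper's actual channel-wise scheme inside DenseNet blocks differs.
import torch
import torch.nn as nn

class MultiDilatedConv2d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, dilations=(1, 2, 4, 8)):
        super().__init__()
        assert out_ch % len(dilations) == 0, "out_ch must split evenly across dilations"
        group_ch = out_ch // len(dilations)
        # one 3x3 branch per dilation factor; padding=d keeps the spatial size
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, group_ch, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # every branch sees the same input at a different dilation; concatenating
        # along channels mixes several resolutions within a single layer
        return torch.cat([branch(x) for branch in self.branches], dim=1)

# quick shape check
y = MultiDilatedConv2d(16, 32)(torch.randn(1, 16, 64, 64))
print(y.shape)  # torch.Size([1, 32, 64, 64])
```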