Parametric-ControlNet: Multimodal Control in Foundation Models for Precise Engineering Design Synthesis
- URL: http://arxiv.org/abs/2412.04707v1
- Date: Fri, 06 Dec 2024 01:40:10 GMT
- Title: Parametric-ControlNet: Multimodal Control in Foundation Models for Precise Engineering Design Synthesis
- Authors: Rui Zhou, Yanxia Zhang, Chenyang Yuan, Frank Permenter, Nikos Arechiga, Matt Klenk, Faez Ahmed
- Abstract summary: This paper introduces a generative model designed for multimodal control over text-to-image foundation generative AI models such as Stable Diffusion. Our model proposes parametric, image, and text control modalities to enhance design precision and diversity.
- Score: 9.900586490845694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a generative model designed for multimodal control over text-to-image foundation generative AI models such as Stable Diffusion, specifically tailored for engineering design synthesis. Our model proposes parametric, image, and text control modalities to enhance design precision and diversity. Firstly, it handles both partial and complete parametric inputs using a diffusion model that acts as a design autocomplete co-pilot, coupled with a parametric encoder to process the information. Secondly, the model utilizes assembly graphs to systematically assemble input component images, which are then processed through a component encoder to capture essential visual data. Thirdly, textual descriptions are integrated via CLIP encoding, ensuring a comprehensive interpretation of design intent. These diverse inputs are synthesized through a multimodal fusion technique, creating a joint embedding that acts as the input to a module inspired by ControlNet. This integration allows the model to apply robust multimodal control to foundation models, facilitating the generation of complex and precise engineering designs. This approach broadens the capabilities of AI-driven design tools and demonstrates significant advancements in precise control based on diverse data modalities for enhanced design generation.
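As a rough illustration of the pipeline described in the abstract, the sketch below fuses parametric, component-image, and text embeddings into a single joint conditioning vector. All module names, dimensions, and the attention-based fusion are illustrative assumptions rather than the authors' implementation, and the ControlNet-style module that consumes the joint embedding is omitted.

```python
# Minimal sketch (assumptions, not the paper's code): fuse parametric, component-image,
# and CLIP text embeddings into one joint embedding for a ControlNet-style branch.
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    def __init__(self, param_dim=32, img_feat_dim=512, text_dim=768, joint_dim=768):
        super().__init__()
        # Parametric encoder: accepts a zero-filled parameter vector plus an observation
        # mask, so both partial and complete parametric inputs can be handled.
        self.param_encoder = nn.Sequential(
            nn.Linear(param_dim * 2, 256), nn.SiLU(), nn.Linear(256, joint_dim))
        # Component encoder: maps pooled features of the assembled component image
        # (e.g. from a frozen vision backbone) into the joint space.
        self.component_encoder = nn.Sequential(
            nn.Linear(img_feat_dim, 256), nn.SiLU(), nn.Linear(256, joint_dim))
        # Text branch: CLIP text embeddings are assumed to be precomputed upstream.
        self.text_proj = nn.Linear(text_dim, joint_dim)
        # Simple fusion: self-attention over the three modality tokens.
        self.fusion = nn.MultiheadAttention(joint_dim, num_heads=8, batch_first=True)

    def forward(self, params, param_mask, component_feats, clip_text_emb):
        tokens = torch.stack([
            self.param_encoder(torch.cat([params * param_mask, param_mask], dim=-1)),
            self.component_encoder(component_feats),
            self.text_proj(clip_text_emb),
        ], dim=1)                                   # (B, 3, joint_dim)
        fused, _ = self.fusion(tokens, tokens, tokens)
        return fused.mean(dim=1)                    # joint embedding for the ControlNet-style module

# Toy usage with random stand-in features:
fuser = MultimodalFusion()
joint = fuser(torch.randn(2, 32), torch.ones(2, 32),
              torch.randn(2, 512), torch.randn(2, 768))
print(joint.shape)  # torch.Size([2, 768])
```

Per the abstract, this joint embedding then conditions a ControlNet-inspired module attached to the Stable Diffusion backbone, which applies the multimodal control during generation.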
Related papers
- LOCOFY Large Design Models -- Design to code conversion solution [0.0]
We introduce the Large Design Models paradigm specifically trained on designs and webpages to enable seamless conversion from design-to-code. We have developed a training and inference pipeline by incorporating data engineering and appropriate model architecture modification. Our models illustrated exceptional end-to-end design-to-code conversion accuracy using a novel preview match score metric.
arXiv Detail & Related papers (2025-07-22T03:54:57Z) - Structural Similarity-Inspired Unfolding for Lightweight Image Super-Resolution [88.20464308588889]
We propose a Structural Similarity-Inspired Unfolding (SSIU) method for efficient image SR. This method is designed through unfolding an SR optimization function constrained by structural similarity. Our model outperforms current state-of-the-art models, boasting lower parameter counts and reduced memory consumption.
arXiv Detail & Related papers (2025-06-13T14:29:40Z) - CreatiDesign: A Unified Multi-Conditional Diffusion Transformer for Creative Graphic Design [69.83433430133302]
CreatiDesign is a systematic solution for automated graphic design covering both model architecture and dataset construction. First, we design a unified multi-condition driven architecture that enables flexible and precise integration of heterogeneous design elements. Furthermore, to ensure that each condition precisely controls its designated image region, we propose a multimodal attention mask mechanism.
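The region-control idea can be illustrated with a small, hedged sketch: build an attention mask so that each condition's tokens only interact with the image tokens of its assigned region. The function below is a generic construction under that assumption, not CreatiDesign's actual mechanism.

```python
# Hedged sketch (not CreatiDesign's code): restrict attention so each condition token
# only interacts with the image tokens of the region it is meant to control.
import torch

def region_attention_mask(region_ids, cond_regions):
    """
    region_ids:   (N_img,)  region index assigned to each image token
    cond_regions: (N_cond,) region index each condition token controls
    Returns an (N_img + N_cond, N_img + N_cond) boolean mask where True = attention allowed.
    """
    n_img, n_cond = region_ids.numel(), cond_regions.numel()
    n = n_img + n_cond
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n_img, :n_img] = True                              # image tokens attend to all image tokens
    match = region_ids[:, None] == cond_regions[None, :]     # (N_img, N_cond) region agreement
    mask[:n_img, n_img:] = match                             # image tokens see only their region's conditions
    mask[n_img:, :n_img] = match.T                           # condition tokens see only their region's image tokens
    mask[n_img:, n_img:] = torch.eye(n_cond, dtype=torch.bool)  # conditions do not mix with each other
    return mask

# Example: a 4-token image split into regions 0/1, with one condition token per region.
m = region_attention_mask(torch.tensor([0, 0, 1, 1]), torch.tensor([0, 1]))
print(m.int())
```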
arXiv Detail & Related papers (2025-05-25T12:14:23Z) - Advancing vision-language models in front-end development via data synthesis [30.287628180320137]
We propose a reflective agentic workflow that synthesizes high-quality image-text data to capture the diverse characteristics of front-end development.
This workflow automates the extraction of self-contained code snippets from real-world projects, renders the corresponding visual outputs, and generates detailed descriptions that link design elements to functional code.
We build a large vision-language model, Flame, trained on the synthesized datasets and demonstrate its effectiveness in generating React code via the pass@k metric.
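For reference, pass@k is usually computed with the unbiased estimator introduced alongside HumanEval; a minimal version is sketched below. This is the conventional definition, and the paper's exact evaluation script may differ.

```python
# Standard unbiased pass@k estimator (Codex/HumanEval convention); shown for reference,
# not taken from the Flame paper's evaluation code.
import numpy as np

def pass_at_k(n, c, k):
    """n: total samples generated, c: samples that pass the tests, k: evaluation budget."""
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed in a numerically stable product form.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples, 30 of which pass -> probability that at least one of 10 drawn samples passes.
print(round(pass_at_k(200, 30, 10), 3))
```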
arXiv Detail & Related papers (2025-03-03T14:54:01Z) - OminiControl: Minimal and Universal Control for Diffusion Transformer [68.3243031301164]
OminiControl is a framework that integrates image conditions into pre-trained Diffusion Transformer (DiT) models. At its core, OminiControl leverages a parameter reuse mechanism, enabling the DiT to encode image conditions using itself as a powerful backbone. OminiControl addresses a wide range of image conditioning tasks in a unified manner, including subject-driven generation and spatially-aligned conditions.
arXiv Detail & Related papers (2024-11-22T17:55:15Z) - ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer [40.32254040909614]
We propose ACE, an All-round Creator and Editor, for visual generation tasks.
We first introduce a unified condition format termed Long-context Condition Unit (LCU).
We then propose a novel Transformer-based diffusion model that uses LCU as input, aiming for joint training across various generation and editing tasks.
arXiv Detail & Related papers (2024-09-30T17:56:27Z) - Bridging Design Gaps: A Parametric Data Completion Approach With Graph Guided Diffusion Models [9.900586490845694]
This study introduces a generative imputation model leveraging graph attention networks and tabular diffusion models for completing missing parametric data in engineering designs.
We demonstrate that our model significantly outperforms existing classical methods, such as MissForest, hotDeck, PPCA, and TabCSDI, in both the accuracy and diversity of imputation options.
The graph model helps accurately capture and impute complex parametric interdependencies from an assembly graph, which is key for design problems.
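A heavily simplified sketch of the graph side of such a model is shown below: a graph attention encoder over the assembly graph followed by a plain regression head that fills in missing parameters. The actual model pairs the graph encoder with a tabular diffusion model; that part, and all names and dimensions here, are assumptions for illustration. The sketch requires torch_geometric.

```python
# Hedged sketch (not the paper's implementation): encode an assembly graph with graph
# attention so each component node's embedding reflects its neighbors, then predict the
# missing parameter values from those embeddings (diffusion replaced by a regression head).
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv

class GraphImputer(nn.Module):
    def __init__(self, param_dim, hidden=64, heads=4):
        super().__init__()
        # Input per node = observed parameters (missing entries zero-filled) + observation mask.
        self.gat1 = GATConv(param_dim * 2, hidden, heads=heads)
        self.gat2 = GATConv(hidden * heads, hidden, heads=1)
        self.head = nn.Linear(hidden, param_dim)   # predicted values for every parameter

    def forward(self, params, mask, edge_index):
        x = torch.cat([params * mask, mask], dim=-1)
        x = torch.relu(self.gat1(x, edge_index))
        x = torch.relu(self.gat2(x, edge_index))
        pred = self.head(x)
        # Keep observed entries and only fill in the missing ones.
        return mask * params + (1 - mask) * pred

# Toy usage: 3 components in a chain, 5 parameters each, some entries missing.
edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]])
params, mask = torch.randn(3, 5), torch.randint(0, 2, (3, 5)).float()
print(GraphImputer(5)(params, mask, edge_index).shape)  # torch.Size([3, 5])
```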
arXiv Detail & Related papers (2024-06-17T16:03:17Z) - Text2VP: Generative AI for Visual Programming and Parametric Modeling [6.531561475204309]
This study creates and investigates an innovative application of generative AI in parametric modeling by leveraging a customized Text-to-Visual Programming (Text2VP) GPT derived from GPT-4.
The primary focus is on automating the generation of graph-based visual programming, including parameters and the links among the parameters, through AI-generated scripts.
Our testing demonstrates Text2VP's capability to generate working parametric models.
arXiv Detail & Related papers (2024-06-09T02:22:20Z) - Generative Design through Quality-Diversity Data Synthesis and Language Models [5.196236145367301]
Two fundamental challenges face generative models in engineering applications: the acquisition of high-performing, diverse datasets, and the adherence to precise constraints in generated designs.
We propose a novel approach combining optimization, constraint satisfaction, and language models to tackle these challenges in architectural design.
arXiv Detail & Related papers (2024-05-16T11:30:08Z) - Compositional Generative Inverse Design [69.22782875567547]
Inverse design, where we seek to design input variables in order to optimize an underlying objective function, is an important problem.
We show that optimizing over the learned energy function captured by the diffusion model, rather than directly optimizing the design against a learned forward model, avoids the adversarial examples that such direct optimization tends to produce.
In an N-body interaction task and a challenging 2D multi-airfoil design task, we demonstrate that by composing the learned diffusion model at test time, our method allows us to design initial states and boundary shapes.
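The core idea, optimizing a design against a sum of learned energy terms rather than a surrogate objective directly, can be sketched generically as below. The energy functions here are toy quadratics standing in for learned diffusion-model energies, and the optimizer choice is an assumption, not the authors' procedure.

```python
# Hedged sketch of the general idea (not the paper's code): treat trained models as
# composable energy terms and optimize the design by gradient descent on their sum.
import torch

def compositional_inverse_design(energy_fns, x_init, steps=200, lr=1e-2):
    """energy_fns: list of callables mapping a design tensor to a scalar energy (lower is better)."""
    x = x_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        total = sum(fn(x) for fn in energy_fns)   # composition = summing the learned energies
        total.backward()
        opt.step()
    return x.detach()

# Toy usage with quadratic stand-ins for learned energies:
e1 = lambda x: ((x - 1.0) ** 2).sum()
e2 = lambda x: ((x + 0.5) ** 2).sum()
x_star = compositional_inverse_design([e1, e2], torch.zeros(8))
print(x_star.mean())  # approaches 0.25, the minimizer of the summed toy energies
```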
arXiv Detail & Related papers (2024-01-24T01:33:39Z) - Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z) - SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation [89.47132156950194]
We present a novel framework built to simplify 3D asset generation for amateur users.
Our method supports a variety of input modalities that can be easily provided by a human.
Our model can combine all these tasks into one swiss-army-knife tool.
arXiv Detail & Related papers (2022-12-08T18:59:05Z) - Unifying Flow, Stereo and Depth Estimation [121.54066319299261]
We present a unified formulation and model for three motion and 3D perception tasks.
We formulate all three tasks as a unified dense correspondence matching problem.
Our model naturally enables cross-task transfer since the model architecture and parameters are shared across tasks.
arXiv Detail & Related papers (2022-11-10T18:59:54Z) - OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation [52.037766778458504]
We propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation.
OPT is constructed in an encoder-decoder framework, including three single-modal encoders to generate token-based embeddings for each modality.
OPT can learn strong image-text-audio multi-modal representations and achieve promising results on a variety of cross-modal understanding and generation tasks.
arXiv Detail & Related papers (2021-07-01T06:59:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.