CMT: A Cascade MAR with Topology Predictor for Multimodal Conditional CAD Generation
- URL: http://arxiv.org/abs/2504.20830v1
- Date: Tue, 29 Apr 2025 14:52:28 GMT
- Title: CMT: A Cascade MAR with Topology Predictor for Multimodal Conditional CAD Generation
- Authors: Jianyu Wu, Yizhou Wang, Xiangyu Yue, Xinzhu Ma, Jingyang Guo, Dongzhan Zhou, Wanli Ouyang, Shixiang Tang
- Abstract summary: We propose a cascade MAR with topology predictor (CMT), the first multimodal framework for CAD generation based on Boundary Representation (B-Rep). Specifically, the cascade MAR can effectively capture the "edge-counters-surface" priors that are essential in B-Reps. We develop a large-scale multimodal CAD dataset, mmABC, which includes over 1.3 million B-Rep models with multimodal annotations.
- Score: 59.76687657887415
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While accurate and user-friendly Computer-Aided Design (CAD) is crucial for industrial design and manufacturing, existing methods still struggle to achieve this due to their over-simplified representations or architectures incapable of supporting multimodal design requirements. In this paper, we tackle this problem from both the method and dataset perspectives. First, we propose a cascade MAR with topology predictor (CMT), the first multimodal framework for CAD generation based on Boundary Representation (B-Rep). Specifically, the cascade MAR can effectively capture the "edge-counters-surface" priors that are essential in B-Reps, while the topology predictor directly estimates the topology of a B-Rep from the compact tokens in the MAR. Second, to facilitate large-scale training, we develop a large-scale multimodal CAD dataset, mmABC, which includes over 1.3 million B-Rep models with multimodal annotations spanning point clouds, text descriptions, and multi-view images. Extensive experiments show the superiority of CMT in both conditional and unconditional CAD generation tasks. For example, compared to state-of-the-art methods on ABC, we improve Coverage and Valid ratio by +10.68% and +10.3%, respectively, in unconditional generation, and CMT improves Chamfer distance by +4.01 on image-conditioned CAD generation on mmABC. The dataset, code, and pretrained networks will be released.
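To make the cascade idea concrete, below is a deliberately toy sketch of edge-before-surface masked-autoregressive decoding with a pairwise topology head. Everything here (the token shapes, the random stand-in for token prediction, and the inner-product adjacency scoring) is invented for illustration; it is not the paper's architecture, only the control flow the abstract describes.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_ar_step(tokens, mask):
    """One toy masked-autoregressive step: 'predict' (here, randomly fill)
    half of the still-masked token slots. A real MAR would condition this
    prediction on the multimodal inputs and on all tokens decoded so far."""
    idx = np.flatnonzero(mask)
    fill = rng.choice(idx, size=max(1, len(idx) // 2), replace=False)
    tokens[fill] = rng.normal(size=(len(fill), tokens.shape[1]))
    mask[fill] = False
    return tokens, mask

# Stage 1: decode compact edge tokens.
edges = np.zeros((32, 16))
edge_mask = np.ones(32, dtype=bool)
while edge_mask.any():
    edges, edge_mask = masked_ar_step(edges, edge_mask)

# Stage 2: decode surface tokens; in the cascade, this stage would be
# conditioned on the finished edge tokens (the edge-before-surface prior).
surfaces = np.zeros((16, 16))
surf_mask = np.ones(16, dtype=bool)
while surf_mask.any():
    surfaces, surf_mask = masked_ar_step(surfaces, surf_mask)

# Topology head: score every (edge, surface) pair directly from the compact
# tokens, rather than decoding adjacency token by token.
adjacency = (edges @ surfaces.T) > 0.0  # boolean edge-surface incidence
print(adjacency.shape)                  # (32, 16)
```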
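The Chamfer figure quoted above comes from the standard point-cloud metric: sample points from the generated and reference shapes, then average nearest-neighbor distances in both directions. A minimal NumPy sketch follows; the exact variant the authors use (squared vs. unsquared distances, normalization) may differ.

```python
import numpy as np

def chamfer_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric Chamfer distance between two point clouds.

    p: (N, 3) points sampled from the generated shape.
    q: (M, 3) points sampled from the reference shape.
    """
    # Pairwise squared distances, shape (N, M).
    d2 = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)
    # Nearest-neighbor term in each direction, summed.
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

# Toy usage: random clouds stand in for points sampled from B-Rep surfaces.
gen = np.random.default_rng(0).random((1024, 3))
ref = np.random.default_rng(1).random((1024, 3))
print(chamfer_distance(gen, ref))
```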
Related papers
- HoLa: B-Rep Generation using a Holistic Latent Representation [51.07878285790399]
We introduce a novel representation for learning and generating Computer-Aided Design (CAD) models in the form of boundary representations (B-Reps). Our representation unifies the continuous geometric properties of B-Rep primitives in different orders. Our method significantly reduces ambiguities, redundancies, and incoherences among the generated B-Rep primitives.
arXiv Detail & Related papers (2025-04-19T10:34:24Z)
- Hierarchical and Step-Layer-Wise Tuning of Attention Specialty for Multi-Instance Synthesis in Diffusion Transformers [22.269573676129152]
Text-to-image (T2I) generation models often struggle with multi-instance synthesis (MIS). Traditional MIS control methods for UNet architectures fail to adapt to DiT-based models. We propose a training-free approach for enhancing MIS in DiT-based models.
arXiv Detail & Related papers (2025-04-14T11:59:58Z)
- Multimodal Task Representation Memory Bank vs. Catastrophic Forgetting in Anomaly Detection [6.991692485111346]
Unsupervised Continuous Anomaly Detection (UCAD) faces significant challenges in multi-task representation learning.
We propose the Multimodal Task Representation Memory Bank (MTRMB) method through two key technical innovations.
Experiments on the MVTec AD and VisA datasets demonstrate MTRMB's superiority, achieving an average detection accuracy of 0.921 at the lowest forgetting rate.
arXiv Detail & Related papers (2025-02-10T06:49:54Z)
- TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action [103.5952731807559]
We present TACO, a family of multi-modal large action models designed to improve performance on complex, multi-step, and multi-modal tasks. During inference, TACO produces chains-of-thought-and-action (CoTA) and executes intermediate steps by invoking external tools such as OCR, depth estimation, and a calculator. This dataset enables TACO to learn complex reasoning and action paths, surpassing existing models trained on instruction-tuning data with only direct answers.
arXiv Detail & Related papers (2024-12-07T00:42:04Z)
- CAD-MLLM: Unifying Multimodality-Conditioned CAD Generation With MLLM [39.113795259823476]
We introduce CAD-MLLM, the first system capable of generating parametric CAD models conditioned on multimodal input. We use advanced large language models (LLMs) to align the feature space across diverse multimodal data and CAD models' vectorized representations. Our resulting dataset, named Omni-CAD, is the first multimodal CAD dataset that contains a textual description, multi-view images, points, and a command sequence for each CAD model.
arXiv Detail & Related papers (2024-11-07T18:31:08Z)
- Noise-powered Multi-modal Knowledge Graph Representation Framework [52.95468915728721]
The rise of Multi-modal Pre-training highlights the necessity for a unified Multi-Modal Knowledge Graph representation learning framework.
We propose a novel SNAG method that utilizes a Transformer-based architecture equipped with modality-level noise masking.
Our approach achieves SOTA performance across a total of ten datasets, demonstrating its versatility.
arXiv Detail & Related papers (2024-03-11T15:48:43Z)
- Towards Cross-Table Masked Pretraining for Web Data Mining [22.952238405240188]
We propose an innovative, generic, and efficient cross-table pretraining framework, dubbed CM2.
Our experiments demonstrate CM2's state-of-the-art performance and validate that cross-table pretraining can enhance various downstream tasks.
arXiv Detail & Related papers (2023-07-10T02:27:38Z)
- CoMAE: Single Model Hybrid Pre-training on Small-Scale RGB-D Datasets [50.6643933702394]
We present a single-model self-supervised hybrid pre-training framework for RGB and depth modalities, termed CoMAE.
Our CoMAE presents a curriculum learning strategy to unify the two popular self-supervised representation learning algorithms: contrastive learning and masked image modeling.
arXiv Detail & Related papers (2023-02-13T07:09:45Z)
- MMTM: Multi-Tasking Multi-Decoder Transformer for Math Word Problems [0.0]
We present a novel model, MMTM, that leverages multi-tasking and multiple decoders during pre-training.
The MMTM model achieves better mathematical reasoning ability and generalisability.
We demonstrate this by outperforming the best state-of-the-art baseline models from Seq2Seq, GTS, and Graph2Tree, with a relative improvement of 19.4% on the adversarial challenge dataset SVAMP.
arXiv Detail & Related papers (2022-06-02T19:48:36Z)
- Multi-Stage Progressive Image Restoration [167.6852235432918]
We propose a novel synergistic design that can optimally balance these competing goals.
Our main proposal is a multi-stage architecture that progressively learns restoration functions for the degraded inputs.
The resulting tightly interlinked multi-stage architecture, named MPRNet, delivers strong performance gains on ten datasets.
arXiv Detail & Related papers (2021-02-04T18:57:07Z)