MultiMat: Multimodal Program Synthesis for Procedural Materials using Large Multimodal Models
- URL: http://arxiv.org/abs/2509.22151v1
- Date: Fri, 26 Sep 2025 10:10:25 GMT
- Title: MultiMat: Multimodal Program Synthesis for Procedural Materials using Large Multimodal Models
- Authors: Jonas Belouadi, Tamy Boubekeur, Adrien Kaiser,
- Abstract summary: We present MultiMat, a program synthesis framework that processes both visual and textual graph representations.<n>We train our models on a new dataset of production-quality procedural materials and combine them with a constrained tree search inference algorithm.<n>Our experimental results show that our multimodal program synthesis method is more efficient in both unconditional and conditional graph synthesis.
- Score: 9.025489303067008
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Material node graphs are programs that generate the 2D channels of procedural materials, including geometry such as roughness and displacement maps, and reflectance such as albedo and conductivity maps. They are essential in computer graphics for representing the appearance of virtual 3D objects parametrically and at arbitrary resolution. In particular, their directed acyclic graph structures and intermediate states provide an intuitive understanding and workflow for interactive appearance modeling. Creating such graphs is a challenging task and typically requires professional training. While recent neural program synthesis approaches attempt to simplify this process, they solely represent graphs as textual programs, failing to capture the inherently visual-spatial nature of node graphs that makes them accessible to humans. To address this gap, we present MultiMat, a multimodal program synthesis framework that leverages large multimodal models to process both visual and textual graph representations for improved generation of procedural material graphs. We train our models on a new dataset of production-quality procedural materials and combine them with a constrained tree search inference algorithm that ensures syntactic validity while efficiently navigating the program space. Our experimental results show that our multimodal program synthesis method is more efficient in both unconditional and conditional graph synthesis with higher visual quality and fidelity than text-only baselines, establishing new state-of-the-art performance.
Related papers
- Bridging Visualization and Optimization: Multimodal Large Language Models on Graph-Structured Combinatorial Optimization [56.17811386955609]
Graph-structured challenges are inherently difficult due to their nonlinear and intricate nature.<n>In this study, we propose transforming graphs into images to preserve their higher-order structural features accurately.<n>By combining the innovative paradigm powered by multimodal large language models with simple search techniques, we aim to develop a novel and effective framework.
arXiv Detail & Related papers (2025-01-21T08:28:10Z) - Instance-Aware Graph Prompt Learning [71.26108600288308]
We introduce Instance-Aware Graph Prompt Learning (IA-GPL) in this paper.
The process involves generating intermediate prompts for each instance using a lightweight architecture.
Experiments conducted on multiple datasets and settings showcase the superior performance of IA-GPL compared to state-of-the-art baselines.
arXiv Detail & Related papers (2024-11-26T18:38:38Z) - InstructLayout: Instruction-Driven 2D and 3D Layout Synthesis with Semantic Graph Prior [23.536285325566013]
Comprehending natural language instructions is a charming property for both 2D and 3D layout synthesis systems.<n>Existing methods implicitly model object joint distributions and express object relations, hindering generation's controllability synthesis systems.<n>We introduce Instruct, a novel generative framework that integrates a semantic graph prior and a layout decoder.
arXiv Detail & Related papers (2024-07-10T12:13:39Z) - Message Detouring: A Simple Yet Effective Cycle Representation for
Expressive Graph Learning [4.085624738017079]
We introduce the concept of textitmessage detouring to hierarchically characterize cycle representation throughout the entire graph.
Message detouring can significantly outperform current counterpart approaches on various benchmark datasets.
arXiv Detail & Related papers (2024-02-12T22:06:37Z) - When Graph Data Meets Multimodal: A New Paradigm for Graph Understanding
and Reasoning [54.84870836443311]
The paper presents a new paradigm for understanding and reasoning about graph data by integrating image encoding and multimodal technologies.
This approach enables the comprehension of graph data through an instruction-response format, utilizing GPT-4V's advanced capabilities.
The study evaluates this paradigm on various graph types, highlighting the model's strengths and weaknesses, particularly in Chinese OCR performance and complex reasoning tasks.
arXiv Detail & Related papers (2023-12-16T08:14:11Z) - Sparse Graphical Linear Dynamical Systems [1.6635799895254402]
Time-series datasets are central in machine learning with applications in numerous fields of science and engineering.
This work proposes a novel approach to bridge the gap by introducing a joint graphical modeling framework.
We present DGLASSO, a new inference method within this framework that implements an efficient block alternating majorization-minimization algorithm.
arXiv Detail & Related papers (2023-07-06T14:10:02Z) - GM-NeRF: Learning Generalizable Model-based Neural Radiance Fields from
Multi-view Images [79.39247661907397]
We introduce an effective framework Generalizable Model-based Neural Radiance Fields to synthesize free-viewpoint images.
Specifically, we propose a geometry-guided attention mechanism to register the appearance code from multi-view 2D images to a geometry proxy.
arXiv Detail & Related papers (2023-03-24T03:32:02Z) - Template based Graph Neural Network with Optimal Transport Distances [11.56532171513328]
Current Graph Neural Networks (GNN) architectures rely on two important components: node features embedding through message passing, and aggregation with a specialized form of pooling.
We propose in this work a novel point of view, which places distances to some learnable graph templates at the core of the graph representation.
This distance embedding is constructed thanks to an optimal transport distance: the Fused Gromov-Wasserstein (FGW) distance.
arXiv Detail & Related papers (2022-05-31T12:24:01Z) - mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal
Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z) - A Robust and Generalized Framework for Adversarial Graph Embedding [73.37228022428663]
We propose a robust framework for adversarial graph embedding, named AGE.
AGE generates the fake neighbor nodes as the enhanced negative samples from the implicit distribution.
Based on this framework, we propose three models to handle three types of graph data.
arXiv Detail & Related papers (2021-05-22T07:05:48Z) - Comparison of Syntactic and Semantic Representations of Programs in
Neural Embeddings [1.0878040851638]
It compares graph convolutional networks using different graph representations in the task of program embedding.
It shows that the sparsity of control flow graphs and the implicit aggregation of graph convolutional networks cause these models to perform worse than naive models.
arXiv Detail & Related papers (2020-01-24T21:30:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.