A Highly Configurable Framework for Large-Scale Thermal Building Data Generation to drive Machine Learning Research
- URL: http://arxiv.org/abs/2512.00483v1
- Date: Sat, 29 Nov 2025 13:31:02 GMT
- Title: A Highly Configurable Framework for Large-Scale Thermal Building Data Generation to drive Machine Learning Research
- Authors: Thomas Krug, Fabian Raisch, Dominik Aimer, Markus Wirnsberger, Ferdinand Sigg, Felix Koch, Benjamin Schäfer, Benjamin Tischler,
- Abstract summary: BuilDa is designed to produce synthetic data of adequate quality and quantity for machine learning (ML) research. It does not require profound building simulation knowledge to generate large volumes of data. We demonstrate BuilDa by generating data and utilizing it for a transfer learning study involving the fine-tuning of 486 data-driven models.
- Score: 22.54521342959957
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data-driven modeling of building thermal dynamics is emerging as an increasingly important field of research for large-scale intelligent building control. However, research in data-driven modeling using machine learning (ML) techniques requires massive amounts of thermal building data, which is not easily available. Neither empirical public datasets nor existing data generators meet the needs of ML research in terms of data quality and quantity. Moreover, existing data generation approaches typically require expert knowledge in building simulation. To fill this gap, we present a thermal building data generation framework which we call BuilDa. BuilDa is designed to produce synthetic data of adequate quality and quantity for ML research. The framework does not require profound building simulation knowledge to generate large volumes of data. BuilDa uses a single-zone Modelica model that is exported as a Functional Mock-up Unit (FMU) and simulated in Python. We demonstrate BuilDa by generating data and utilizing it for a transfer learning study involving the fine-tuning of 486 data-driven models.
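The abstract describes the BuilDa pipeline only at a high level: a single-zone Modelica model, exported as an FMU and simulated in Python. As a rough illustration of what generating thermal data from a single-zone model with varied parameters can look like, the sketch below uses a plain-Python 1R1C (one resistance, one capacitance) model as a stand-in. It does not use Modelica or an FMU, and the function names, parameter ranges, and the outdoor-temperature and heating signals are all assumptions made for illustration, not details taken from BuilDa.

```python
import math
import random


def simulate_zone(r_k_per_w, c_j_per_k, t_out_c, q_heat_w, t0_c=20.0, dt_s=900.0):
    """Explicit-Euler simulation of a hypothetical 1R1C single-zone model.

    Governing equation: C * dT/dt = (T_out - T) / R + Q_heat
    Returns the zone temperature trajectory, including the initial value.
    """
    temps = [t0_c]
    for t_out, q in zip(t_out_c, q_heat_w):
        t = temps[-1]
        d_temp = ((t_out - t) / r_k_per_w + q) / c_j_per_k * dt_s
        temps.append(t + d_temp)
    return temps


def generate_dataset(n_buildings, n_steps, seed=0):
    """Sample building parameters and simulate one trajectory per building.

    The parameter ranges are illustrative assumptions, chosen so that the
    explicit Euler step stays stable (dt_s is well below R * C).
    """
    rng = random.Random(seed)
    dataset = []
    for _ in range(n_buildings):
        r = rng.uniform(0.002, 0.01)  # thermal resistance in K/W (assumed range)
        c = rng.uniform(5e6, 5e7)     # thermal capacitance in J/K (assumed range)
        # Synthetic outdoor temperature: daily sine wave, 96 steps of 15 min per day.
        t_out = [5.0 + 5.0 * math.sin(2.0 * math.pi * k / 96) for k in range(n_steps)]
        # Synthetic heating power: random on/off signal at 2 kW.
        q_heat = [rng.choice([0.0, 2000.0]) for _ in range(n_steps)]
        t_zone = simulate_zone(r, c, t_out, q_heat)
        dataset.append({"R": r, "C": c, "T_out": t_out, "Q_heat": q_heat, "T_zone": t_zone})
    return dataset
```

Calling `generate_dataset(n_buildings=10, n_steps=96)` would return ten one-day trajectories at 15-minute resolution, each with its own sampled R and C. A real FMU-based workflow would replace `simulate_zone` with a call into an FMU simulation library.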
Related papers
- Generative Models for Synthetic Data: Transforming Data Mining in the GenAI Era [49.46005489386284]
This tutorial introduces the foundations and latest advances in synthetic data generation. Attendees will gain actionable insights into leveraging generative synthetic data to enhance data mining research and practice.
arXiv Detail & Related papers (2025-08-27T05:04:07Z)
- BUILDA: A Thermal Building Data Generation Framework for Transfer Learning [26.47874938214435]
Transfer learning can improve data-driven modeling of building thermal dynamics. We present BuilDa, a framework for producing synthetic data of adequate quality and quantity for TL research.
arXiv Detail & Related papers (2025-08-18T08:01:37Z)
- A preliminary data fusion study to assess the feasibility of Foundation Process-Property Models in Laser Powder Bed Fusion [0.0]
A major challenge that impedes the construction of foundation process-property models is data scarcity. We generate experimental datasets from 17-4 PH and 316L stainless steels (SSs) in Laser Powder Bed Fusion (LPBF). We then leverage Gaussian processes (GPs) for process-property modeling in various configurations to test if knowledge about one material system or property can be leveraged to build more accurate machine learning models for other material systems or properties.
arXiv Detail & Related papers (2025-03-20T19:29:38Z)
- DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback [62.235925602004535]
DataEnvGym is a testbed of teacher environments for data generation agents. It frames data generation as a sequential decision-making task, involving an agent and a data generation engine. Students are iteratively trained and evaluated on generated data, and their feedback is reported to the agent after each iteration.
arXiv Detail & Related papers (2024-10-08T17:20:37Z)
- A Benchmark Time Series Dataset for Semiconductor Fabrication Manufacturing Constructed using Component-based Discrete-Event Simulation Models [0.0]
This research is based on a benchmark model of an Intel semiconductor fabrication factory.
The time series dataset is constructed using discrete-event time trajectories.
The dataset can also be utilized in the machine learning community for behavioral analysis.
arXiv Detail & Related papers (2024-08-17T23:05:47Z)
- Enabling High Data Throughput Reinforcement Learning on GPUs: A Domain Agnostic Framework for Data-Driven Scientific Research [90.91438597133211]
We introduce WarpSci, a framework designed to overcome crucial system bottlenecks in the application of reinforcement learning.
We eliminate the need for data transfer between the CPU and GPU, enabling the concurrent execution of thousands of simulations.
arXiv Detail & Related papers (2024-08-01T21:38:09Z)
- Scaling Data-Driven Building Energy Modelling using Large Language Models [3.0309252269809264]
We propose a methodology to tackle the scalability challenges associated with the development of data-driven models for Building Management Systems (BMS).
We use Large Language Models (LLMs) to generate code that processes structured data from BMS and build data-driven models for BMS's specific requirements.
Our case study indicates that bi-sequential prompting under the prompt template can achieve a high success rate of code generation and code accuracy, and significantly reduce human labor costs.
arXiv Detail & Related papers (2024-07-03T19:34:24Z)
- Scalable Diffusion for Materials Generation [99.71001883652211]
We develop a unified crystal representation that can represent any crystal structure (UniMat)
UniMat can generate high fidelity crystal structures from larger and more complex chemical systems.
We propose additional metrics for evaluating generative models of materials.
arXiv Detail & Related papers (2023-10-18T15:49:39Z)
- TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
- Optimizing the AI Development Process by Providing the Best Support Environment [0.756282840161499]
The main stages of machine learning are problem understanding, data management, model building, model deployment, and maintenance.
The framework was built in Python to perform data augmentation using deep learning advancements.
arXiv Detail & Related papers (2023-04-29T00:44:50Z)
- Advancing Reacting Flow Simulations with Data-Driven Models [50.9598607067535]
Key to effective use of machine learning tools in multi-physics problems is to couple them to physical and computer models.
The present chapter reviews some of the open opportunities for the application of data-driven reduced-order modeling of combustion systems.
arXiv Detail & Related papers (2022-09-05T16:48:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.