Benchmarking Simulacra AI's Quantum Accurate Synthetic Data Generation for Chemical Sciences
- URL: http://arxiv.org/abs/2511.07433v1
- Date: Thu, 30 Oct 2025 19:19:56 GMT
- Title: Benchmarking Simulacra AI's Quantum Accurate Synthetic Data Generation for Chemical Sciences
- Authors: Fabio Falcioni, Elena Orlova, Timothy Heightman, Philip Mantrov, Aleksei Ustimenko,
- Abstract summary: We benchmark Simulacra's synthetic data generation pipeline against a state-of-the-art Microsoft pipeline on a dataset of small to large systems. Our findings show that Simulacra's Large Wavefunction Models (LWM) pipeline, paired with state-of-the-art Variational Monte Carlo (VMC) sampling algorithms, reduces data generation costs by 15-50x.
- Score: 9.886732105753714
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we benchmark Simulacra's synthetic data generation pipeline against a state-of-the-art Microsoft pipeline on a dataset of small to large systems. By analyzing the energy quality, autocorrelation times, and effective sample size, our findings show that Simulacra's Large Wavefunction Models (LWM) pipeline, paired with state-of-the-art Variational Monte Carlo (VMC) sampling algorithms, reduces data generation costs by 15-50x relative to the Microsoft pipeline while maintaining parity in energy accuracy, and by 2-3x relative to traditional CCSD methods at the scale of amino acids. This enables the creation of affordable, large-scale ab-initio datasets, accelerating AI-driven optimization and discovery in the pharmaceutical industry and beyond. Our improvements are based on a novel and proprietary sampling scheme called Replica Exchange with Langevin Adaptive eXploration (RELAX).
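The benchmark compares sampling pipelines by integrated autocorrelation time and effective sample size (ESS) of the energy trace. A minimal sketch of how these quantities are commonly estimated for an MCMC chain, using the standard FFT autocovariance plus Sokal's windowing heuristic (the function names and the window constant are illustrative, not taken from the paper):

```python
import numpy as np

def autocorrelation_time(x, c=5.0):
    """Estimate the integrated autocorrelation time of a 1-D chain.

    Uses the FFT to compute the normalized autocovariance, then
    truncates the sum with Sokal's windowing heuristic: stop at the
    first lag M with M >= c * tau(M).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()
    # Autocovariance via zero-padded FFT (linear, not circular).
    f = np.fft.rfft(x, n=2 * n)
    acov = np.fft.irfft(f * np.conj(f))[:n].real / n
    rho = acov / acov[0]
    # tau(M) = 1 + 2 * sum_{t=1..M} rho[t] = 2 * cumsum(rho)[M] - 1.
    tau = 2.0 * np.cumsum(rho) - 1.0
    window = np.arange(n) >= c * tau
    m = np.argmax(window) if window.any() else n - 1
    return tau[m]

def effective_sample_size(x):
    """ESS = N / tau: the number of effectively independent samples."""
    return len(x) / autocorrelation_time(x)

# Illustration: an AR(1) chain with coefficient phi has tau = (1+phi)/(1-phi),
# so phi = 0.9 gives tau ~ 19 and a correspondingly reduced ESS.
rng = np.random.default_rng(0)
iid = rng.normal(size=100_000)
phi, noise = 0.9, rng.normal(size=200_000)
ar = np.empty_like(noise)
ar[0] = noise[0]
for t in range(1, len(noise)):
    ar[t] = phi * ar[t - 1] + noise[t]

print(autocorrelation_time(iid))   # near 1 for independent draws
print(autocorrelation_time(ar))    # near 19 for phi = 0.9
```

A sampler that lowers autocorrelation times (as claimed for the RELAX scheme) raises ESS per unit of compute, which is exactly how a 15-50x cost reduction can coexist with unchanged energy accuracy.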
Related papers
- A Reinforcement Learning Approach to Synthetic Data Generation [8.293402602656736]
We introduce RLSyn, a novel framework that models the data generator as a policy over patient records. We benchmark it against state-of-the-art generative adversarial networks (GANs) and diffusion-based methods across privacy, utility, and fidelity evaluations.
arXiv Detail & Related papers (2025-12-24T19:26:37Z) - Predictive Inorganic Synthesis based on Machine Learning using Small Data sets: a case study of size-controlled Cu Nanoparticles [0.0]
Copper nanoparticles (Cu NPs) have a broad applicability, yet their synthesis is sensitive to subtle changes in reaction parameters. This study explores Machine Learning to predict the size of Cu NPs from microwave-assisted polyol synthesis using a small data set of 25 in-house performed syntheses.
arXiv Detail & Related papers (2025-12-18T13:53:08Z) - GEM+: Scalable State-of-the-Art Private Synthetic Data with Generator Networks [9.432150710329607]
We introduce GEM+, which integrates AIM's adaptive measurement framework with GEM's scalable generator network. Our experiments show that GEM+ outperforms AIM in both utility and scalability, delivering state-of-the-art results.
arXiv Detail & Related papers (2025-11-12T19:18:43Z) - Quantum Synthetic Data Generation for Industrial Bioprocess Monitoring [0.0]
Data scarcity and sparsity in bio-manufacturing pose challenges for accurate model development, process monitoring, and optimization. We propose the use of a Quantum Wasserstein Generative Adversarial Network with Gradient Penalty (QWGAN-GP) to generate synthetic time series data for industrially relevant processes.
arXiv Detail & Related papers (2025-10-20T16:04:39Z) - Scaling Transformer-Based Novel View Synthesis Models with Token Disentanglement and Synthetic Data [53.040873127309766]
We propose a token disentanglement process within the transformer architecture, enhancing feature separation and ensuring more effective learning. Our method outperforms existing models on both in-dataset and cross-dataset evaluations.
arXiv Detail & Related papers (2025-09-08T17:58:06Z) - Scaling Laws of Synthetic Data for Language Models [125.41600201811417]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
arXiv Detail & Related papers (2025-03-25T11:07:12Z) - Ultra-Resolution Adaptation with Ease [62.56434979517156]
We propose a set of key guidelines for ultra-resolution adaptation termed URAE. We show that tuning minor components of the weight matrices outperforms widely-used low-rank adapters when synthetic data are unavailable. Experiments validate that URAE achieves comparable 2K-generation performance to state-of-the-art closed-source models like FLUX1.1 [Pro] Ultra with only 3K samples and 2K iterations.
arXiv Detail & Related papers (2025-03-20T16:44:43Z) - Little Giants: Synthesizing High-Quality Embedding Data at Scale [71.352883755806]
We introduce SPEED, a framework that aligns open-source small models to efficiently generate large-scale embedding data.
SPEED uses less than 1/10 of the GPT API calls, yet outperforms the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data.
arXiv Detail & Related papers (2024-10-24T10:47:30Z) - AI enhanced data assimilation and uncertainty quantification applied to Geological Carbon Storage [0.0]
We introduce the Surrogate-based hybrid ESMDA (SH-ESMDA), an adaptation of the traditional Ensemble Smoother with Multiple Data Assimilation (ESMDA). We also introduce Surrogate-based Hybrid RML (SH-RML), a variational data assimilation approach that relies on the randomized maximum likelihood (RML). Our comparative analyses show that SH-RML offers better uncertainty quantification than conventional ESMDA for the case study.
arXiv Detail & Related papers (2024-02-09T00:24:46Z) - Retrosynthesis prediction enhanced by in-silico reaction data augmentation [66.5643280109899]
We present RetroWISE, a framework that employs a base model inferred from real paired data to perform in-silico reaction generation and augmentation.
On three benchmark datasets, RetroWISE achieves the best overall performance against state-of-the-art models.
arXiv Detail & Related papers (2024-01-31T07:40:37Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - Lamarr: LHCb ultra-fast simulation based on machine learning models deployed within Gauss [0.0]
We discuss Lamarr, a framework to speed-up the simulation production parameterizing both the detector response and the reconstruction algorithms of the LHCb experiment.
Deep Generative Models powered by several algorithms and strategies are employed to effectively parameterize the high-level response of the individual components of the LHCb detector.
arXiv Detail & Related papers (2023-03-20T20:18:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.