AutoSciDACT: Automated Scientific Discovery through Contrastive Embedding and Hypothesis Testing
- URL: http://arxiv.org/abs/2510.21935v1
- Date: Fri, 24 Oct 2025 18:07:50 GMT
- Title: AutoSciDACT: Automated Scientific Discovery through Contrastive Embedding and Hypothesis Testing
- Authors: Samuel Bright-Thonney, Christina Reissel, Gaia Grosso, Nathaniel Woodward, Katya Govorkova, Andrzej Novak, Sang Eon Park, Eric Moreno, Philip Harris,
- Abstract summary: We introduce AutoSciDACT (Automated Scientific Discovery with Anomalous Contrastive Testing), a general-purpose pipeline for detecting novelty in scientific data.<n>We perform experiments across a range of astronomical, physical, biological, image, and synthetic datasets, demonstrating strong sensitivity to small injections of anomalous data.
- Score: 0.3176157258742961
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Novelty detection in large scientific datasets faces two key challenges: the noisy and high-dimensional nature of experimental data, and the necessity of making statistically robust statements about any observed outliers. While there is a wealth of literature on anomaly detection via dimensionality reduction, most methods do not produce outputs compatible with quantifiable claims of scientific discovery. In this work we directly address these challenges, presenting the first step towards a unified pipeline for novelty detection adapted for the rigorous statistical demands of science. We introduce AutoSciDACT (Automated Scientific Discovery with Anomalous Contrastive Testing), a general-purpose pipeline for detecting novelty in scientific data. AutoSciDACT begins by creating expressive low-dimensional data representations using a contrastive pre-training, leveraging the abundance of high-quality simulated data in many scientific domains alongside expertise that can guide principled data augmentation strategies. These compact embeddings then enable an extremely sensitive machine learning-based two-sample test using the New Physics Learning Machine (NPLM) framework, which identifies and statistically quantifies deviations in observed data relative to a reference distribution (null hypothesis). We perform experiments across a range of astronomical, physical, biological, image, and synthetic datasets, demonstrating strong sensitivity to small injections of anomalous data across all domains.
Related papers
- Synthetic-Powered Multiple Testing with FDR Control [29.516221063294157]
We introduce SynthBH, a synthetic-powered multiple testing procedure that safely leverages synthetic data.<n>We prove that SynthBH guarantees finite-sample, distribution-free FDR control under a mild PRDS-type positive dependence condition.<n>It enhances the sample efficiency and may boost the power when synthetic data are of high quality, while controlling the FDR at a user-specified level regardless of their quality.
arXiv Detail & Related papers (2026-02-18T18:36:24Z) - WildSci: Advancing Scientific Reasoning from In-the-Wild Literature [50.16160754134139]
We introduce WildSci, a new dataset of domain-specific science questions automatically synthesized from peer-reviewed literature.<n>By framing complex scientific reasoning tasks in a multiple-choice format, we enable scalable training with well-defined reward signals.<n>Experiments on a suite of scientific benchmarks demonstrate the effectiveness of our dataset and approach.
arXiv Detail & Related papers (2026-01-09T06:35:23Z) - A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers [251.23085679210206]
Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research.<n>This survey reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate.<n>We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge.
arXiv Detail & Related papers (2025-08-28T18:30:52Z) - Can synthetic data reproduce real-world findings in epidemiology? A replication study using tree-based generative AI [0.6268282038459305]
We propose the use of adversarial random forests (ARF) as an efficient and convenient method for synthesizing epidemiological data.<n>We replicated statistical analyses from six epidemiological publications and compared original with synthetic results.<n>Results from multiple synthetic data replications consistently aligned with original findings.
arXiv Detail & Related papers (2025-08-19T22:51:40Z) - Valid Inference with Imperfect Synthetic Data [39.10587411316875]
We introduce a new estimator based on generalized method of moments.<n>We find that interactions between the moment residuals of synthetic data and those of real data can greatly improve estimates of the target parameter.
arXiv Detail & Related papers (2025-08-08T18:32:52Z) - Operationalizing Serendipity: Multi-Agent AI Workflows for Enhanced Materials Characterization with Theory-in-the-Loop [0.0]
SciLink is an open-source, multi-agent artificial intelligence framework designed to operationalize serendipity in materials research.<n>It creates a direct, automated link between experimental observation, novelty assessment, and theoretical simulations.<n>We show its application to atomic-resolution and hyperspectral data, its capacity to integrate real-time human expert guidance, and its ability to close the research loop.
arXiv Detail & Related papers (2025-08-07T04:59:17Z) - We Need Improved Data Curation and Attribution in AI for Scientific Discovery [3.831097744380551]
We examine the role of synthetic data as opposed to that of real experimental data for scientific research.<n>Nearly three-quarters of experimental datasets available on open-access platforms have relatively low adoption rates.<n>We propose supplementing ongoing efforts in automating synthetic data detection by increasing the focus on watermarking real experimental data.
arXiv Detail & Related papers (2025-04-03T11:07:52Z) - Amortized Conditional Independence Testing [6.954510776782872]
ACID is a transformer-based neural network architecture that learns to test for conditional independence.<n>It consistently achieves state-of-the-art performance against existing baselines under multiple metrics.<n>It is able to generalize robustly to unseen sample sizes, dimensionalities, as well as non-linearities with a remarkably low inference time.
arXiv Detail & Related papers (2025-02-28T10:29:56Z) - Understanding complex crowd dynamics with generative neural simulators [43.02251339321427]
We use our data-driven Neural Crowd Simulator (NeCS) to train on large-scale data and validate against key statistical features of crowd dynamics.<n>We show that we can perform effective surrogate crowd dynamics experiments without training on specific scenarios.<n>We also uncover the vision-guided and topological nature of N-body interactions.
arXiv Detail & Related papers (2024-12-02T13:42:36Z) - Discovering physical laws with parallel symbolic enumeration [67.36739393470869]
We introduce parallel symbolic enumeration (PSE) to efficiently distill generic mathematical expressions from limited data.<n>Experiments show that PSE achieves higher accuracy and faster computation compared to the state-of-the-art baseline algorithms.<n> PSE represents an advance in accurate and efficient data-driven discovery of symbolic, interpretable models.
arXiv Detail & Related papers (2024-07-05T10:41:15Z) - ChemVise: Maximizing Out-of-Distribution Chemical Detection with the
Novel Application of Zero-Shot Learning [60.02503434201552]
This research proposes learning approximations of complex exposures from training sets of simple ones.
We demonstrate this approach to synthetic sensor responses surprisingly improves the detection of out-of-distribution obscured chemical analytes.
arXiv Detail & Related papers (2023-02-09T20:19:57Z) - GFlowNets for AI-Driven Scientific Discovery [74.27219800878304]
We present a new probabilistic machine learning framework called GFlowNets.
GFlowNets can be applied in the modeling, hypotheses generation and experimental design stages of the experimental science loop.
We argue that GFlowNets can become a valuable tool for AI-driven scientific discovery.
arXiv Detail & Related papers (2023-02-01T17:29:43Z) - Trust Your $\
abla$: Gradient-based Intervention Targeting for Causal Discovery [49.084423861263524]
In this work, we propose a novel Gradient-based Intervention Targeting method, abbreviated GIT.
GIT 'trusts' the gradient estimator of a gradient-based causal discovery framework to provide signals for the intervention acquisition function.
We provide extensive experiments in simulated and real-world datasets and demonstrate that GIT performs on par with competitive baselines.
arXiv Detail & Related papers (2022-11-24T17:04:45Z) - Learning Neural Causal Models with Active Interventions [83.44636110899742]
We introduce an active intervention-targeting mechanism which enables a quick identification of the underlying causal structure of the data-generating process.
Our method significantly reduces the required number of interactions compared with random intervention targeting.
We demonstrate superior performance on multiple benchmarks from simulated to real-world data.
arXiv Detail & Related papers (2021-09-06T13:10:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.