Large Language Models for Limited Noisy Data: A Gravitational Wave Identification Study
- URL: http://arxiv.org/abs/2512.04031v1
- Date: Wed, 03 Dec 2025 18:13:01 GMT
- Title: Large Language Models for Limited Noisy Data: A Gravitational Wave Identification Study
- Authors: Yixuan Li, Yuhao Lu, Yang Liu, Liang Li, R. Ruffini, Di Li, Rong-Gen Cai, Xiaoyan Zhu, Wenbin Lin, Yu Wang,
- Abstract summary: This work investigates whether large language models (LLMs) offer advantages over traditional neural networks for astronomical data processing. Using only 90 LIGO events, finetuned LLMs achieve 97.4% accuracy for identifying signals. The same strategy may extend to other astronomical domains with similar noise properties, such as radio or pulsar observations.
- Score: 26.44274425955736
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work investigates whether large language models (LLMs) offer advantages over traditional neural networks for astronomical data processing, in regimes with non-Gaussian, non-stationary noise and limited labeled samples. Gravitational wave observations provide a suitable test case: using only 90 LIGO events, finetuned LLMs achieve 97.4\% accuracy for identifying signals. Further experiments show that, in contrast to traditional networks that rely on large simulated datasets, additional simulated samples do not improve LLM performance, while scaling studies reveal predictable gains with increasing model size and dataset size. These results indicate that LLMs can extract discriminative structure directly from observational data and provide an efficient assessment for gravitational wave identification. The same strategy may extend to other astronomical domains with similar noise properties, such as radio or pulsar observations.
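The paper does not publish its preprocessing pipeline in this abstract, so as a purely illustrative sketch (function name, bin count, and serialization scheme are all assumptions), one simple way a whitened strain segment could be discretized into tokens for a text-based model is:

```python
import numpy as np

def strain_to_tokens(strain, n_levels=16):
    """Discretize a whitened strain segment into a space-separated token string.

    Hypothetical preprocessing, not the paper's method: it only illustrates
    the idea of serializing a 1-D time series for an LLM-style classifier.
    """
    # Standardize the segment (detector whitening is assumed done upstream).
    s = (strain - strain.mean()) / (strain.std() + 1e-12)
    # Clip to +/-3 sigma and map onto n_levels uniform amplitude bins.
    s = np.clip(s, -3.0, 3.0)
    tokens = np.floor((s + 3.0) / 6.0 * (n_levels - 1)).astype(int)
    return " ".join(str(t) for t in tokens)

# Example: a noisy chirp-like segment standing in for a candidate event.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 256)
segment = np.sin(2 * np.pi * 30 * t**2) + 0.5 * rng.standard_normal(t.size)
prompt = strain_to_tokens(segment)
```

Each segment then becomes a short token sequence that a finetuned model can label as signal or noise; with only 90 labeled events, the heavy lifting is done by the pretrained model rather than the dataset.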
Related papers
- Data-Efficient Learning of Anomalous Diffusion with Wavelet Representations: Enabling Direct Learning from Experimental Trajectories [5.086421870787772]
We introduce a wavelet-based representation of anomalous diffusion that enables data-efficient learning directly from experimental recordings. We first evaluate the wavelet representation on simulated trajectories from the andi-datasets benchmark. We then use this representation to learn directly from experimental SPT trajectories of fluorescent beads diffusing in F-actin networks.
arXiv Detail & Related papers (2025-12-09T11:52:23Z)
- Simulation-Based Pretraining and Domain Adaptation for Astronomical Time Series with Minimal Labeled Data [0.12744523252873352]
We present a pre-training approach that leverages simulations, significantly reducing the need for labeled examples from real observations. Our models, trained on simulated data from multiple astronomical surveys (ZTF and LSST), learn generalizable representations that transfer effectively to downstream tasks. Remarkably, our models exhibit effective zero-shot transfer capabilities, achieving comparable performance on future telescope (LSST) simulations when trained solely on existing telescope (ZTF) data.
arXiv Detail & Related papers (2025-10-14T20:07:14Z)
- On the Shape of Latent Variables in a Denoising VAE-MoG: A Posterior Sampling-Based Study [51.56484100374058]
We explore the latent space of a denoising variational autoencoder with a mixture-of-Gaussians prior (VAE-MoG). To evaluate how well the model captures the underlying structure, we use Hamiltonian Monte Carlo (HMC) to draw posterior samples conditioned on clean inputs, and compare them to the encoder's outputs from noisy data. Although the model reconstructs signals accurately, statistical comparisons reveal a clear mismatch in the latent space.
arXiv Detail & Related papers (2025-09-29T18:33:09Z)
- A simulation-based training framework for machine-learning applications in ARPES [0.0]
We introduce an open-source synthetic ARPES spectra simulator - aurelia - for generating the large datasets necessary to train machine learning models. We benchmark the simulation-trained model against actual experimental data and find that it can assess the spectra quality more accurately than human analysis.
arXiv Detail & Related papers (2025-08-21T21:59:09Z)
- Angles Don't Lie: Unlocking Training-Efficient RL Through the Model's Own Signals [49.17123504516502]
Current Reinforcement Fine-tuning (RFT) paradigms for Large Language Models (LLMs) suffer from inefficiency due to redundant exposure of identical queries under uniform data sampling. We propose GAIN-RL, a Gradient-driven Angle-Informed Navigated RL framework. By leveraging the model's intrinsic angle concentration signal, GAIN-RL dynamically selects training data in each epoch, ensuring consistently impactful gradient updates.
arXiv Detail & Related papers (2025-06-02T21:40:38Z)
- Ionospheric Scintillation Forecasting Using Machine Learning [0.4369058206183195]
The research focuses on developing a machine learning (ML) model that can forecast the intensity of amplitude scintillation.
The XGBoost model emerged as the most effective, demonstrating a remarkable 77% prediction accuracy when trained with a balanced dataset.
arXiv Detail & Related papers (2024-08-28T08:21:01Z)
- Anomaly Detection of Tabular Data Using LLMs [54.470648484612866]
We show that pre-trained large language models (LLMs) are zero-shot batch-level anomaly detectors.
We propose an end-to-end fine-tuning strategy to bring out the potential of LLMs in detecting real anomalies.
arXiv Detail & Related papers (2024-06-24T04:17:03Z)
- Diffusion posterior sampling for simulation-based inference in tall data settings [53.17563688225137]
Simulation-based inference (SBI) is capable of approximating the posterior distribution that relates input parameters to a given observation.
In this work, we consider a tall data extension in which multiple observations are available to better infer the parameters of the model.
We compare our method to recently proposed competing approaches on various numerical experiments and demonstrate its superiority in terms of numerical stability and computational cost.
arXiv Detail & Related papers (2024-04-11T09:23:36Z)
- To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We first investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting. Second, we examine the key factors contributing to multi-epoch degradation, finding that dataset size, model parameters, and training objectives all contribute significantly.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)
- Convolutional Neural Networks for the classification of glitches in gravitational-wave data streams [52.77024349608834]
We classify transient noise signals (i.e., glitches) and gravitational waves in data from the Advanced LIGO detectors.
We train models from scratch on the Gravity Spy dataset using a supervised learning approach.
We also explore a self-supervised approach, pre-training models with automatically generated pseudo-labels.
arXiv Detail & Related papers (2023-03-24T11:12:37Z)
- Decision Forest Based EMG Signal Classification with Low Volume Dataset Augmented with Random Variance Gaussian Noise [51.76329821186873]
We produce a model that can classify six different hand gestures with a limited number of samples that generalizes well to a wider audience.
We appeal to a set of more elementary methods, such as the use of random bounds on a signal, and aim to show the power these methods can carry in an online setting.
arXiv Detail & Related papers (2022-06-29T23:22:18Z)
- Flow-Based Likelihoods for Non-Gaussian Inference [0.0]
We investigate the use of data-driven likelihoods to bypass a key assumption made in many scientific analyses.
We show that the likelihood can be reconstructed to a precision equal to that of sampling error due to a finite sample size.
By introducing a suite of tests that can capture different levels of non-Gaussianity (NG) in the data, we show that the success or failure of traditional data-driven likelihoods can be tied back to the structure of the NG in the data.
arXiv Detail & Related papers (2020-07-10T18:00:00Z)
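The data-driven-likelihood idea in the last entry can be made concrete with a toy one-dimensional stand-in: a Gaussian kernel density estimate fitted to samples in place of an assumed Gaussian likelihood. The function name and bandwidth below are illustrative; the flows in the paper handle higher dimensions and stronger non-Gaussianity than this sketch.

```python
import numpy as np

def gaussian_kde_logpdf(samples, x, bandwidth=0.3):
    """Log-density of a 1-D Gaussian kernel density estimate at points x.

    A minimal data-driven likelihood: the density is learned from samples
    rather than assumed Gaussian (a toy analogue of flow-based likelihoods).
    """
    samples = np.asarray(samples, dtype=float)
    x = np.asarray(x, dtype=float)
    # Kernel matrix: one Gaussian kernel per sample, evaluated at each query point.
    z = (x[:, None] - samples[None, :]) / bandwidth
    log_kernels = -0.5 * z**2 - 0.5 * np.log(2 * np.pi) - np.log(bandwidth)
    # Average the kernels in log space for numerical stability (log-mean-exp).
    m = log_kernels.max(axis=1, keepdims=True)
    return m.squeeze(1) + np.log(np.exp(log_kernels - m).mean(axis=1))

rng = np.random.default_rng(2)
draws = rng.standard_normal(2000)  # pretend these are simulation outputs
logp = gaussian_kde_logpdf(draws, np.array([0.0, 3.0]))
```

For standard-normal draws, the estimated density is highest near zero and falls off in the tails, so the reconstructed likelihood tracks the sampling distribution up to finite-sample error, as the entry above describes.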
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.