Towards Label-Free Biological Reasoning Synthetic Dataset Creation via Uncertainty Filtering
- URL: http://arxiv.org/abs/2510.05871v1
- Date: Tue, 07 Oct 2025 12:40:37 GMT
- Title: Towards Label-Free Biological Reasoning Synthetic Dataset Creation via Uncertainty Filtering
- Authors: Josefa Lia Stoisser, Lawrence Phillips, Aditya Misra, Tom A. Lamb, Philip Torr, Marc Boubnovski Martell, Julien Fauqueur, Kaspar Märtens
- Abstract summary: Synthetic chain-of-thought (CoT) traces are widely used to train large reasoning models (LRMs). We propose a label-free alternative: uncertainty-based filtering, which uses a model's own confidence as a substitute for external labels.
- Score: 20.433272447607106
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Synthetic chain-of-thought (CoT) traces are widely used to train large reasoning models (LRMs), improving generalization by providing step-level supervision. Yet most approaches require ground-truth labels to seed or filter these traces - an expensive bottleneck in domains like biology where wet-lab data are scarce. We propose a label-free alternative: uncertainty-based filtering, which uses a model's own confidence - quantified through established uncertainty metrics like self-consistency and predictive perplexity - as a substitute for external labels. We sample multiple reasoning traces and retain only low-uncertainty subsets. Applied to biological perturbation prediction, a domain where wet-lab labels are especially costly, we show that the filtered subset has higher accuracy, and that supervised fine-tuning (SFT) on uncertainty-filtered data outperforms unfiltered synthetic data, narrows the gap to ground-truth training, and surpasses strong LRM baselines. Ablations show that per-class filtering corrects for class-specific uncertainty scales and that hybrid uncertainty metrics yield higher-quality datasets. Our results suggest that model-internal confidence is a powerful signal for efficient reasoning dataset creation, enabling LRMs in domains where supervision is expensive.
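The filtering step described in the abstract (sample several traces per prompt, score uncertainty, keep the low-uncertainty subset, optionally per class) can be illustrated with a minimal sketch. It uses self-consistency only, measured as the fraction of sampled answers that agree with the majority vote. The function names, the `keep_fraction` parameter, and the `up`/`down`/`none` labels are hypothetical illustrations, not taken from the paper, and the authors' actual procedure may differ.

```python
from collections import Counter, defaultdict


def self_consistency(answers):
    """Return (majority_answer, agreement) for one prompt's sampled final answers.

    Agreement is the share of samples that match the majority answer;
    higher agreement is treated here as lower uncertainty.
    """
    majority, votes = Counter(answers).most_common(1)[0]
    return majority, votes / len(answers)


def filter_per_class(samples, keep_fraction=0.5):
    """Keep the most self-consistent prompts within each predicted class.

    samples: dict mapping a prompt id to the list of final answers
             extracted from its sampled reasoning traces.
    keep_fraction: hypothetical knob; share of prompts kept per class.
    Ranking inside each class, rather than globally, is one way to avoid
    a single class with systematically higher agreement dominating.
    """
    by_class = defaultdict(list)
    for prompt_id, answers in samples.items():
        label, agreement = self_consistency(answers)
        by_class[label].append((agreement, prompt_id))

    kept = {}
    for label, scored in by_class.items():
        # Highest agreement first; prompt id breaks ties deterministically.
        scored.sort(key=lambda item: (-item[0], item[1]))
        n_keep = max(1, int(len(scored) * keep_fraction))
        for agreement, prompt_id in scored[:n_keep]:
            kept[prompt_id] = (label, agreement)
    return kept


# Toy input: four prompts, four sampled answers each (no ground-truth labels used).
samples = {
    "q1": ["up", "up", "up", "up"],
    "q2": ["up", "down", "up", "none"],
    "q3": ["down", "down", "down", "up"],
    "q4": ["down", "up", "down", "none"],
}
kept = filter_per_class(samples, keep_fraction=0.5)
print(kept)
```

In a full pipeline, the traces whose final answer matches the retained majority label would form the fine-tuning set. A hybrid score, for example one that also takes predictive perplexity into account, could replace the agreement value in the ranking.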
Related papers
- Heterogeneous Multisource Transfer Learning via Model Averaging for Positive-Unlabeled Data [2.030810815519794]
We propose a novel transfer learning framework that integrates information from heterogeneous data sources without direct data sharing. For each source domain type, a tailored logistic regression model is fitted, and knowledge is transferred to the PU target domain through model averaging. Our method outperforms other comparative methods in terms of predictive accuracy and robustness, especially under limited labeled data and heterogeneous environments.
arXiv Detail & Related papers (2025-11-14T03:15:31Z)
- Semi-Supervised Regression with Heteroscedastic Pseudo-Labels [50.54050677867914]
We propose an uncertainty-aware pseudo-labeling framework that dynamically adjusts pseudo-label influence from a bi-level optimization perspective. We provide theoretical insights and extensive experiments to validate our approach across various benchmark SSR datasets.
arXiv Detail & Related papers (2025-10-17T03:06:23Z)
- Latent Noise Injection for Private and Statistically Aligned Synthetic Data Generation [7.240170769827935]
Synthetic data generation has become essential for scalable, privacy-preserving statistical analysis. We propose a Latent Noise Injection method using Masked Autoregressive Flows (MAF). Instead of directly sampling from the trained model, our method perturbs each data point in the latent space and maps it back to the data domain.
arXiv Detail & Related papers (2025-06-19T22:22:57Z)
- Suitability Filter: A Statistical Framework for Classifier Evaluation in Real-World Deployment Settings [33.080398349395686]
We propose a novel framework designed to detect performance deterioration by utilizing suitability signals. We aggregate suitability signals for both test and user data and compare these empirical distributions. This enables proactive mitigation of potential failures in high-stakes applications.
arXiv Detail & Related papers (2025-05-28T13:37:04Z)
- The Decaying Missing-at-Random Framework: Model Doubly Robust Causal Inference with Partially Labeled Data [8.916614661563893]
We introduce a decaying missing-at-random (decaying MAR) framework and associated approaches for doubly robust causal inference. This simultaneously addresses selection bias in the labeling mechanism and the extreme imbalance between labeled and unlabeled groups. To ensure robust causal conclusions, we propose a bias-reduced SS estimator for the average treatment effect.
arXiv Detail & Related papers (2023-05-22T07:37:12Z)
- Uncertainty-Aware Source-Free Adaptive Image Super-Resolution with Wavelet Augmentation Transformer [60.31021888394358]
Unsupervised Domain Adaptation (UDA) can effectively address domain gap issues in real-world image Super-Resolution (SR).
We propose a SOurce-free Domain Adaptation framework for image SR (SODA-SR) to address this issue, i.e., adapt a source-trained model to a target domain with only unlabeled target data.
arXiv Detail & Related papers (2023-03-31T03:14:44Z)
- Uncertainty-aware Self-training for Low-resource Neural Sequence Labeling [29.744621356187764]
This paper presents SeqUST, a novel uncertainty-aware self-training framework for neural sequence labeling (NSL).
We incorporate Monte Carlo (MC) dropout in a Bayesian neural network (BNN) to perform uncertainty estimation at the token level and then select reliable language tokens from unlabeled data.
A well-designed masked sequence labeling task with a noise-robust loss supports robust training, which aims to suppress the problem of noisy pseudo labels.
arXiv Detail & Related papers (2023-02-17T02:40:04Z)
- Delving into Probabilistic Uncertainty for Unsupervised Domain Adaptive Person Re-Identification [54.174146346387204]
We propose an approach named probabilistic uncertainty guided progressive label refinery (P$^2$LR) for domain adaptive person re-identification.
A quantitative criterion is established to measure the uncertainty of pseudo labels and facilitate the network training.
Our method outperforms the baseline by 6.5% mAP on the Duke2Market task, while surpassing the state-of-the-art method by 2.5% mAP on the Market2MSMT task.
arXiv Detail & Related papers (2021-12-28T07:40:12Z)
- Unsupervised Robust Domain Adaptation without Source Data [75.85602424699447]
We study the problem of robust domain adaptation in the context of unavailable target labels and source data.
We show a consistent performance improvement of over 10% in accuracy against the tested baselines on four benchmark datasets.
arXiv Detail & Related papers (2021-03-26T16:42:28Z)
- Self-training Avoids Using Spurious Features Under Domain Shift [54.794607791641745]
In unsupervised domain adaptation, conditional entropy minimization and pseudo-labeling work even when the domain shifts are much larger than those analyzed by existing theory.
We identify and analyze one particular setting where the domain shift can be large, but certain spurious features correlate with the label in the source domain and are independent of the label in the target.
arXiv Detail & Related papers (2020-06-17T17:51:42Z)
- Towards Discriminability and Diversity: Batch Nuclear-norm Maximization under Label Insufficient Situations [154.51144248210338]
Batch Nuclear-norm Maximization (BNM) is proposed to boost learning in label-insufficient scenarios.
BNM outperforms competitors and works well with existing well-known methods.
arXiv Detail & Related papers (2020-03-27T05:04:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.