Forecasting labels under distribution-shift for machine-guided sequence design
- URL: http://arxiv.org/abs/2211.10422v1
- Date: Fri, 18 Nov 2022 18:35:50 GMT
- Title: Forecasting labels under distribution-shift for machine-guided sequence design
- Authors: Lauren Berk Wheelock, Stephen Malina, Jeffrey Gerold, Sam Sinai
- Abstract summary: We propose a method to guide decision-making that forecasts the performance of high-throughput libraries.
We show that our method outperforms baselines that naively use model scores to estimate library performance.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to design and optimize biological sequences with specific
functionalities would unlock enormous value in technology and healthcare. In
recent years, machine learning-guided sequence design has progressed this goal
significantly, though validating designed sequences in the lab or clinic takes
many months and substantial labor. It is therefore valuable to assess the
likelihood that a designed set contains sequences of the desired quality (which
often lies outside the label distribution in our training data) before
committing resources to an experiment. Forecasting, a prominent concept in many
domains where feedback can be delayed (e.g. elections), has not been used or
studied in the context of sequence design. Here we propose a method to guide
decision-making that forecasts the performance of high-throughput libraries
(e.g. containing $10^5$ unique variants) based on estimates provided by models,
providing a posterior for the distribution of labels in the library. We show
that our method outperforms baselines that naively use model scores to estimate
library performance, which are the only tool available today for this purpose.
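As an illustration of the forecasting idea, the sketch below assumes (hypothetically) that each model score equals the true label plus Gaussian noise of known scale, and uses Monte Carlo sampling to obtain a posterior-style estimate of the fraction of a library whose true labels exceed a quality threshold. The paper's actual forecasting model is more sophisticated; this only shows why naive thresholding of raw scores differs from forecasting under an error model.

```python
import numpy as np

def forecast_library(scores, noise_sigma, threshold, n_draws=1000, seed=0):
    """Monte Carlo forecast of the fraction of library variants whose
    true label exceeds `threshold`, under the illustrative assumption
    that model score = true label + Gaussian noise of scale `noise_sigma`."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Draw plausible label vectors consistent with the noisy scores.
    draws = scores + rng.normal(0.0, noise_sigma, size=(n_draws, scores.size))
    # Each draw yields one plausible "fraction of hits" for the library.
    hit_fractions = (draws > threshold).mean(axis=1)
    return hit_fractions.mean(), np.quantile(hit_fractions, [0.05, 0.95])

# A synthetic library of 10,000 variants with scores around 0.5.
mean_frac, (lo, hi) = forecast_library(
    scores=np.random.default_rng(1).normal(0.5, 0.2, size=10_000),
    noise_sigma=0.1, threshold=0.8)
```

The interval `(lo, hi)` plays the role of the posterior summary: it reflects how much the forecasted hit rate could vary given the assumed score-to-label noise.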
Related papers
- TimeRecipe: A Time-Series Forecasting Recipe via Benchmarking Module Level Effectiveness [23.143208640116253]
TimeRecipe is a framework that systematically evaluates time-series forecasting methods at the module level.
TimeRecipe conducts over 10,000 experiments to assess the effectiveness of individual components.
Our results reveal that exhaustive exploration of the design space can yield models that outperform existing state-of-the-art methods.
arXiv Detail & Related papers (2025-06-06T19:11:48Z)
- LC-Protonets: Multi-label Few-shot learning for world music audio tagging [65.72891334156706]
We introduce Label-Combination Prototypical Networks (LC-Protonets) to address the problem of multi-label few-shot classification.
LC-Protonets generate one prototype per label combination, derived from the power set of labels present in the limited training items.
Our method is applied to automatic audio tagging across diverse music datasets, covering various cultures and including both modern and traditional music.
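A minimal reading of the label-combination idea can be sketched as follows: build one prototype per distinct label combination seen in the support set, then assign a query the combination of its nearest prototype. (The actual LC-Protonets derive combinations from the power set of the labels present; this simplified sketch only uses combinations that appear verbatim in the support items.)

```python
import numpy as np

def label_combination_prototypes(embeddings, label_sets):
    """One prototype (mean embedding) per distinct label combination
    observed in the support set -- a simplified LC-Protonets sketch."""
    protos = {}
    for emb, labels in zip(embeddings, label_sets):
        protos.setdefault(frozenset(labels), []).append(emb)
    return {k: np.mean(v, axis=0) for k, v in protos.items()}

def predict(query, protos):
    """Assign the query the label combination of its nearest prototype."""
    return min(protos, key=lambda k: np.linalg.norm(query - protos[k]))
```

For example, support items tagged {"rock"} and {"rock", "folk"} yield two prototypes, and a query embedding near the first group is assigned the single-label combination.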
arXiv Detail & Related papers (2024-09-17T15:13:07Z)
- Exploring Beyond Logits: Hierarchical Dynamic Labeling Based on Embeddings for Semi-Supervised Classification [49.09505771145326]
We propose a Hierarchical Dynamic Labeling (HDL) algorithm that does not depend on model predictions and utilizes image embeddings to generate sample labels.
Our approach has the potential to change the paradigm of pseudo-label generation in semi-supervised learning.
arXiv Detail & Related papers (2024-04-26T06:00:27Z)
- How Does Beam Search improve Span-Level Confidence Estimation in Generative Sequence Labeling? [11.481435098152893]
This paper aims to provide some empirical insights on estimating model confidence for generative sequence labeling.
As verified over six public datasets, we show that our proposed approach significantly reduces calibration errors of the predictions of a generative sequence labeling model.
arXiv Detail & Related papers (2022-12-21T05:01:01Z)
- Stream-based active learning with linear models [0.7734726150561089]
In production, instead of performing random inspections to obtain product information, labels are collected by evaluating the information content of the unlabeled data.
We propose a new strategy for the stream-based scenario, where instances are sequentially offered to the learner.
The iterative aspect of the decision-making process is tackled by setting a threshold on the informativeness of the unlabeled data points.
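The threshold-based decision rule can be sketched generically: for each arriving instance, compute an informativeness measure and query a label only when it exceeds the threshold. The measure below (distance to the nearest labeled point) is an illustrative stand-in; the paper's criterion for linear models differs.

```python
import numpy as np

def should_query(x, X_labeled, threshold):
    """Decide whether to request a label for streaming instance `x`.
    Informativeness here is the distance to the nearest already-labeled
    point -- a hypothetical stand-in for the paper's actual criterion."""
    if len(X_labeled) == 0:
        return True  # nothing labeled yet: always query the first instance
    informativeness = min(np.linalg.norm(x - np.asarray(p)) for p in X_labeled)
    return informativeness > threshold
```

In a stream loop, instances for which `should_query` returns `False` are discarded unlabeled, which is what keeps the labeling budget bounded.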
arXiv Detail & Related papers (2022-07-20T13:15:23Z)
- Squeezing Backbone Feature Distributions to the Max for Efficient Few-Shot Learning [3.1153758106426603]
Few-shot classification is a challenging problem due to the uncertainty caused by using few labelled samples.
We propose a novel transfer-based method which aims at processing the feature vectors so that they become closer to Gaussian-like distributions.
In the case of transductive few-shot learning where unlabelled test samples are available during training, we also introduce an optimal-transport inspired algorithm to boost even further the achieved performance.
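A common way to push non-negative backbone features toward Gaussian-like distributions, used in this line of few-shot work, is an element-wise power transform followed by normalization. The sketch below shows that step only; the optimal-transport refinement is not included.

```python
import numpy as np

def power_transform(features, beta=0.5, eps=1e-6):
    """Element-wise power transform of non-negative backbone features
    (beta < 1 compresses large activations, making the per-dimension
    distribution closer to Gaussian), followed by L2 normalization."""
    f = np.power(np.asarray(features, dtype=float) + eps, beta)
    return f / np.linalg.norm(f, axis=-1, keepdims=True)
```

With `beta=0.5` this is the square-root transform; the resulting vectors lie on the unit sphere, which simplifies the distance computations used downstream.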
arXiv Detail & Related papers (2021-10-18T16:29:17Z)
- A Positive/Unlabeled Approach for the Segmentation of Medical Sequences using Point-Wise Supervision [3.883460584034766]
We propose a new method to efficiently segment medical imaging volumes or videos using point-wise annotations only.
Our approach trains a deep learning model using an appropriate Positive/Unlabeled objective function using point-wise annotations.
We show experimentally that our approach outperforms state-of-the-art methods tailored to the same problem.
arXiv Detail & Related papers (2021-07-18T09:13:33Z)
- Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
- Beta-CoRM: A Bayesian Approach for $n$-gram Profiles Analysis [0.0]
The flexibility of the proposed modelling allows for a straightforward approach to feature selection in the generative model.
A slice sampling algorithm is derived for a fast inferential procedure, which is applied to synthetic and real data scenarios.
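For readers unfamiliar with slice sampling, a generic univariate version with Neal's step-out and shrinkage procedures is sketched below; the paper derives a sampler specific to the Beta-CoRM posterior, which this does not reproduce.

```python
import math
import random

def slice_sample(log_density, x0, n_samples, w=1.0, seed=0):
    """Generic univariate slice sampler (step-out + shrinkage)."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        # Auxiliary variable: a height drawn under the density at x.
        log_y = log_density(x) + math.log(rng.random())
        # Step out: grow an interval until both ends fall below the slice.
        left = x - w * rng.random()
        right = left + w
        while log_density(left) > log_y:
            left -= w
        while log_density(right) > log_y:
            right += w
        # Shrinkage: propose within the interval, shrinking on rejection.
        while True:
            x_new = rng.uniform(left, right)
            if log_density(x_new) > log_y:
                x = x_new
                break
            if x_new < x:
                left = x_new
            else:
                right = x_new
        samples.append(x)
    return samples

# Sanity check: sampling a standard normal (log density up to a constant).
draws = slice_sample(lambda x: -0.5 * x * x, 0.0, 2000)
```

Unlike Metropolis-Hastings, no step-size tuning is needed beyond the initial interval width `w`, which is what makes slice sampling attractive as a fast inferential procedure.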
arXiv Detail & Related papers (2020-11-23T17:12:34Z)
- BREEDS: Benchmarks for Subpopulation Shift [98.90314444545204]
We develop a methodology for assessing the robustness of models to subpopulation shift.
We leverage the class structure underlying existing datasets to control the data subpopulations that comprise the training and test distributions.
Applying this methodology to the ImageNet dataset, we create a suite of subpopulation shift benchmarks of varying granularity.
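The core construction can be sketched as a split of each superclass's subclasses between source and target domains, so that train and test share superclass labels but contain disjoint subpopulations. This is a minimal sketch of the idea, not the BREEDS tooling itself.

```python
import random

def subpopulation_split(class_hierarchy, seed=0):
    """Split each superclass's subclasses between a source (train) and
    target (test) domain, producing a subpopulation-shift benchmark:
    same superclass labels, disjoint subpopulations."""
    rng = random.Random(seed)
    source, target = {}, {}
    for superclass, subclasses in class_hierarchy.items():
        subs = sorted(subclasses)
        rng.shuffle(subs)
        half = len(subs) // 2
        source[superclass] = subs[:half]
        target[superclass] = subs[half:]
    return source, target
```

For instance, a "dog" superclass might train on beagle and pug images while being tested on husky and poodle images, probing robustness to unseen subpopulations.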
arXiv Detail & Related papers (2020-08-11T17:04:47Z)
- Uncertainty-aware Self-training for Text Classification with Few Labels [54.13279574908808]
We study self-training as one of the earliest semi-supervised learning approaches to reduce the annotation bottleneck.
We propose an approach to improve self-training by incorporating uncertainty estimates of the underlying neural network.
We show our methods leveraging only 20-30 labeled samples per class for each task for training and for validation can perform within 3% of fully supervised pre-trained language models.
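A generic sketch of uncertainty-gated pseudo-labeling: average predictions over several stochastic forward passes (e.g. MC dropout) and keep only unlabeled examples whose predictive entropy is low. This illustrates the idea, not the paper's exact acquisition rule.

```python
import numpy as np

def select_pseudo_labels(mc_probs, max_entropy):
    """Select unlabeled examples for self-training by the entropy of the
    mean prediction over stochastic forward passes.
    `mc_probs` has shape (n_passes, n_examples, n_classes).
    Returns (kept indices, their pseudo-labels)."""
    mean_probs = mc_probs.mean(axis=0)
    entropy = -(mean_probs * np.log(mean_probs + 1e-12)).sum(axis=-1)
    keep = entropy < max_entropy
    return np.flatnonzero(keep), mean_probs.argmax(axis=-1)[keep]
```

Examples whose predictions disagree across passes have high mean-prediction entropy and are excluded, which is what keeps noisy pseudo-labels out of the next training round.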
arXiv Detail & Related papers (2020-06-27T08:13:58Z)
- Document Ranking with a Pretrained Sequence-to-Sequence Model [56.44269917346376]
We show how a sequence-to-sequence model can be trained to generate relevance labels as "target words".
Our approach significantly outperforms an encoder-only model in a data-poor regime.
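At inference time, this style of scoring reduces to a softmax over the logits of just the two target words ("true"/"false"), taking the probability of "true" as the relevance score. The standalone sketch below assumes those two logits have already been extracted from the model.

```python
import numpy as np

def relevance_score(logit_true, logit_false):
    """Relevance from a seq2seq model's first-decoded-token logits for
    the 'true' and 'false' target words: softmax over only those two
    tokens, returning the probability assigned to 'true'."""
    m = max(logit_true, logit_false)  # subtract max for numerical stability
    e_t = np.exp(logit_true - m)
    e_f = np.exp(logit_false - m)
    return e_t / (e_t + e_f)
```

Documents are then ranked by this probability; restricting the softmax to the two target tokens is what turns a generative model into a pointwise ranker.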
arXiv Detail & Related papers (2020-03-14T22:29:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including all information) and is not responsible for any consequences of its use.