Lerna: Transformer Architectures for Configuring Error Correction Tools
for Short- and Long-Read Genome Sequencing
- URL: http://arxiv.org/abs/2112.10068v1
- Date: Sun, 19 Dec 2021 05:59:26 GMT
- Title: Lerna: Transformer Architectures for Configuring Error Correction Tools
for Short- and Long-Read Genome Sequencing
- Authors: Atul Sharma, Pranjal Jain, Ashraf Mahgoub, Zihan Zhou, Kanak Mahadik,
and Somali Chaterji
- Abstract summary: We introduce Lerna for the automated configuration of k-mer-based EC tools.
We show that the best k-mer value can vary for different datasets, even for the same EC tool.
We also show that our attention-based models yield a significant runtime improvement for the entire pipeline.
- Score: 5.911600622951255
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sequencing technologies are prone to errors, making error correction (EC)
necessary for downstream applications. EC tools need to be manually configured
for optimal performance. We find that the optimal parameters (e.g., k-mer size)
are both tool- and dataset-dependent. Moreover, evaluating the performance
(i.e., Alignment-rate or Gain) of a given tool usually relies on a reference
genome, but quality reference genomes are not always available. We introduce
Lerna for the automated configuration of k-mer-based EC tools. Lerna first
creates a language model (LM) of the uncorrected genomic reads; it then
calculates the perplexity metric to evaluate the corrected reads for different
parameter choices. Next, it finds the choice that produces the highest alignment
rate without using a reference genome. The fundamental intuition of our
approach is that the perplexity metric is inversely correlated with the quality
of the assembly after error correction. Results: First, we show that the best
k-mer value can vary for different datasets, even for the same EC tool. Second,
we show the gains of our LM using its component attention-based transformers.
We show the model's estimation of the perplexity metric before and after error
correction. The lower the perplexity after correction, the better the k-mer
size. We also show that the alignment rate and assembly quality computed for
the corrected reads are strongly negatively correlated with the perplexity,
enabling the automated selection of k-mer values for better error correction,
and hence, improved assembly quality. Additionally, we show that our
attention-based models yield a significant runtime improvement for the entire
pipeline -- 18X faster than previous works, due to parallelizing the attention
mechanism and the use of JIT compilation for GPU inference.
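The selection loop the abstract describes (train an LM on the uncorrected reads, score each candidate k-mer size by the perplexity of the corrected reads, pick the minimum) can be sketched as follows. This is a toy illustration, not Lerna's implementation: a smoothed character k-gram model stands in for the transformer LM, and `correct_fn` is a hypothetical wrapper around an EC tool.

```python
import math
from collections import Counter

def train_lm(reads, k=3):
    # Count k-gram statistics over the (uncorrected) reads; this toy
    # model stands in for Lerna's transformer language model.
    counts, contexts = Counter(), Counter()
    for read in reads:
        for i in range(len(read) - k):
            counts[read[i:i + k + 1]] += 1
            contexts[read[i:i + k]] += 1
    return counts, contexts, k

def perplexity(lm, reads):
    # Perplexity = exp(mean negative log-likelihood), with add-one
    # smoothing over the 4-letter alphabet {A, C, G, T}.
    counts, contexts, k = lm
    nll, n = 0.0, 0
    for read in reads:
        for i in range(len(read) - k):
            p = (counts[read[i:i + k + 1]] + 1) / (contexts[read[i:i + k]] + 4)
            nll -= math.log(p)
            n += 1
    return math.exp(nll / n) if n else float("inf")

def select_kmer(uncorrected, correct_fn, k_values):
    # Run the EC tool for each candidate k and keep the k whose corrected
    # reads score the lowest perplexity -- no reference genome needed.
    lm = train_lm(uncorrected)
    scores = {k: perplexity(lm, correct_fn(uncorrected, k)) for k in k_values}
    return min(scores, key=scores.get)
```

The key property this relies on is the one stated in the abstract: corrected reads that align better also score lower perplexity under the LM, so minimizing perplexity is a reference-free proxy for maximizing alignment rate.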
Related papers
- LM-Combiner: A Contextual Rewriting Model for Chinese Grammatical Error Correction [49.0746090186582]
Over-correction is a critical problem in Chinese grammatical error correction (CGEC) task.
Recent work using model ensemble methods can effectively mitigate over-correction and improve the precision of the GEC system.
We propose the LM-Combiner, a rewriting model that can directly modify the over-correction of GEC system outputs without a model ensemble.
arXiv Detail & Related papers (2024-03-26T06:12:21Z)
- Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
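For context, the hard-margin SVM problem this equivalence refers to is the classic max-margin program (standard textbook formulation, not taken from the paper itself):

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{s.t.} \quad y_i \left( w^\top x_i + b \right) \ge 1,
\quad i = 1, \dots, n
```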
arXiv Detail & Related papers (2023-08-31T17:57:50Z)
- Accelerating Attention through Gradient-Based Learned Runtime Pruning [9.109136535767478]
Self-attention is a key enabler of state-of-the-art accuracy for transformer-based Natural Language Processing models.
This paper formulates the runtime pruning search through a soft differentiable regularizer integrated into the training loss function.
We devise a bit-serial architecture, dubbed LeOPArd, for transformer language models with bit-level early termination microarchitectural mechanism.
arXiv Detail & Related papers (2022-04-07T05:31:13Z)
- Correct-N-Contrast: A Contrastive Approach for Improving Robustness to Spurious Correlations [59.24031936150582]
Spurious correlations pose a major challenge for robust machine learning.
Models trained with empirical risk minimization (ERM) may learn to rely on correlations between class labels and spurious attributes.
We propose Correct-N-Contrast (CNC), a contrastive approach to directly learn representations robust to spurious correlations.
arXiv Detail & Related papers (2022-03-03T05:03:28Z)
- MBCT: Tree-Based Feature-Aware Binning for Individual Uncertainty Calibration [29.780204566046503]
We propose a feature-aware binning framework, called Multiple Boosting Trees (MBCT).
MBCT is non-monotonic and has the potential to improve order accuracy, thanks to its learnable binning scheme and individual calibration.
Results show that our method outperforms all competing models in terms of both calibration error and order accuracy.
arXiv Detail & Related papers (2022-02-09T08:59:16Z)
- Newer is not always better: Rethinking transferability metrics, their peculiarities, stability and performance [5.650647159993238]
Fine-tuning of large pre-trained image and language models on small customized datasets has become increasingly popular.
We show that the statistical problems with covariance estimation drive the poor performance of H-score.
We propose a correction and recommend measuring correlation performance against relative accuracy in such settings.
arXiv Detail & Related papers (2021-10-13T17:24:12Z)
- Tail-to-Tail Non-Autoregressive Sequence Prediction for Chinese Grammatical Error Correction [49.25830718574892]
We present a new framework named Tail-to-Tail (TtT) non-autoregressive sequence prediction.
Most tokens are correct and can be conveyed directly from source to target, while the error positions can be estimated and corrected.
Experimental results on standard datasets, especially on the variable-length datasets, demonstrate the effectiveness of TtT in terms of sentence-level Accuracy, Precision, Recall, and F1-Measure.
arXiv Detail & Related papers (2021-06-03T05:56:57Z)
- Localized Calibration: Metrics and Recalibration [133.07044916594361]
We propose a fine-grained calibration metric that spans the gap between fully global and fully individualized calibration.
We then introduce a localized recalibration method, LoRe, that improves the localized calibration error (LCE) more than existing recalibration methods.
arXiv Detail & Related papers (2021-02-22T07:22:12Z)
- Evaluating Prediction-Time Batch Normalization for Robustness under Covariate Shift [81.74795324629712]
We evaluate prediction-time batch normalization, which significantly improves model accuracy and calibration under covariate shift.
We show that prediction-time batch normalization provides complementary benefits to existing state-of-the-art approaches for improving robustness.
The method has mixed results when used alongside pre-training, and does not seem to perform as well under more natural types of dataset shift.
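The core idea is simple to sketch: at test time, normalize each feature with the statistics of the incoming batch itself rather than the training-time running averages. A minimal NumPy illustration (the function name and shapes are illustrative, not from the paper):

```python
import numpy as np

def prediction_time_batchnorm(x, gamma, beta, eps=1e-5):
    # Normalize with the *test* batch's own mean/variance instead of
    # the running averages accumulated during training.
    mu = x.mean(axis=0)      # per-feature mean of the test batch
    var = x.var(axis=0)      # per-feature variance of the test batch
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```

Under covariate shift, the test-batch statistics track the shifted input distribution, which is why this simple change can improve both accuracy and calibration.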
arXiv Detail & Related papers (2020-06-19T05:08:43Z)
- Mix-n-Match: Ensemble and Compositional Methods for Uncertainty Calibration in Deep Learning [21.08664370117846]
We show how Mix-n-Match calibration strategies can help achieve remarkably better data-efficiency and expressive power.
We also reveal potential issues in standard evaluation practices.
Our approaches outperform state-of-the-art solutions on both the calibration as well as the evaluation tasks.
arXiv Detail & Related papers (2020-03-16T17:00:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.