Related papers: Same model, better performance: the impact of shuffling on DNA Language Models benchmarking

Same model, better performance: the impact of shuffling on DNA Language Models benchmarking

URL: http://arxiv.org/abs/2510.12617v1
Date: Tue, 14 Oct 2025 15:16:56 GMT
Title: Same model, better performance: the impact of shuffling on DNA Language Models benchmarking
Authors: Davide Greco, Konrad Rawlik,
Abstract summary: Large Language Models are increasingly popular in genomics due to their potential to decode complex biological sequences.<n>We show that evaluating DNA LMs is a complex task that intersects genomic's domain-specific challenges and machine learning methodologies.<n>We propose a simple solution: pre-shuffling data before storage eliminates hardware dependencies while maintaining efficiency.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models are increasingly popular in genomics due to their potential to decode complex biological sequences. Hence, researchers require a standardized benchmark to evaluate DNA Language Models (DNA LMs) capabilities. However, evaluating DNA LMs is a complex task that intersects genomic's domain-specific challenges and machine learning methodologies, where seemingly minor implementation details can significantly compromise benchmark validity. We demonstrate this through BEND (Benchmarking DNA Language Models), where hardware-dependent hyperparameters -- number of data loading workers and buffer sizes -- create spurious performance variations of up to 4% for identical models. The problem stems from inadequate data shuffling interacting with domain specific data characteristics. Experiments with three DNA language models (HyenaDNA, DNABERT-2, ResNet-LM) show these artifacts affect both absolute performance and relative model rankings. We propose a simple solution: pre-shuffling data before storage eliminates hardware dependencies while maintaining efficiency. This work highlights how standard ML practices can interact unexpectedly with domain-specific data characteristics, with broader implications for benchmark design in specialized domains.

Related papers

JanusDNA: A Powerful Bi-directional Hybrid DNA Foundation Model [7.8918969994977575]
Large language models (LLMs) have revolutionized natural language processing and are increasingly applied to other sequential data types.<n>We introduce JanusDNA, the first bidirectional DNA foundation model built upon a novel pretraining paradigm.<n>JanusDNA processes up to 1 million base pairs at single nucleotide resolution on a single 80GB GPU.
arXiv Detail & Related papers (2025-05-22T20:10:55Z)
ReGUIDE: Data Efficient GUI Grounding via Spatial Reasoning and Search [53.40810298627443]
ReGUIDE is a framework for web grounding that enables MLLMs to learn data efficiently through self-generated reasoning and spatial-aware criticism.<n>Our experiments demonstrate that ReGUIDE significantly advances web grounding performance across multiple benchmarks.
arXiv Detail & Related papers (2025-05-21T08:36:18Z)
HybriDNA: A Hybrid Transformer-Mamba2 Long-Range DNA Language Model [70.69095062674944]
We propose HybriDNA, a decoder-only DNA language model that incorporates a hybrid Transformer-Mamba2 architecture.<n>This hybrid design enables HybriDNA to efficiently process DNA sequences up to 131kb in length with single-nucleotide resolution.<n>HybriDNA achieves state-of-the-art performance across 33 DNA understanding datasets curated from the BEND, GUE, and LRB benchmarks.
arXiv Detail & Related papers (2025-02-15T14:23:43Z)
DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA [2.543784712990392]
Large genomic DNA language models (DNALMs) aim to learn generalizable representations of diverse DNA elements.<n>Our benchmarks target biologically meaningful downstream tasks such as functional sequence feature discovery, predicting cell-type specific regulatory activity, and counterfactual prediction of the impacts of genetic variants.
arXiv Detail & Related papers (2024-12-06T21:23:35Z)
DNAHLM -- DNA sequence and Human Language mixed large language Model [0.0]
This paper introduces a pre-trained model trained on the GPT-2 network, combining DNA sequences and English text. We then convert classification and other downstream tasks into Alpaca format instruction data, and perform instruction fine-tuning. The model has demonstrated its effectiveness in DNA related zero-shot prediction and multitask application.
arXiv Detail & Related papers (2024-10-22T11:51:09Z)
A Benchmark Dataset for Multimodal Prediction of Enzymatic Function Coupling DNA Sequences and Natural Language [3.384797724820242]
Predicting gene function from its DNA sequence is a fundamental challenge in biology. Deep learning models have been proposed to embed DNA sequences and predict their enzymatic function. Much of the scientific community's knowledge of biological function is not represented in categorical labels.
arXiv Detail & Related papers (2024-07-21T19:27:43Z)
Efficient and Scalable Fine-Tune of Language Models for Genome Understanding [49.606093223945734]
We present textscLingo: textscLanguage prefix ftextscIne-tuning for textscGentextscOmes. Unlike DNA foundation models, textscLingo strategically leverages natural language foundation models' contextual cues. textscLingo further accommodates numerous downstream fine-tune tasks by an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z)
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [115.501751261878]
Fine-tuning language models(LMs) on human-generated data remains a prevalent practice. We investigate whether we can go beyond human data on tasks where we have access to scalar feedback. We find that ReST$EM$ scales favorably with model size and significantly surpasses fine-tuning only on human data.
arXiv Detail & Related papers (2023-12-11T18:17:43Z)
BEND: Benchmarking DNA Language Models on biologically meaningful tasks [7.005668635562045]
We introduce BEND, a Benchmark for DNA language models, featuring a collection of realistic and biologically meaningful downstream tasks. We find that embeddings from current DNA LMs can approach performance of expert methods on some tasks, but only capture limited information about long-range features.
arXiv Detail & Related papers (2023-11-21T12:34:00Z)
The effect of data augmentation and 3D-CNN depth on Alzheimer's Disease detection [51.697248252191265]
This work summarizes and strictly observes best practices regarding data handling, experimental design, and model evaluation. We focus on Alzheimer's Disease (AD) detection, which serves as a paradigmatic example of challenging problem in healthcare. Within this framework, we train predictive 15 models, considering three different data augmentation strategies and five distinct 3D CNN architectures.
arXiv Detail & Related papers (2023-09-13T10:40:41Z)
Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods. We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments. The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.