Fugu-MT paper translation (summary): Same model, better performance: the impact of shuffling on DNA Language Models benchmarking

Paper summary: Same model, better performance: the impact of shuffling on DNA Language Models benchmarking

arxiv url: http://arxiv.org/abs/2510.12617v1
Date: Tue, 14 Oct 2025 15:16:56 GMT
Status: Translation complete
System update date: 2025-10-15 19:02:32.365292
Title: Same model, better performance: the impact of shuffling on DNA Language Models benchmarking
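Title (reference translation): Same model, better performance: the impact of shuffling on DNA language model benchmarking
Authors: Davide Greco, Konrad Rawlik
Abstract summary: Large language models are increasingly popular in genomics because of their potential to decode complex biological sequences. We show that evaluating DNA LMs is a complex task at the intersection of genomics' domain-specific challenges and machine learning methodology. We propose a simple solution: pre-shuffling data before storage, which eliminates hardware dependencies while maintaining efficiency.
Reference score (independently computed attention level): 0.0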
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models are increasingly popular in genomics due to their potential to decode complex biological sequences. Hence, researchers require a standardized benchmark to evaluate the capabilities of DNA Language Models (DNA LMs). However, evaluating DNA LMs is a complex task at the intersection of genomics' domain-specific challenges and machine learning methodologies, where seemingly minor implementation details can significantly compromise benchmark validity. We demonstrate this through BEND (Benchmarking DNA Language Models), where hardware-dependent hyperparameters -- the number of data loading workers and the buffer sizes -- create spurious performance variations of up to 4% for identical models. The problem stems from inadequate data shuffling interacting with domain-specific data characteristics. Experiments with three DNA language models (HyenaDNA, DNABERT-2, ResNet-LM) show that these artifacts affect both absolute performance and relative model rankings. We propose a simple solution: pre-shuffling data before storage eliminates hardware dependencies while maintaining efficiency. This work highlights how standard ML practices can interact unexpectedly with domain-specific data characteristics, with broader implications for benchmark design in specialized domains.
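Abstract (reference translation): Large language models are increasingly popular in genomics because of their potential to decode complex biological sequences. Researchers therefore need a standardized benchmark to evaluate the capabilities of DNA language models (DNA LMs). However, evaluating DNA LMs is a complex task at the intersection of genomics' domain-specific challenges and machine learning methodology, and seemingly minor implementation details can seriously undermine the validity of a benchmark. In BEND (Benchmarking DNA Language Models), hardware-dependent hyperparameters -- the number of data loading workers and the buffer sizes -- produce spurious performance variations of up to 4% for identical models. The problem arises from insufficient data shuffling interacting with domain-specific data characteristics. Experiments with three DNA language models (HyenaDNA, DNABERT-2, ResNet-LM) show that these artifacts affect both absolute performance and relative model rankings. We propose a simple solution: pre-shuffling data before storage, which eliminates hardware dependencies while maintaining efficiency. This work highlights how standard ML practices can interact unexpectedly with domain-specific data characteristics, with broader implications for benchmark design in specialized domains.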
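The shuffling issue described in the abstract can be illustrated with a toy simulation. The sketch below is not the BEND data pipeline or the authors' code. It assumes a hypothetical stored dataset in which all negative examples precede all positive ones (the abstract does not say which data characteristic is responsible), and it models only the buffer size, not the number of data loading workers. The names `buffer_shuffle`, `batch_imbalance`, and `simulate` are made up for this example.

```python
import random
from typing import Iterable, Iterator, List


def buffer_shuffle(stream: Iterable[int], buffer_size: int, rng: random.Random) -> Iterator[int]:
    """Streaming shuffle: keep up to `buffer_size` items and emit a random one when full."""
    buffer: List[int] = []
    for item in stream:
        buffer.append(item)
        if len(buffer) > buffer_size:
            # Emit one uniformly chosen buffered item to make room.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Drain whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


def batch_imbalance(labels: List[int], batch_size: int) -> float:
    """Mean |positive fraction in batch - global positive fraction| over full batches."""
    global_frac = sum(labels) / len(labels)
    devs = []
    for start in range(0, len(labels) - batch_size + 1, batch_size):
        batch = labels[start:start + batch_size]
        devs.append(abs(sum(batch) / batch_size - global_frac))
    return sum(devs) / len(devs)


def simulate(buffer_size: int, pre_shuffle: bool, seed: int = 0) -> float:
    rng = random.Random(seed)
    # Assumed (hypothetical) storage order: 512 negatives followed by 512 positives.
    stored = [0] * 512 + [1] * 512
    if pre_shuffle:
        # One-off shuffle applied before the data is written to storage.
        rng.shuffle(stored)
    emitted = list(buffer_shuffle(stored, buffer_size, rng))
    return batch_imbalance(emitted, batch_size=32)


if __name__ == "__main__":
    for buffer_size in (8, 1024):
        for pre_shuffle in (False, True):
            score = simulate(buffer_size, pre_shuffle)
            print(f"buffer_size={buffer_size:4d} pre_shuffle={pre_shuffle}: imbalance={score:.3f}")
```

In this toy setting, a small buffer over ordered data yields batches that are almost entirely one class, so batch composition depends on the buffer size. With pre-shuffled storage the composition is close to the global class ratio whatever the buffer size, which is the kind of hardware independence the abstract attributes to pre-shuffling.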

Related papers