Related papers: HAD: Hybrid Architecture Distillation Outperforms Teacher in Genomic Sequence Modeling

HAD: Hybrid Architecture Distillation Outperforms Teacher in Genomic Sequence Modeling

URL: http://arxiv.org/abs/2505.20836v1
Date: Tue, 27 May 2025 07:57:35 GMT
Title: HAD: Hybrid Architecture Distillation Outperforms Teacher in Genomic Sequence Modeling
Authors: Hexiong Yang, Mingrui Chen, Huaibo Huang, Junxian Duan, Jie Cao, Zhen Zhou, Ran He,
Abstract summary: We propose a Hybrid Architecture Distillation (HAD) approach for DNA sequence modeling.<n>We employ the NTv2-500M as the teacher model and devise a grouping masking strategy.<n>Compared to models with similar parameters, our model achieved excellent performance.
Score: 52.58723853697152
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Inspired by the great success of Masked Language Modeling (MLM) in the natural language domain, the paradigm of self-supervised pre-training and fine-tuning has also achieved remarkable progress in the field of DNA sequence modeling. However, previous methods often relied on massive pre-training data or large-scale base models with huge parameters, imposing a significant computational burden. To address this, many works attempted to use more compact models to achieve similar outcomes but still fell short by a considerable margin. In this work, we propose a Hybrid Architecture Distillation (HAD) approach, leveraging both distillation and reconstruction tasks for more efficient and effective pre-training. Specifically, we employ the NTv2-500M as the teacher model and devise a grouping masking strategy to align the feature embeddings of visible tokens while concurrently reconstructing the invisible tokens during MLM pre-training. To validate the effectiveness of our proposed method, we conducted comprehensive experiments on the Nucleotide Transformer Benchmark and Genomic Benchmark. Compared to models with similar parameters, our model achieved excellent performance. More surprisingly, it even surpassed the distillation ceiling-teacher model on some sub-tasks, which is more than 500 $\times$ larger. Lastly, we utilize t-SNE for more intuitive visualization, which shows that our model can gain a sophisticated understanding of the intrinsic representation pattern in genomic sequences.

Related papers

Nanbeige4-3B Technical Report: Exploring the Frontier of Small Language Models [23.832817775138675]
Nanbeige4-3B is a family of small-scale but high-performing language models.<n>Pretrained on 23T high-quality tokens and finetuned on over 30 million diverse instructions, we extend the boundary of the scaling law for small language models.
arXiv Detail & Related papers (2025-12-06T03:36:27Z)
Round-trip Reinforcement Learning: Self-Consistent Training for Better Chemical LLMs [51.29260537017623]
Large Language Models (LLMs) are emerging as versatile foundation models for computational chemistry.<n>These models often lack round-trip consistency.<n>We introduce Round-Trip Reinforcement Learning (RTRL), a novel framework that trains a model to improve its consistency.
arXiv Detail & Related papers (2025-10-01T23:58:58Z)
GRAM: A Generative Foundation Reward Model for Reward Generalization [48.63394690265176]
We develop a generative reward model that is first trained via large-scale unsupervised learning and then fine-tuned via supervised learning.<n>This model generalizes well across several tasks, including response ranking, reinforcement learning from human feedback, and task adaptation with fine-tuning.
arXiv Detail & Related papers (2025-06-17T04:34:27Z)
Platonic Grounding for Efficient Multimodal Language Models [22.715168904364756]
We motivate and propose a simple modification to existing multimodal frameworks that rely on aligning pretrained models.<n>Our work also has implications for combining pretrained models into larger systems efficiently.
arXiv Detail & Related papers (2025-04-27T18:56:26Z)
A Collaborative Ensemble Framework for CTR Prediction [73.59868761656317]
We propose a novel framework, Collaborative Ensemble Training Network (CETNet), to leverage multiple distinct models. Unlike naive model scaling, our approach emphasizes diversity and collaboration through collaborative learning. We validate our framework on three public datasets and a large-scale industrial dataset from Meta.
arXiv Detail & Related papers (2024-11-20T20:38:56Z)
Masked Generative Priors Improve World Models Sequence Modelling Capabilities [19.700020499490137]
Masked Generative Modelling has emerged as a more efficient and superior inductive bias for modelling.<n>GIT-STORM demonstrates substantial performance gains in RL tasks on the Atari 100k benchmark.<n>We apply Transformer-based World Models to continuous action environments for the first time, addressing a significant gap in prior research.
arXiv Detail & Related papers (2024-10-10T11:52:07Z)
Is Tokenization Needed for Masked Particle Modelling? [8.79008927474707]
Masked particle modeling (MPM) is a self-supervised learning scheme for constructing expressive representations of unordered sets. We improve MPM by addressing inefficiencies in the implementation and incorporating a more powerful decoder. We show that these new methods outperform the tokenized learning objective from the original MPM on a new test bed for foundation models for jets.
arXiv Detail & Related papers (2024-09-19T09:12:29Z)
Majority Kernels: An Approach to Leverage Big Model Dynamics for Efficient Small Model Training [32.154166415680066]
Methods like distillation, compression or quantization help leverage the highly performant large models to induce smaller performant ones. This paper explores the hypothesis that a single training run can simultaneously train a larger model for performance and derive a smaller model for deployment.
arXiv Detail & Related papers (2024-02-07T17:07:41Z)
PanGu-$\pi$: Enhancing Language Model Architectures via Nonlinearity Compensation [97.78045712375047]
We present a new efficient model architecture for large language models (LLMs) We show that PanGu-$pi$-7B can achieve a comparable performance to that of benchmarks with about 10% inference speed-up. In addition, we have deployed PanGu-$pi$-7B in the high-value domains of finance and law, developing an LLM named YunShan for practical application.
arXiv Detail & Related papers (2023-12-27T11:49:24Z)
METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals [151.3601429216877]
We present an efficient method of pretraining large-scale autoencoding language models using training signals generated by an auxiliary model. We propose a recipe, namely "Model generated dEnoising TRaining Objective" (METRO) The resultant models, METRO-LM, consisting of up to 5.4 billion parameters, achieve new state-of-the-art on the GLUE, SuperGLUE, and SQuAD benchmarks.
arXiv Detail & Related papers (2022-04-13T21:39:15Z)
Model-Agnostic Multitask Fine-tuning for Few-shot Vision-Language Transfer Learning [59.38343286807997]
We propose Model-Agnostic Multitask Fine-tuning (MAMF) for vision-language models on unseen tasks. Compared with model-agnostic meta-learning (MAML), MAMF discards the bi-level optimization and uses only first-order gradients. We show that MAMF consistently outperforms the classical fine-tuning method for few-shot transfer learning on five benchmark datasets.
arXiv Detail & Related papers (2022-03-09T17:26:53Z)
Improving Non-autoregressive Generation with Mixup Training [51.61038444990301]
We present a non-autoregressive generation model based on pre-trained transformer models. We propose a simple and effective iterative training method called MIx Source and pseudo Target. Our experiments on three generation benchmarks including question generation, summarization and paraphrase generation, show that the proposed framework achieves the new state-of-the-art results.
arXiv Detail & Related papers (2021-10-21T13:04:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.