Can linguists better understand DNA?
- URL: http://arxiv.org/abs/2412.07678v3
- Date: Fri, 17 Jan 2025 08:54:50 GMT
- Title: Can linguists better understand DNA?
- Authors: Wang Liang
- Abstract summary: This study investigates whether capabilities learned on natural language transfer to gene sequences/languages.
We constructed two analogous tasks: DNA-pair classification (DNA sequence similarity) and DNA-protein-pair classification (gene coding determination).
These tasks were designed to validate the transferability of capabilities from natural language to gene sequences.
- Score: 0.0
- Abstract: Multilingual transfer ability, which reflects how well models fine-tuned on one source language can be applied to other languages, has been well studied in multilingual pre-trained models. However, the existence of such capability transfer between natural language and gene sequences/languages remains underexplored. This study addresses this gap by drawing inspiration from the sentence-pair classification task used for evaluating sentence similarity in natural language. We constructed two analogous tasks: DNA-pair classification (DNA sequence similarity) and DNA-protein-pair classification (gene coding determination). These tasks were designed to validate the transferability of capabilities from natural language to gene sequences. Even a small-scale pre-trained model like GPT-2-small, which was pre-trained on English, achieved an accuracy of 78% on the DNA-pair classification task after being fine-tuned on English sentence-pair classification data (XTREME PAWS-X). When a BERT model trained on multilingual text was used instead, the precision reached 89%. On the more complex DNA-protein-pair classification task, however, the model's output was barely distinguishable from random output. Experimental validation has confirmed that the transfer of capabilities from natural language to biological language is unequivocally present. Building on this foundation, we also investigated the impact of model parameter scale and pre-training on this capability transfer. We provide recommendations for facilitating the transfer of capabilities from natural language to genetic language, as well as new approaches for conducting biological research based on this capability. This study offers an intriguing new perspective on exploring the relationship between natural language and genetic language.
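To make the setup above concrete, here is a minimal sketch of fine-tuning GPT-2-small on English sentence-pair classification and then probing the same classifier on DNA pairs. It is illustrative only: the Hugging Face dataset id (paws-x, en), the hyperparameters, the reuse of the English BPE tokenizer on raw DNA strings, and the toy DNA pairs are assumptions, not the paper's exact data or settings.

```python
# Hypothetical sketch: fine-tune GPT-2-small on English sentence-pair
# classification, then probe the same classifier on DNA sequence pairs.
import torch
from datasets import load_dataset
from transformers import (GPT2TokenizerFast, GPT2ForSequenceClassification,
                          Trainer, TrainingArguments)

tok = GPT2TokenizerFast.from_pretrained("gpt2")
tok.pad_token = tok.eos_token                        # GPT-2 has no pad token
model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)
model.config.pad_token_id = tok.pad_token_id

def encode(batch):
    # GPT-2 has no separator token, so the two sentences are simply
    # concatenated into one input, as in standard pair classification.
    return tok(batch["sentence1"], batch["sentence2"],
               truncation=True, padding="max_length", max_length=128)

paws = load_dataset("paws-x", "en")                  # dataset id assumed
train = paws["train"].map(encode, batched=True).rename_column("label", "labels")
train.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-pairs",
                           per_device_train_batch_size=16,
                           num_train_epochs=1, logging_steps=100),
    train_dataset=train,
)
trainer.train()

# Zero-shot probe on toy DNA pairs (identical sequences should look
# "similar", unrelated sequences "dissimilar").
dna_pairs = [("ATGCGTACCTGA", "ATGCGTACCTGA"),
             ("ATGCGTACCTGA", "TTAACCGGTTAA")]
enc = tok([a for a, _ in dna_pairs], [b for _, b in dna_pairs],
          truncation=True, padding=True, return_tensors="pt")
enc = {k: v.to(model.device) for k, v in enc.items()}
with torch.no_grad():
    print(model(**enc).logits.argmax(dim=-1).tolist())  # 1 = similar, 0 = not
```

The paper's actual DNA-pair data and evaluation protocol are not reproduced here; the sketch only shows the mechanism, a pair classifier trained purely on English text being applied unchanged to DNA strings.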
Related papers
- VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as a learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings (a minimal sketch of the codebook-lookup idea appears after this list).
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
- Measuring Cross-lingual Transfer in Bytes [9.011910726620538]
We show that models from diverse languages perform similarly to a target language in a cross-lingual setting.
We also found evidence that this transfer is not related to language contamination or language proximity.
Our experiments have opened up new possibilities for measuring how much data represents the language-agnostic representations learned during pretraining.
arXiv Detail & Related papers (2024-04-12T01:44:46Z)
- Efficient and Scalable Fine-Tune of Language Models for Genome Understanding [49.606093223945734]
We present Lingo: Language prefix fIne-tuning for GenOmes.
Unlike DNA foundation models, Lingo strategically leverages natural language foundation models' contextual cues.
Lingo further accommodates numerous downstream fine-tuning tasks via an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z)
- Generative Language Models on Nucleotide Sequences of Human Genes [0.0]
This study focuses on developing an autoregressive generative language model like GPT-3 for DNA sequences.
Because working with whole DNA sequences is challenging without substantial computational resources, we decided to carry out our study on a smaller scale.
First of all, we systematically examined an almost entirely unexplored problem and observed that RNNs performed the best.
We also observed how essential it is to evaluate on real-life tasks beyond classical metrics such as perplexity.
arXiv Detail & Related papers (2023-07-20T06:59:02Z)
- Languages You Know Influence Those You Learn: Impact of Language Characteristics on Multi-Lingual Text-to-Text Transfer [4.554080966463776]
Multi-lingual language models (LM) have been remarkably successful in enabling natural language tasks in low-resource languages.
We try to better understand how such models, specifically mT5, transfer *any* linguistic and semantic knowledge across languages.
A key finding of this work is that similarity of syntax, morphology and phonology are good predictors of cross-lingual transfer.
arXiv Detail & Related papers (2022-12-04T07:22:21Z)
- Geographical Distance Is The New Hyperparameter: A Case Study Of Finding The Optimal Pre-trained Language For English-isiZulu Machine Translation [0.0]
This study explores the potential benefits of transfer learning in an English-isiZulu translation framework.
We gathered results from 8 different language corpora, including one multi-lingual corpus, and saw that isiXa-isiZulu outperformed all languages.
We also derived a new coefficient, Nasir's Geographical Distance Coefficient (NGDC) which provides an easy selection of languages for the pre-trained models.
arXiv Detail & Related papers (2022-05-17T20:41:25Z)
- Linking Emergent and Natural Languages via Corpus Transfer [98.98724497178247]
We propose a novel way to establish a link by corpus transfer between emergent languages and natural languages.
Our approach showcases non-trivial transfer benefits for two different tasks -- language modeling and image captioning.
We also introduce a novel metric to predict the transferability of an emergent language by translating emergent messages to natural language captions grounded on the same images.
arXiv Detail & Related papers (2022-03-24T21:24:54Z)
- Cross-Lingual Ability of Multilingual Masked Language Models: A Study of Language Structure [54.01613740115601]
We study three language properties: constituent order, composition and word co-occurrence.
Our main conclusion is that the contribution of constituent order and word co-occurrence is limited, while the composition is more crucial to the success of cross-linguistic transfer.
arXiv Detail & Related papers (2022-03-16T07:09:35Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
- Cross-lingual, Character-Level Neural Morphological Tagging [57.0020906265213]
We train character-level recurrent neural taggers to predict morphological taggings for high-resource languages and low-resource languages together.
Learning joint character representations among multiple related languages successfully enables knowledge transfer from the high-resource languages to the low-resource ones, improving accuracy by up to 30% over a monolingual model.
arXiv Detail & Related papers (2017-08-30T08:14:34Z)
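The VQDNA entry near the top of this list turns on using vector-quantized codebooks as a learnable genome vocabulary. The following is a minimal, hypothetical sketch of that codebook-lookup idea, with a random embedding table standing in for a learned encoder; the k-mer length, embedding dimension, and codebook size are arbitrary example values, not those used by VQDNA.

```python
# Illustrative vector-quantization tokenizer for DNA: each k-mer embedding is
# replaced by the id of its nearest codebook vector.  In a real model both the
# encoder and the codebook would be learned end to end.
import numpy as np

K, DIM, CODEBOOK_SIZE = 4, 16, 64
rng = np.random.default_rng(0)
codebook = rng.normal(size=(CODEBOOK_SIZE, DIM))     # learnable in practice
kmer_table = rng.normal(size=(4 ** K, DIM))          # stand-in for a learned encoder
BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_index(kmer: str) -> int:
    # Map a k-mer such as "ACGT" to a row of the embedding table.
    idx = 0
    for base in kmer:
        idx = idx * 4 + BASES[base]
    return idx

def quantize(seq: str) -> list[int]:
    # Tokenize a DNA sequence into codebook ids via nearest-neighbour lookup
    # over non-overlapping k-mers.
    ids = []
    for i in range(0, len(seq) - K + 1, K):
        emb = kmer_table[kmer_index(seq[i:i + K])]
        dists = np.linalg.norm(codebook - emb, axis=1)
        ids.append(int(dists.argmin()))
    return ids

print(quantize("ATGCGTACCTGAATGC"))  # four codebook ids, one per 4-mer
```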
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.