Fugu-MT Paper Translation (Abstract): SCOPE: Siamese Contrastive Operon Pair Embeddings for Functional Sequence Representation and Classification

Paper overview: SCOPE: Siamese Contrastive Operon Pair Embeddings for Functional Sequence Representation and Classification

arxiv url: http://arxiv.org/abs/2605.11022v1
Date: Sun, 10 May 2026 16:52:39 GMT
Status: Translation complete
Last updated in system: 2026-05-13 21:48:56.309984
Title: SCOPE: Siamese Contrastive Operon Pair Embeddings for Functional Sequence Representation and Classification
Authors: Akarsh Gupta, Kenneth Rodrigues, Sagnik Chatterjee
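Abstract summary: Identifying operons is a fundamental step in understanding prokaryotic gene regulation. The DGEB benchmark evaluates operonic pair classification by embedding each sequence independently with a pre-trained protein language model. Protein language model embeddings substantially outperform physicochemical features in ROC-AUC, but a learned Siamese head does not significantly improve over cosine similarity in Average Precision. These results suggest that protein language model embeddings are a viable, scalable foundation for operonic pair classification.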
Reference score (independently computed attention score): 0.3823356975862005
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Identifying operons is a fundamental step in understanding prokaryotic gene regulation, as classifying genes into operons supports the reconstruction of regulatory networks, functional annotation of unannotated genes, and drug candidate development. Experimental approaches such as RT-PCR and RNA-seq provide precise evidence of operon structure, but are laborious and largely limited to well-studied model organisms, making scalable computational methods essential for genome-wide operon identification. Prior computational approaches have employed traditional classifiers such as logistic regression and decision trees, motivating our use of these as physicochemical baselines. The DGEB benchmark evaluates operonic pair classification by embedding each sequence independently with a pre-trained protein language model and computing pairwise cosine similarity. In contrast, our Siamese MLP learns a classifier over the fused embedding space, which is theoretically better motivated for binary classification, as cosine similarity can yield meaningless scores depending on the regularization of the embedding model. While protein language model embeddings substantially outperform physicochemical features in ROC-AUC, a learned Siamese MLP head does not significantly improve over unsupervised cosine similarity in Average Precision, suggesting that the geometry of the embedding space already captures the functional relationships needed for this task. Nonetheless, our Siamese MLP achieves a ROC-AUC of 0.71, competitive with state-of-the-art models on the DGEB leaderboard. These findings indicate that protein language model embeddings are a viable, scalable foundation for operonic pair classification across diverse microbial genomes, with implications for automated genome annotation, regulatory network reconstruction, and characterization of organisms lacking experimental operon annotations.
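The two scoring strategies contrasted in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embeddings are random stand-ins for protein language model outputs, the fusion `[|a - b|, a * b]` is a hypothetical choice (the abstract does not say how the pair is fused), and the MLP head that would be trained on the fused features is omitted. The `roc_auc` helper uses the standard rank-sum formulation and assumes untied scores.

```python
import numpy as np


def cosine_similarity(a, b):
    """Row-wise cosine similarity between two (n_pairs, dim) arrays."""
    a_norm = np.linalg.norm(a, axis=1)
    b_norm = np.linalg.norm(b, axis=1)
    # Small constant guards against division by zero for all-zero rows.
    return np.sum(a * b, axis=1) / (a_norm * b_norm + 1e-12)


def fuse_pair(a, b):
    """Hypothetical symmetric fusion of a pair: [|a - b|, a * b].

    Assumption for illustration only; a learned classifier head
    (e.g. an MLP) would take these features as input.
    """
    return np.concatenate([np.abs(a - b), a * b], axis=1)


def roc_auc(scores, labels):
    """ROC-AUC via the rank-sum (Mann-Whitney U) formula; assumes no ties."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    u_stat = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random stand-ins for per-gene embeddings of adjacent gene pairs.
    emb_a = rng.normal(size=(8, 16))
    emb_b = rng.normal(size=(8, 16))
    labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])  # made-up operonic labels

    # Unsupervised baseline: the cosine score is used directly for ranking.
    baseline_scores = cosine_similarity(emb_a, emb_b)
    print("baseline ROC-AUC on random data:", roc_auc(baseline_scores, labels))

    # Supervised alternative: fused features that a classifier would consume.
    print("fused feature shape:", fuse_pair(emb_a, emb_b).shape)
```

The fusion here is symmetric in its two arguments, so swapping the order of the genes in a pair does not change the features; whether the paper's model has this property is not stated in the abstract.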

Related papers