Fugu-MT Paper Translation (Abstract): Towards Imputation of Pre-Trained Language Model Metadata using Semantic Fingerprinting

Paper overview: Towards Imputation of Pre-Trained Language Model Metadata using Semantic Fingerprinting

arxiv url: http://arxiv.org/abs/2606.21787v1
Date: Fri, 19 Jun 2026 22:36:21 GMT
Status: Translation complete
Last updated in system: 2026-06-26 03:17:29.664226
Title: Towards Imputation of Pre-Trained Language Model Metadata using Semantic Fingerprinting
Authors: Adekunle Ajibode, Oussama Ben Sghaier, Keheliya Gallaba, Bram Adams, Ahmed E. Hassan
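Abstract summary: Pre-trained language models (PTLMs) hosted on platforms such as Hugging Face form complex lineage structures similar to software dependency graphs. Unlike traditional software ecosystems, PTLM repositories often lack reliable provenance due to missing metadata. This paper introduces Semantic Fingerprinting (SemFin), a lightweight approach that combines Hugging Face (HF) configuration files with model repository tags.
Reference score (independently computed attention score): 9.039328994118895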
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Pre-trained language models (PTLMs) hosted on platforms such as Hugging Face form complex lineage structures similar to software dependency graphs. However, unlike traditional software ecosystems, PTLM repositories often lack reliable provenance due to missing metadata, such as licenses, reuse methods, pipeline tags, model types, and training libraries. To address this gap, we introduce Semantic Fingerprinting (SemFin), a lightweight approach that combines Hugging Face (HF) configuration files with model repository tags to automatically impute missing model metadata fields and reconstruct model lineage chains. We evaluate SemFin on a large-scale dataset of 317,133 PTLMs. Our results show that configuration files typically encode the technical requirements necessary to instantiate and reuse models, enabling them to serve as a structural blueprint for model reuse, particularly for transformer-based architectures. By combining these configuration files with model repository tags, SemFin significantly outperforms the existing propagation-based imputation approaches, improving prediction accuracy by up to 31.4% and 26.6% compared to Graph Avg and Hub Avg baselines. Importantly, SemFin also imputes metadata for 16.6% of isolated models where propagation-based methods fail. Applying SemFin to impute missing reuse-method and license metadata for 167,089 unlabeled models reveals that traceable reuse method chains expand by 75.9% and license lineage chains by 53.6%, uncovering 86 previously invisible reuse method patterns, while the proportion of incompatible license patterns only increases from 34.8% to 36.8%. These findings demonstrate how automatically derived structural signals can support the automated construction of AI Bills of Materials (AIBOMs), helping transform metadata from an error-prone manual declaration into information inferred directly from model artifacts.
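
Illustrative sketch (not the authors' implementation): the abstract says only that SemFin combines HF configuration files with model repository tags to impute missing metadata fields; it does not specify the algorithm. The Python below shows one simple way such an idea could work, under loudly stated assumptions: a "fingerprint" is taken to be the set of config key/value pairs plus tags, similarity is Jaccard overlap, and a missing field is filled by majority vote among the k most similar labeled models. The function names, the example configs, tags, and license labels are all hypothetical.

```python
from collections import Counter


def fingerprint(config: dict, tags: list) -> frozenset:
    """Flatten config key/value pairs and repo tags into a set of string tokens.

    ASSUMPTION: this token-set encoding is illustrative only; the paper's
    abstract does not describe how SemFin encodes configs or tags.
    """
    tokens = {f"cfg:{key}={value}" for key, value in config.items()}
    tokens |= {f"tag:{tag}" for tag in tags}
    return frozenset(tokens)


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def impute(target: frozenset, labeled: list, k: int = 3):
    """Return the majority label among the k most similar labeled models.

    `labeled` is a list of (fingerprint, label) pairs. Neighbors with zero
    overlap are ignored; returns None when no neighbor shares any token.
    """
    scored = sorted(
        ((jaccard(target, fp), label) for fp, label in labeled),
        key=lambda pair: pair[0],
        reverse=True,
    )
    top = [label for score, label in scored[:k] if score > 0]
    return Counter(top).most_common(1)[0][0] if top else None


# Hypothetical labeled models (made-up configs, tags, and license labels).
labeled = [
    (fingerprint({"model_type": "bert", "architectures": ["BertForMaskedLM"]},
                 ["fill-mask", "pytorch"]), "apache-2.0"),
    (fingerprint({"model_type": "bert", "architectures": ["BertForMaskedLM"]},
                 ["fill-mask"]), "apache-2.0"),
    (fingerprint({"model_type": "llama"}, ["text-generation"]), "mit"),
]

# A model whose license field is missing.
target = fingerprint({"model_type": "bert", "architectures": ["BertForMaskedLM"]},
                     ["fill-mask", "safetensors"])

imputed_license = impute(target, labeled, k=3)
```

Because the fingerprint is built from the model's own artifacts rather than from its neighbors in a lineage graph, a sketch like this can still produce a prediction for a model with no recorded parent or child, which is the situation the abstract describes for isolated models.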
