Fugu-MT paper translation (abstract): On the synchronization between Hugging Face pre-trained language models and their upstream GitHub repository

Paper overview: On the synchronization between Hugging Face pre-trained language models and their upstream GitHub repository

arxiv url: http://arxiv.org/abs/2508.10157v1
Date: Wed, 13 Aug 2025 19:45:09 GMT
Status: Translation complete
Last updated in system: 2025-08-15 22:24:48.101946
Title: On the synchronization between Hugging Face pre-trained language models and their upstream GitHub repository
Authors: Ajibode Adekunle, Abdul Ali Bangash, Bram Adams, Ahmed E. Hassan
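Abstract summary: Pretrained language models (PTLMs) have advanced natural language processing (NLP). PTLMs are trained using code and environment scripts in upstream repositories (e.g., GitHub, GH) and distributed as variants via downstream platforms such as Hugging Face (HF). Coordinating development between GH and HF poses challenges such as misaligned release timelines, inconsistent versioning, and limited reuse of PTLM variants.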
Reference score (independently computed attention score): 11.828311976126303
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Pretrained language models (PTLMs) have advanced natural language processing (NLP), enabling progress in tasks like text generation and translation. Like software package management, PTLMs are trained using code and environment scripts in upstream repositories (e.g., GitHub, GH) and distributed as variants via downstream platforms like Hugging Face (HF). Coordinating development between GH and HF poses challenges such as misaligned release timelines, inconsistent versioning, and limited reuse of PTLM variants. We conducted a mixed-method study of 325 PTLM families (904 HF variants) to examine how commit activities are coordinated. Our analysis reveals that GH contributors typically make changes related to specifying the version of the model, improving code quality, performance optimization, and dependency management within the training scripts, while HF contributors make changes related to improving model descriptions, data set handling, and setup required for model inference. Furthermore, to understand the synchronization aspects of commit activities between GH and HF, we examined three dimensions of these activities -- lag (delay), type of synchronization, and intensity -- which together yielded eight distinct synchronization patterns. The prevalence of partially synchronized patterns, such as Disperse synchronization and Sparse synchronization, reveals structural disconnects in current cross-platform release practices. These patterns often result in isolated changes -- where improvements or fixes made on one platform are never replicated on the other -- and in some cases, indicate an abandonment of one repository in favor of the other. Such fragmentation risks exposing end users to incomplete, outdated, or behaviorally inconsistent models. Hence, recognizing these synchronization patterns is critical for improving oversight and traceability in PTLM release workflows.

Related papers