Fugu-MT Paper Translation (Abstract): DPF-CM: A Data Processing Framework with Privacy-Preserving Vector Databases for Chinese Medical LLMs Training and Deployment

Paper abstract: DPF-CM: A Data Processing Framework with Privacy-Preserving Vector Databases for Chinese Medical LLMs Training and Deployment

arxiv url: http://arxiv.org/abs/2509.01354v1
Date: Mon, 01 Sep 2025 10:49:32 GMT
Status: Translation complete
Last updated in system: 2025-09-04 15:17:03.652168
Title: DPF-CM: A Data Processing Framework with Privacy-Preserving Vector Databases for Chinese Medical LLMs Training and Deployment
Authors: Wei Huang, Anda Cheng, Zhao Zhang, Yinggui Wang
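Abstract summary: DPF-CM is a data processing framework for Chinese medical models. The first module is a data processing pipeline tailored for model training. The second module focuses on privacy preservation during model deployment.
Reference score (attention score computed independently by this site): 13.757046926346936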
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Current open-source training pipelines for Chinese medical language models predominantly emphasize optimizing training methodologies to enhance the performance of large language models (LLMs), yet lack comprehensive exploration into training data processing. To address this gap, we propose DPF-CM, a holistic Data Processing Framework for Chinese Medical LLMs training and deployment. DPF-CM comprises two core modules. The first module is a data processing pipeline tailored for model training. Beyond standard data processing operations, we (1) introduce a chained examples context-learning strategy to generate question-oriented instructions to mitigate the lack of instruction content, and (2) implement an ensemble-based filtering mechanism for preference data curation that averages multiple reward models to suppress noisy samples. The second module focuses on privacy preservation during model deployment. To prevent privacy risks from the inadvertent exposure of training data, we propose a Privacy Preserving Vector Database (PPVD) approach, which involves model memory search, high-risk database construction, secure database construction, and match-and-replace, four key stages to minimize privacy leakage during inference collectively. Experimental results show that DPF-CM significantly improves model accuracy, enabling our trained Chinese medical LLM to achieve state-of-the-art performance among open-source counterparts. Moreover, the framework reduces training data privacy leakage by 27%.
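
The abstract describes the preference-data filter only as averaging multiple reward models to suppress noisy samples. A minimal sketch of that idea follows, under stated assumptions: the reward models are stand-in functions (`length_reward`, `keyword_reward`), the keep rule (averaged score of the chosen response must exceed that of the rejected response by `min_margin`) and the sample pairs are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, List

# Assumption: a reward model is any function (prompt, response) -> score.
RewardModel = Callable[[str, str], float]


def ensemble_score(models: List[RewardModel], prompt: str, response: str) -> float:
    """Average the scores of all reward models for one response."""
    scores = [model(prompt, response) for model in models]
    return sum(scores) / len(scores)


def filter_preference_pairs(
    pairs: List[Dict[str, str]],
    models: List[RewardModel],
    min_margin: float = 0.0,
) -> List[Dict[str, str]]:
    """Keep pairs whose 'chosen' response beats 'rejected' on the averaged score.

    The margin rule is an illustrative assumption, not the paper's criterion.
    """
    kept = []
    for pair in pairs:
        chosen = ensemble_score(models, pair["prompt"], pair["chosen"])
        rejected = ensemble_score(models, pair["prompt"], pair["rejected"])
        if chosen - rejected > min_margin:
            kept.append(pair)
    return kept


# Hypothetical toy reward models standing in for trained ones.
def length_reward(prompt: str, response: str) -> float:
    return min(len(response.split()), 20) / 20


def keyword_reward(prompt: str, response: str) -> float:
    return 1.0 if "doctor" in response.lower() else 0.0


good = "Please see a doctor if the fever lasts more than three days."
bad = "Ignore it."
pairs = [
    {"prompt": "I have had a fever for three days.", "chosen": good, "rejected": bad},
    # Noisy sample: the preference labels are flipped.
    {"prompt": "I have had a fever for three days.", "chosen": bad, "rejected": good},
]

kept = filter_preference_pairs(pairs, [length_reward, keyword_reward])
print(len(kept), "of", len(pairs), "pairs kept")
```

Averaging is meant to dampen the idiosyncratic errors of any single reward model, so a pair is dropped only when the ensemble as a whole disagrees with its label.
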
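
The abstract names four PPVD stages (model memory search, high-risk database construction, secure database construction, and match-and-replace) without giving their details. The sketch below illustrates only one possible reading of the final match-and-replace stage, and every specific in it is an assumption: the bag-of-words `embed` function stands in for a real text encoder, the in-memory lists stand in for vector databases, the cosine threshold of 0.8 is arbitrary, and the patient record is invented for illustration.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a real system would use a neural encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical high-risk entries (text assumed to be memorized by the model)
# and their sanitized counterparts, aligned by index.
HIGH_RISK: List[str] = [
    "patient zhang wei aged 54 was diagnosed with type 2 diabetes",
]
SECURE: List[str] = [
    "a middle-aged patient was diagnosed with type 2 diabetes",
]
HIGH_RISK_VECTORS = [embed(text) for text in HIGH_RISK]


def match_and_replace(sentence: str, threshold: float = 0.8) -> str:
    """Return the sanitized entry if the sentence closely matches a high-risk one."""
    query = embed(sentence)
    best_index, best_score = -1, 0.0
    for index, vector in enumerate(HIGH_RISK_VECTORS):
        score = cosine(query, vector)
        if score > best_score:
            best_index, best_score = index, score
    if best_score >= threshold:
        return SECURE[best_index]
    return sentence


leaky = "Patient Zhang Wei aged 54 was diagnosed with type 2 diabetes"
harmless = "Drink plenty of water and rest"
print(match_and_replace(leaky))
print(match_and_replace(harmless))
```

In this reading, output that does not resemble any high-risk entry passes through unchanged, so the replacement step only touches text that looks like memorized training data.
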
