Fugu-MT Paper Translation (Abstract): Benchmarking Local LLMs for Natural-Language-to-SQL Querying in Biopharmaceutical Manufacturing: An Empirical Benchmark on Consumer-Grade Hardware

Paper overview: Benchmarking Local LLMs for Natural-Language-to-SQL Querying in Biopharmaceutical Manufacturing: An Empirical Benchmark on Consumer-Grade Hardware

arxiv url: http://arxiv.org/abs/2606.01338v1
Date: Sun, 31 May 2026 16:41:26 GMT
Status: Translation complete
System update date: 2026-06-02 21:34:29.59387
Title: Benchmarking Local LLMs for Natural-Language-to-SQL Querying in Biopharmaceutical Manufacturing: An Empirical Benchmark on Consumer-Grade Hardware
Title (reference translation): Benchmarking Local LLMs for Natural-Language-to-SQL Querying in Biopharmaceutical Manufacturing: An Empirical Benchmark on Consumer-Grade Hardware
Authors: Sagar Bhetwal, Rajan Bastakoti, Nirajan Acharya, Gaurav Kumar Gupta,
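Abstract summary: Locally deployed large language models (LLMs) offer a privacy-preserving alternative, but their suitability for pharmaceutical manufacturing tasks remains underexplored. This study evaluates four open-source LLMs deployed locally via Ollama for natural-language-to-SQL generation over a pharmaceutical manufacturing database.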
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Biopharmaceutical manufacturing organizations operate under regulatory frameworks such as FDA guidance, EU Good Manufacturing Practice (GMP), and the EU AI Act, which can restrict the use of cloud-based artificial intelligence systems. Locally deployed large language models (LLMs) offer a privacy-preserving alternative, but their suitability for pharmaceutical manufacturing tasks remains underexplored. This study evaluates four open-source LLMs (Qwen 2.5 Coder 7B, Llama 3.1 8B, Mistral 7B, and Meditron 7B) deployed locally via Ollama for natural-language-to-SQL generation over a pharmaceutical manufacturing database. A FastAPI-based evaluation platform, PharmaBatchDB AI, was developed using a synthetic Microsoft SQL Server database containing approximately 63,000 records across Batch, Manufacturing Execution System (MES), and Clean-In-Place (CIP) modules. Models were benchmarked on 60 domain-specific natural-language questions using metrics including SQL extraction rate, SQL compliance, factual consistency, ROUGE-L, hallucination rate, throughput, and latency. Qwen 2.5 Coder 7B, Llama 3.1 8B, and Mistral 7B generated SQL for all evaluation tasks, while Meditron 7B failed on nearly all tasks due to context-window limitations and poor SQL generation capability. Llama 3.1 8B achieved the highest SQL compliance, whereas Qwen 2.5 Coder 7B achieved the strongest overall text similarity and factual consistency. Performance differences between the two leading models were not statistically significant. The results show that code-tuned general-purpose LLMs outperform a domain-specific biomedical model on structured query generation for pharmaceutical manufacturing data. Although fully local, GxP-aligned NLQ systems are feasible on consumer hardware, current performance levels still require human oversight and downstream validation for regulated use.
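Abstract (reference translation): Biopharmaceutical manufacturing organizations operate under regulatory frameworks such as FDA guidance, EU Good Manufacturing Practice (GMP), and the EU AI Act, which can restrict the use of cloud-based artificial intelligence systems. Locally deployed large language models (LLMs) offer a privacy-preserving alternative, but their suitability for pharmaceutical manufacturing tasks has not yet been well explored. This study evaluates four open-source LLMs (Qwen 2.5 Coder 7B, Llama 3.1 8B, Mistral 7B, and Meditron 7B), deployed locally via Ollama, for natural-language-to-SQL generation over a pharmaceutical manufacturing database. PharmaBatchDB AI, a FastAPI-based evaluation platform, was developed using a synthetic Microsoft SQL Server database containing approximately 63,000 records across Batch, Manufacturing Execution System (MES), and Clean-In-Place (CIP) modules. The models were benchmarked on 60 domain-specific natural-language questions using metrics including SQL extraction rate, SQL compliance, factual consistency, ROUGE-L, hallucination rate, throughput, and latency. Qwen 2.5 Coder 7B, Llama 3.1 8B, and Mistral 7B generated SQL for all evaluation tasks, whereas Meditron 7B failed on nearly all tasks because of context-window limitations and poor SQL generation capability. Llama 3.1 8B achieved the highest SQL compliance, and Qwen 2.5 Coder 7B achieved the strongest overall text similarity and factual consistency. Performance differences between the two leading models were not statistically significant. The results show that code-tuned general-purpose LLMs outperform a domain-specific biomedical model on structured query generation for pharmaceutical manufacturing data. Fully local, GxP-aligned NLQ systems are feasible on consumer hardware, but current performance levels still require human oversight and downstream validation for regulated use.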
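Two of the metrics named in the abstract, SQL extraction rate and ROUGE-L, can be illustrated with a minimal Python sketch. The abstract does not define how the authors computed either metric, so everything here is an assumption: the function names (extract_sql, extraction_rate, rouge_l_f1) are hypothetical, the extraction heuristic (first fenced block, then the first SELECT or WITH ... AS statement) is a guess at a plausible approach, and ROUGE-L is computed as a plain F1 over whitespace tokens using the longest common subsequence.

```python
import re

# Hypothetical helpers: the paper's actual extraction and scoring logic
# is not described in the abstract, so this is only an illustrative sketch.
_TICKS = "`" * 3  # a Markdown code fence, built indirectly for readability
_FENCE = re.compile(_TICKS + r"(?:sql)?\s*(.*?)" + _TICKS, re.IGNORECASE | re.DOTALL)
_STATEMENT = re.compile(
    r"(?:\bWITH\s+\w+\s+AS\s*\(|\bSELECT\b).*?(?:;|$)",
    re.IGNORECASE | re.DOTALL,
)


def extract_sql(response):
    """Return the first SQL-looking statement in a model response, or None."""
    fenced = _FENCE.search(response)
    candidate = fenced.group(1) if fenced else response
    match = _STATEMENT.search(candidate)
    return match.group(0).strip() if match else None


def extraction_rate(responses):
    """Fraction of responses from which a SQL statement could be extracted."""
    if not responses:
        return 0.0
    hits = sum(1 for r in responses if extract_sql(r) is not None)
    return hits / len(responses)


def _lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L as an F1 score over lowercased whitespace tokens (assumed variant)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = _lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

The extraction heuristic is deliberately simple: prose that happens to contain the word "select" would count as a hit, so a real evaluation would likely also parse or execute the extracted statement before counting it.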

Related papers