Fugu-MT paper translation (summary): Leveraging Large Language Models for Automated Scalable Development of Open Scientific Databases

Paper summary: Leveraging Large Language Models for Automated Scalable Development of Open Scientific Databases

arxiv url: http://arxiv.org/abs/2603.07050v1
Date: Sat, 07 Mar 2026 05:58:58 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:13.707611
Title: Leveraging Large Language Models for Automated Scalable Development of Open Scientific Databases
Authors: Nikita Gautam, Doina Caragea, Ignacio Ciampitti, Federico Gomez,
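Abstract summary: This paper introduces a web-based tool that uses Large Language Models (LLMs) for the automated and scalable development of open scientific databases. The tool is built on an automated, unified framework that combines keyword-based querying, API-enabled data retrieval, and LLM-based text classification. The proposed framework is scalable and domain-agnostic, and it can be applied across diverse fields to build scalable open scientific databases.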
Reference score (independently computed attention measure): 3.332543256537694
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the exponential increase in online scientific literature, identifying reliable domain-specific data has become increasingly important but also very challenging. Manual data collection and filtering for domain-specific scientific literature is not only time-consuming but also labor-intensive and prone to errors and inconsistencies. To facilitate automated data collection, the paper introduces a web-based tool that leverages Large Language Models (LLMs) for automated and scalable development of open scientific databases. More specifically, the tool is based on an automated and unified framework that combines keyword-based querying, API-enabled data retrieval, and LLM-powered text classification to construct domain-specific scientific databases. Data is collected from multiple reliable data sources and search engines using a parallel querying technique to construct a combined unified dataset. The dataset is subsequently filtered using LLMs queried with prompts tailored for each keyword-based query to extract the data relevant to a scientific query of interest. The approach was tested across a set of variable keyword-based searches for different domain-specific tasks related to agriculture and crop yield. The results and analysis show 90% overlap with small domain expert-curated databases, suggesting that the proposed tool can be used to significantly reduce manual workload. Furthermore, the proposed framework is both scalable and domain-agnostic and can be applied across diverse fields for building scalable open scientific databases.
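Illustrative sketch: the abstract describes a pipeline of parallel keyword querying, merging into a unified dataset, LLM-based relevance filtering with a query-specific prompt, and an overlap check against an expert-curated list. The Python below is a minimal sketch of that flow under stated assumptions, not the authors' implementation. Every name in it (`query_sources_in_parallel`, `merge_records`, `build_prompt`, `filter_relevant`, `overlap_fraction`), the record fields (`title`, `doi`, `abstract`), the prompt wording, and the deduplication rule are hypothetical. The LLM is represented by a caller-supplied `classify` function, since the abstract does not name a model or API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]                 # assumed fields: title, doi, abstract
Source = Callable[[str], List[Record]]  # hypothetical: one callable per data source


def query_sources_in_parallel(query: str, sources: List[Source]) -> List[Record]:
    """Send the same keyword query to every source concurrently."""
    if not sources:
        return []
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        # pool.map keeps the results in the same order as `sources`.
        batches = list(pool.map(lambda source: source(query), sources))
    return [record for batch in batches for record in batch]


def normalize_title(title: str) -> str:
    """Lowercase and collapse whitespace so near-identical titles match."""
    return " ".join(title.lower().split())


def merge_records(records: Iterable[Record]) -> List[Record]:
    """Build one unified dataset, dropping duplicates by DOI, else by title."""
    seen = set()
    unified = []
    for record in records:
        doi = (record.get("doi") or "").strip().lower()
        key = doi or normalize_title(record.get("title", ""))
        if key and key not in seen:
            seen.add(key)
            unified.append(record)
    return unified


def build_prompt(query: str, record: Record) -> str:
    """Assumed prompt template, tailored to the keyword query."""
    return (
        f"Query: {query}\n"
        f"Title: {record.get('title', '')}\n"
        f"Abstract: {record.get('abstract', '')}\n"
        "Is this article relevant to the query? Answer yes or no."
    )


def filter_relevant(
    records: Iterable[Record], query: str, classify: Callable[[str], bool]
) -> List[Record]:
    """Keep records the classifier (an LLM call in practice) marks relevant."""
    return [r for r in records if classify(build_prompt(query, r))]


def overlap_fraction(curated_titles: Iterable[str], retrieved: Iterable[Record]) -> float:
    """Fraction of expert-curated titles that also appear in the retrieved set."""
    curated = {normalize_title(t) for t in curated_titles}
    if not curated:
        return 0.0
    found = {normalize_title(r.get("title", "")) for r in retrieved}
    return len(curated & found) / len(curated)
```

In use, each entry in `sources` would wrap one search API, and `classify` would send the prompt to an LLM and map its yes/no answer to a boolean. Passing the classifier in as a parameter keeps the sketch runnable with a simple stub and avoids assuming any particular model provider.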

