Fugu-MT paper translation (abstract): DELM: a Python toolkit for Data Extraction with Language Models

Paper abstract: DELM: a Python toolkit for Data Extraction with Language Models

arxiv url: http://arxiv.org/abs/2509.20617v1
Date: Wed, 24 Sep 2025 23:47:55 GMT
Title: DELM: a Python toolkit for Data Extraction with Language Models
Authors: Eric Fithian, Kirill Skobelev
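Abstract summary: DELM (Data Extraction with Language Models) is an open-source Python toolkit designed for rapid experimental iteration of data extraction pipelines. It minimizes boilerplate code and offers a modular framework with structured outputs, built-in validation, flexible data-loading and scoring strategies, and efficient batch processing. It also includes robust support for working with LLM APIs, including retry logic, result caching, detailed cost tracking, and comprehensive configuration management.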
Reference score (attention score computed independently by this site): 0.0
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Large Language Models (LLMs) have become powerful tools for annotating unstructured data. However, most existing workflows rely on ad hoc scripts, making reproducibility, robustness, and systematic evaluation difficult. To address these challenges, we introduce DELM (Data Extraction with Language Models), an open-source Python toolkit designed for rapid experimental iteration of LLM-based data extraction pipelines and for quantifying the trade-offs between them. DELM minimizes boilerplate code and offers a modular framework with structured outputs, built-in validation, flexible data-loading and scoring strategies, and efficient batch processing. It also includes robust support for working with LLM APIs, featuring retry logic, result caching, detailed cost tracking, and comprehensive configuration management. We showcase DELM's capabilities through two case studies: one featuring a novel prompt optimization algorithm, and another illustrating how DELM quantifies trade-offs between cost and coverage when selecting keywords to decide which paragraphs to pass to an LLM. DELM is available at https://github.com/Center-for-Applied-AI/delm.
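
The cost-versus-coverage trade-off in the second case study can be illustrated with a small sketch. The abstract does not show DELM's API, so the code below does not use it: the `Paragraph` class, the `keyword_filter_tradeoff` function, the toy corpus, and the definitions of "cost" (share of paragraphs sent to the LLM) and "coverage" (share of relevant paragraphs that pass the filter) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paragraph:
    text: str
    relevant: bool  # hypothetical ground-truth label, not a DELM concept


def keyword_filter_tradeoff(paragraphs, keywords):
    """Return (cost, coverage) for a case-insensitive keyword pre-filter.

    cost     = share of all paragraphs that would be sent to the LLM
    coverage = share of relevant paragraphs that survive the filter
    Both definitions are assumptions of this sketch.
    """
    lowered = [k.lower() for k in keywords]
    selected = [p for p in paragraphs if any(k in p.text.lower() for k in lowered)]
    total_relevant = sum(p.relevant for p in paragraphs)
    kept_relevant = sum(p.relevant for p in selected)
    cost = len(selected) / len(paragraphs) if paragraphs else 0.0
    coverage = kept_relevant / total_relevant if total_relevant else 0.0
    return cost, coverage


# Toy corpus invented for this example only.
CORPUS = [
    Paragraph("Oil prices rose to 80 dollars per barrel.", True),
    Paragraph("The company expects crude output to fall.", True),
    Paragraph("Our CEO thanked employees for their work.", False),
    Paragraph("Price discipline remains a priority for the board.", False),
]

narrow = keyword_filter_tradeoff(CORPUS, ["barrel"])
broad = keyword_filter_tradeoff(CORPUS, ["price", "crude"])
print("narrow keyword set:", narrow)
print("broad keyword set: ", broad)
```

On this toy corpus the narrow keyword set sends fewer paragraphs to the model but misses one relevant paragraph, while the broad set recovers both relevant paragraphs at the price of sending more text. That is the kind of trade-off the abstract says DELM quantifies.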

Related papers