Fugu-MT Paper Translation (Abstract): The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models

Paper overview: The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models

arxiv url: http://arxiv.org/abs/2604.25359v1
Date: Tue, 28 Apr 2026 08:27:01 GMT
Status: Translation complete
Last updated in system: 2026-04-29 16:49:17.772061
Title: The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models
Authors: Abhinav Kumar Singh, Harsha Vardhan Khurdula, Yoeven D Khemlani, Vineet Agarwal
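Abstract summary: SOB (Structured Output Benchmark) is a multi-source benchmark spanning three source modalities. All models receive a text-normalized representation of their context regardless of source modality. Models achieve near-perfect schema compliance, yet the best Value Accuracy, measured by exact leaf-value match, reaches only 83.0% on text.
Reference score (attention score computed independently by this site): 0.23332469289621785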
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models are increasingly being deployed to extract structured data from unstructured and semi-structured sources: parsing invoices and medical records, and converting PDF documents into database entries. Yet existing benchmarks for structured output generation either focus on schema compliance alone or evaluate value correctness within a single source domain. We introduce SOB (The Structured Output Benchmark), a multi-source benchmark spanning three source modalities: native text, images, and audio conversations. All models receive a text-normalized representation of their context regardless of source modality; this deliberate design isolates structured-output capability from raw vision or speech-processing quality, ensuring a fair, source-agnostic comparison. Our benchmark comprises 5,000 text evaluation records derived from multi-hop QA drawn from a 25,091-record full corpus; 209 image records from OCR-processed PDFs across seven document types, including multi-column layouts, dense tables, scanned historical documents, small-print text, and mathematical typesetting; and 115 audio records from the AMI corpus. Each record pairs a natural-language question with a JSON schema that the model must follow and a ground-truth answer verified against the source context. We evaluate 21 frontier and open-weight models across three source domains and seven metrics. Our results reveal a consistent pattern: models achieve near-perfect schema compliance, yet the best Value Accuracy, measured by exact leaf-value match, reaches only 83.0% on text, 67.2% on images, and 23.7% on audio, where longer context makes extraction substantially harder. We release the dataset, evaluation pipeline, and all related code.
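The abstract describes Value Accuracy as an exact leaf-value match but does not give the formula. The following is a minimal sketch of one plausible reading, not the authors' evaluation pipeline: it flattens both JSON objects into path-to-leaf maps and reports the fraction of ground-truth leaves that the prediction reproduces exactly. The function names `flatten_leaves` and `value_accuracy`, the strict type comparison, and the invoice record are all assumptions made for illustration.

```python
from typing import Any, Dict, Tuple

Path = Tuple[Any, ...]


def flatten_leaves(node: Any, path: Path = ()) -> Dict[Path, Any]:
    """Map each leaf's path (dict keys and list indices) to its value.

    Empty dicts and lists are kept as leaves so they are not silently dropped.
    """
    if isinstance(node, dict) and node:
        leaves: Dict[Path, Any] = {}
        for key, child in node.items():
            leaves.update(flatten_leaves(child, path + (key,)))
        return leaves
    if isinstance(node, list) and node:
        leaves = {}
        for index, child in enumerate(node):
            leaves.update(flatten_leaves(child, path + (index,)))
        return leaves
    return {path: node}


def value_accuracy(predicted: Any, expected: Any) -> float:
    """Fraction of ground-truth leaves matched exactly by the prediction.

    Assumption: "exact" means same type and same value, so 1 and 1.0
    (or 1 and True) count as different leaves.
    """
    expected_leaves = flatten_leaves(expected)
    predicted_leaves = flatten_leaves(predicted)
    missing = object()  # sentinel for paths absent from the prediction
    matches = 0
    for leaf_path, want in expected_leaves.items():
        got = predicted_leaves.get(leaf_path, missing)
        if type(got) is type(want) and got == want:
            matches += 1
    return matches / len(expected_leaves)


# Hypothetical record: the prediction has the right shape but one wrong value.
expected = {"invoice": {"id": "A-17", "total": 120.5, "items": ["pen", "ink"]}}
predicted = {"invoice": {"id": "A-17", "total": 120.0, "items": ["pen", "ink"]}}

print(value_accuracy(predicted, expected))  # 0.75
```

In this hypothetical record the prediction has exactly the expected structure, so it would pass a schema check, yet one of its four leaves is wrong. That is the gap between schema compliance and value accuracy that the abstract reports.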
