Fugu-MT Paper Translation (Abstract): The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data

Paper overview: The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data

arxiv url: http://arxiv.org/abs/2606.18192v2
Date: Wed, 17 Jun 2026 17:09:10 GMT
Status: Translation complete
Last updated in system: 2026-06-18 13:57:35.228256
Title: The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data
Authors: Nick Bettencourt, Xiaowei Ding, Kay Giesecke
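Abstract summary: The Stanford EDGAR Filings Dataset (SEFD) is a reconstruction of SEC filings into layout-faithful MultiMarkdown. SEFD makes audited financial statements, risk disclosures, ownership reports, accounting notes, and market-moving event filings usable as long-context pretraining data. We release SEFD-v1, a 152B-token initial public snapshot, and provide corpus-level analyses of a larger 18.5M-filing archive estimated at 550B tokens.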
Reference score (independently computed attention score): 1.6404022072626985
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: As high-quality public web corpora become increasingly exhausted, clean long-context documents have become a scarce and expensive source of training data for large language models (LLMs). Existing long-context corpora are often proprietary and costly to acquire, synthetically generated, or concentrated in narrow domains such as programming. We introduce the Stanford EDGAR Filings Dataset (SEFD), an open reconstruction of SEC filings into layout-faithful MultiMarkdown for financial language modeling and evaluation. SEFD makes audited financial statements, risk disclosures, ownership reports, accounting notes, and market-moving event filings usable as long-context pretraining data and as a basis for financial reasoning, forecasting, compliance, and document understanding. The resulting corpus is token-efficient, model-ready, and has less than 0.1% overlap with Common Crawl-derived corpora. We release SEFD-v1, a 152B-token initial public snapshot, and provide corpus-level analyses of a larger 18.5M-filing archive estimated at 550B tokens. We further introduce two SEFD-derived benchmarks: EDGAR-Forecast, which evaluates filing-grounded numerical forecasting after model knowledge cutoffs, and EDGAR-OCR, which evaluates transcription of complex financial tables.


List of related papers