Extract-0: A Specialized Language Model for Document Information Extraction
- URL: http://arxiv.org/abs/2509.22906v1
- Date: Fri, 26 Sep 2025 20:34:43 GMT
- Title: Extract-0: A Specialized Language Model for Document Information Extraction
- Authors: Henrique Godoy,
- Abstract summary: This paper presents Extract-0, a 7-billion parameter language model specifically optimized for document information extraction.<n>Extract-0 achieves a mean reward of 0.573 on a benchmark of 1,000 diverse document extraction tasks, outperforming GPT-4.1 (0.457), o3 (0.464), and GPT-4.1-2025 (0.459).
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents Extract-0, a 7-billion parameter language model specifically optimized for document information extraction that achieves performance exceeding models with parameter counts several orders of magnitude larger. Through a novel combination of synthetic data generation, supervised fine-tuning with Low-Rank Adaptation (LoRA), and reinforcement learning via Group Relative Policy Optimization (GRPO), Extract-0 achieves a mean reward of 0.573 on a benchmark of 1,000 diverse document extraction tasks, outperforming GPT-4.1 (0.457), o3 (0.464), and GPT-4.1-2025 (0.459). The training methodology employs a memory-preserving synthetic data generation pipeline that produces 280,128 training examples from diverse document sources, followed by parameterefficient fine-tuning that modifies only 0.53% of model weights (40.4M out of 7.66B parameters). The reinforcement learning phase introduces a novel semantic similarity-based reward function that handles the inherent ambiguity in information extraction tasks. This research demonstrates that task-specific optimization can yield models that surpass general-purpose systems while requiring substantially fewer computational resource.
Related papers
- LOCUS: A System and Method for Low-Cost Customization for Universal Specialization [4.151679589098346]
We present LOCUS, a pipeline that consumes few-shot data to streamline the construction and training of NLP models.<n>With only a small number of labeled examples, LOCUS discovers pertinent data in a broad repository, synthesizes additional training samples via in-context data generation, and fine-tunes models.
arXiv Detail & Related papers (2025-12-06T01:32:58Z) - Scaling Open-Weight Large Language Models for Hydropower Regulatory Information Extraction: A Systematic Analysis [0.2555802789878571]
Information extraction from regulatory documents using large language models presents critical trade-offs between performance and computational resources.<n>We evaluated seven open-weight models (0.6B-70B parameters) on hydropower licensing documentation to provide empirical deployment guidance.
arXiv Detail & Related papers (2025-11-14T19:23:25Z) - SPaRFT: Self-Paced Reinforcement Fine-Tuning for Large Language Models [51.74498855100541]
Large language models (LLMs) have shown strong reasoning capabilities when fine-tuned with reinforcement learning (RL)<n>We propose textbfSPaRFT, a self-paced learning framework that enables efficient learning based on the capability of the model being trained.
arXiv Detail & Related papers (2025-08-07T03:50:48Z) - A Reproducible, Scalable Pipeline for Synthesizing Autoregressive Model Literature [0.0]
Autoregressive generative models have produced thousands of papers, making manual literature surveys and reproduction studies increasingly impractical.<n>We present a fully open-source, reproducible pipeline that automatically retrieves candidate documents from public repositories.<n>Experiments on corpora of up to 1000 papers demonstrate near-linear scalability with eight CPU workers.
arXiv Detail & Related papers (2025-08-06T16:33:20Z) - Combatting Dimensional Collapse in LLM Pre-Training Data via Diversified File Selection [65.96556073745197]
DiverSified File selection algorithm (DiSF) is proposed to select the most decorrelated text files in the feature space.<n>DiSF saves 98.5% of 590M training files in SlimPajama, outperforming the full-data pre-training within a 50B training budget.
arXiv Detail & Related papers (2025-04-29T11:13:18Z) - Enhanced Multi-Tuple Extraction for Alloys: Integrating Pointer Networks and Augmented Attention [6.938202451113495]
We present a novel framework that combines an extraction model based on MatSciBERT with pointer and an allocation model.<n>Our experiments on extraction demonstrate impressive F1 scores of 0.947, 0.93 and 0.753 across datasets.<n>These results highlight the model's capacity to deliver precise and structured information.
arXiv Detail & Related papers (2025-03-10T02:39:06Z) - Leveraging large language models for structured information extraction from pathology reports [0.0]
We evaluate large language models' accuracy in extracting structured information from breast cancer histopathology reports.<n>Open-source tool for structured information extraction can be customized by non-programmers using natural language.
arXiv Detail & Related papers (2025-02-14T21:46:02Z) - Clear Preferences Leave Traces: Reference Model-Guided Sampling for Preference Learning [59.11519451499754]
Direct Preference Optimization (DPO) has emerged as a de-facto approach for aligning language models with human preferences.<n>Recent work has shown DPO's effectiveness relies on training data quality.<n>We discover that reference model probability space naturally detects high-quality training samples.
arXiv Detail & Related papers (2025-01-25T07:21:50Z) - iTool: Reinforced Fine-Tuning with Dynamic Deficiency Calibration for Advanced Tool Use [56.31110409360567]
Augmenting large language models with external tools is a promising approach to enhance their capabilities.<n>We show that training gains significantly decay as synthetic data increases.<n>We propose an iterative reinforced fine-tuning strategy designed to alleviate this limitation.
arXiv Detail & Related papers (2025-01-15T04:52:34Z) - Crafting Efficient Fine-Tuning Strategies for Large Language Models [2.633490094119608]
Fine-tuning large language models (LLMs) with as few as 200 samples can improve model accuracy from 70% to 88% in a product attribute extraction task.
A bayesian hyperparameter optimization method, which evaluates models at 20% of total training time, correlates strongly with final model performance.
This approach led to a 2% improvement in accuracy over baseline models when evaluated on an independent test set.
arXiv Detail & Related papers (2024-07-18T21:36:00Z) - Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation [56.13803674092712]
We propose an industrial-friendly, expert-aligned and diversity-preserved instruction data selection method: Clustering and Ranking (CaR)
CaR employs a two-step process: first, it ranks instruction pairs using a high-accuracy (84.25%) scoring model aligned with expert preferences; second, it preserves dataset diversity through clustering.
In our experiment, CaR efficiently selected a mere 1.96% of Alpaca's IT data, yet the resulting AlpaCaR model surpassed Alpaca's performance by an average of 32.1% in GPT-4 evaluations.
arXiv Detail & Related papers (2024-02-28T09:27:29Z) - Towards Efficient Vision-Language Tuning: More Information Density, More Generalizability [73.34532767873785]
We propose the concept of Information Density'' (ID) to indicate whether a matrix strongly belongs to certain feature spaces.
We introduce the Dense Information Prompt (DIP) to enhance information density to improve generalization.
DIP significantly reduces the number of tunable parameters and the requisite storage space, making it particularly advantageous in resource-constrained settings.
arXiv Detail & Related papers (2023-12-17T20:42:43Z) - Improving Zero and Few-Shot Abstractive Summarization with Intermediate
Fine-tuning and Data Augmentation [101.26235068460551]
Models pretrained with self-supervised objectives on large text corpora achieve state-of-the-art performance on English text summarization tasks.
Models are typically fine-tuned on hundreds of thousands of data points, an infeasible requirement when applying summarization to new, niche domains.
We introduce a novel and generalizable method, called WikiTransfer, for fine-tuning pretrained models for summarization in an unsupervised, dataset-specific manner.
arXiv Detail & Related papers (2020-10-24T08:36:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.