Related papers: ReXCL: A Tool for Requirement Document Extraction and Classification

Related papers

ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images [19.490609860018804]
We introduce ExStrucTiny, a new benchmark dataset for structured Information Extraction (IE) from document images. Built through a novel pipeline combining manual and synthetic human-validated samples, ExStrucTiny covers more varied document types and extraction scenarios. We analyze open and closed Vision Language Models on this benchmark, highlighting challenges such as adaptation, query under-specification, and schema adaptation.
arXiv Detail & Related papers (2026-02-12T17:38:57Z)
MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns [80.05126590825121]
MonkeyOCR v1.5 is a unified vision-language framework that enhances both layout understanding and content recognition. To address complex table structures, we propose a visual consistency-based reinforcement learning scheme. Two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, are introduced to enable reliable parsing of tables.
arXiv Detail & Related papers (2025-11-13T15:12:17Z)
Hybrid OCR-LLM Framework for Enterprise-Scale Document Information Extraction Under Copy-heavy Task [11.672798725644121]
This work strategically combines OCR engines with Large Language Models (LLMs) to optimize the accuracy-efficiency trade-off inherent in repetitive document extraction tasks. We implement and evaluate 25 configurations across three extraction paradigms (direct, replacement, and table-based) on identity documents spanning four formats.
arXiv Detail & Related papers (2025-10-11T09:40:34Z)
MeXtract: Light-Weight Metadata Extraction from Scientific Papers [48.73595915402094]
We present MeXtract, a family of lightweight language models designed for metadata extraction from scientific papers. MeXtract achieves state-of-the-art performance on metadata extraction on the MOLE benchmark. We release all the code, datasets, and models openly for the research community.
arXiv Detail & Related papers (2025-10-08T11:12:28Z)
POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion [32.52489423671728]
High-quality labeled data is essential for training accurate document conversion models. We propose a fully automated framework comprising two stages for constructing high-quality document extraction datasets. We train a public POINTS-1.5 model to obtain POINTS-Reader, which surpasses many existing public and proprietary models of comparable or larger size.
arXiv Detail & Related papers (2025-09-01T07:54:18Z)
Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments [70.42705564227548]
We propose an automated environment construction pipeline for large language models (LLMs). This enables the creation of high-quality training environments that provide detailed and measurable feedback without relying on external tools. We also introduce a verifiable reward mechanism that evaluates both the precision of tool use and the completeness of task execution.
arXiv Detail & Related papers (2025-08-12T09:45:19Z)
Leveraging Machine Learning and Enhanced Parallelism Detection for BPMN Model Generation from Text [75.77648333476776]
This paper introduces an automated pipeline for extracting BPMN models from text. A key contribution of this work is the introduction of a newly annotated dataset. We augment the dataset with 15 newly annotated documents containing 32 parallel gateways for model training.
arXiv Detail & Related papers (2025-07-11T07:25:55Z)
ProxAnn: Use-Oriented Evaluations of Topic Models and Document Clustering [52.19512723549318]
We design a scalable human evaluation protocol that reflects practitioners' real-world usage of models. We use this protocol to collect extensive crowdworker annotations of outputs from a diverse set of topic models. We then use these annotations to validate automated proxies, finding that the best LLM proxies are statistically indistinguishable from a human annotator.
arXiv Detail & Related papers (2025-07-01T15:00:55Z)
Vision to Specification: Automating the Transition from Conceptual Features to Functional Requirements [10.85799957734291]
The EasyFR approach recommends Semantic Role Labeling sequences for the given abstract features to guide Pre-trained Language Models (PLMs) in producing cohesive functional requirements (FRs). Our results indicate a notable step forward in the realm of automated requirements synthesis, holding potential to improve the process of requirements specification in future software projects.
arXiv Detail & Related papers (2025-05-18T07:01:50Z)
ToolACE-R: Tool Learning with Adaptive Self-Refinement [84.69651852838794]
Tool learning allows Large Language Models to leverage external tools for solving complex user tasks. We propose ToolACE-R, a novel method that introduces adaptive self-refinement for tool invocations. Our results demonstrate the effectiveness of the proposed method, which is compatible with base models of various sizes.
arXiv Detail & Related papers (2025-04-02T06:38:56Z)
Self-Refinement Strategies for LLM-based Product Attribute Value Extraction [51.45146101802871]
This paper investigates applying two self-refinement techniques to the product attribute value extraction task. The experiments show that both self-refinement techniques fail to significantly improve the extraction performance while substantially increasing processing costs. For scenarios with development data, fine-tuning yields the highest performance, while the ramp-up costs of fine-tuning are balanced out as the amount of product descriptions increases.
arXiv Detail & Related papers (2025-01-02T12:55:27Z)
Adaptable Embeddings Network (AEN) [49.1574468325115]
We introduce Adaptable Embeddings Networks (AEN), a novel dual-encoder architecture using Kernel Density Estimation (KDE). AEN allows for runtime adaptation of classification criteria without retraining and is non-autoregressive. The architecture's ability to preprocess and cache condition embeddings makes it ideal for edge computing applications and real-time monitoring systems.
arXiv Detail & Related papers (2024-11-21T02:15:52Z)
Matchmaker: Self-Improving Large Language Model Programs for Schema Matching [60.23571456538149]
We propose a compositional language model program for schema matching, comprised of candidate generation, refinement and confidence scoring. Matchmaker self-improves in a zero-shot manner without the need for labeled demonstrations. Empirically, we demonstrate on real-world medical schema matching benchmarks that Matchmaker outperforms previous ML-based approaches.
arXiv Detail & Related papers (2024-10-31T16:34:03Z)
Digital requirements engineering with an INCOSE-derived SysML meta-model [0.0]
We extend the Model-Based Structured Requirement SysML Profile to comply with the INCOSE Guide to Writing Requirements. The resulting SysML Profile was applied in two system architecture models at NASA Jet Propulsion Laboratory.
arXiv Detail & Related papers (2024-10-12T03:06:13Z)
CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation [51.2289822267563]
We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets. We use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks.
arXiv Detail & Related papers (2024-09-03T17:54:40Z)
Automated Configuration Synthesis for Machine Learning Models: A git-Based Requirement and Architecture Management System [5.095988654970361]
This work introduces a tool for generating runtime configurations automatically from textual requirements stored as artifacts in git repositories (a.k.a. T-Reqs) alongside the software code. The tool leverages T-Reqs-modelled architectural description to identify relevant configuration properties for the deployment of artificial intelligence (AI)-enabled software systems.
arXiv Detail & Related papers (2024-04-26T08:35:02Z)
Retrieval as Attention: End-to-end Learning of Retrieval and Reading within a Single Transformer [80.50327229467993]
We show that a single model trained end-to-end can achieve both competitive retrieval and QA performance. We show that end-to-end adaptation significantly boosts its performance on out-of-domain datasets in both supervised and unsupervised settings.
arXiv Detail & Related papers (2022-12-05T04:51:21Z)
Natural Language Processing for Systems Engineering: Automatic Generation of Systems Modelling Language Diagrams [0.10312968200748115]
An approach is proposed to assist systems engineers in the automatic generation of systems diagrams from unstructured natural language text. The intention is to provide users with a more standardised, comprehensive and automated starting point from which they can subsequently refine and adapt the diagrams according to their needs.
arXiv Detail & Related papers (2022-08-09T19:20:33Z)
Document-level Entity-based Extraction as Template Generation [13.110360825201044]
We propose a generative framework for two document-level EE tasks: role-filler entity extraction (REE) and relation extraction (RE). We first formulate them as a template generation problem, allowing models to efficiently capture cross-entity dependencies. A novel cross-attention guided copy mechanism, TopK Copy, is incorporated into a pre-trained sequence-to-sequence model to enhance the capabilities of identifying key information.
arXiv Detail & Related papers (2021-09-10T14:18:22Z)
GRIT: Generative Role-filler Transformers for Document-level Event Entity Extraction [134.5580003327839]
We introduce a generative transformer-based encoder-decoder framework (GRIT) to model context at the document level. We evaluate our approach on the MUC-4 dataset, and show that our model performs substantially better than prior work.
arXiv Detail & Related papers (2020-08-21T01:07:36Z)

This list is automatically generated from the titles and abstracts of the papers on this site.