Automated Parsing of Engineering Drawings for Structured Information Extraction Using a Fine-tuned Document Understanding Transformer
- URL: http://arxiv.org/abs/2505.01530v1
- Date: Fri, 02 May 2025 18:33:21 GMT
- Title: Automated Parsing of Engineering Drawings for Structured Information Extraction Using a Fine-tuned Document Understanding Transformer
- Authors: Muhammad Tayyab Khan, Zane Yong, Lequn Chen, Jun Ming Tan, Wenhe Feng, Seung Ki Moon
- Abstract summary: This paper proposes a novel hybrid deep learning framework for structured information extraction. It integrates an oriented bounding box (OBB) model with a transformer-based document parsing model (Donut). The proposed framework improves accuracy, reduces manual effort, and supports scalable deployment in precision-driven industries.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurate extraction of key information from 2D engineering drawings is crucial for high-precision manufacturing. Manual extraction is time-consuming and error-prone, while traditional Optical Character Recognition (OCR) techniques often struggle with complex layouts and overlapping symbols, resulting in unstructured outputs. To address these challenges, this paper proposes a novel hybrid deep learning framework for structured information extraction by integrating an oriented bounding box (OBB) detection model with a transformer-based document parsing model (Donut). An in-house annotated dataset is used to train YOLOv11 for detecting nine key categories: Geometric Dimensioning and Tolerancing (GD&T), General Tolerances, Measures, Materials, Notes, Radii, Surface Roughness, Threads, and Title Blocks. Detected OBBs are cropped into images and labeled to fine-tune Donut for structured JSON output. Fine-tuning strategies include a single model trained across all categories and category-specific models. Results show that the single model consistently outperforms category-specific ones across all evaluation metrics, achieving higher precision (94.77% for GD&T), recall (100% for most), and F1 score (97.3%), while reducing hallucination (5.23%). The proposed framework improves accuracy, reduces manual effort, and supports scalable deployment in precision-driven industries.
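The two-stage pipeline the abstract describes (detect oriented regions, crop them, then parse each crop with Donut into structured JSON) can be pictured with a short sketch. This is a minimal illustration, not the authors' code: the checkpoint names (`yolo11-obb-drawings.pt`, `donut-drawings`) and the `<s_drawing>` task token are hypothetical stand-ins for the fine-tuned models, while the ultralytics and Hugging Face APIs used are real.

```python
# Minimal sketch of the detect-crop-parse pipeline (hypothetical checkpoints).
from PIL import Image
from ultralytics import YOLO
from transformers import DonutProcessor, VisionEncoderDecoderModel

detector = YOLO("yolo11-obb-drawings.pt")                     # hypothetical fine-tuned OBB weights
processor = DonutProcessor.from_pretrained("donut-drawings")  # hypothetical fine-tuned Donut
parser = VisionEncoderDecoderModel.from_pretrained("donut-drawings")

drawing = Image.open("drawing.png").convert("RGB")
result = detector(drawing)[0]

records = []
for poly, cls in zip(result.obb.xyxyxyxy, result.obb.cls):
    # Crop the axis-aligned envelope of the oriented box; a real pipeline
    # would rotate-and-crop to deskew the region first.
    xs, ys = poly[:, 0], poly[:, 1]
    crop = drawing.crop((int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())))

    pixel_values = processor(crop, return_tensors="pt").pixel_values
    prompt_ids = processor.tokenizer("<s_drawing>", add_special_tokens=False,
                                     return_tensors="pt").input_ids  # hypothetical task token
    output_ids = parser.generate(pixel_values, decoder_input_ids=prompt_ids,
                                 max_length=512)
    sequence = processor.batch_decode(output_ids)[0]  # real code would strip special tokens
    records.append({"category": detector.names[int(cls)],
                    "fields": processor.token2json(sequence)})

print(records)  # structured, JSON-like output per detected region
```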
Related papers
- From Drawings to Decisions: A Hybrid Vision-Language Framework for Parsing 2D Engineering Drawings into Structured Manufacturing Knowledge [0.0]
Key information from 2D engineering drawings is essential for advancing digital manufacturing. Manual extraction is slow and labor-intensive, while generic OCR models often fail due to complex layouts, engineering symbols, and rotated text. We propose a hybrid vision-language framework that integrates a rotation-aware object detection model (YOLOv11-obb) with a transformer-based vision-language model (VLM).
arXiv Detail & Related papers (2025-06-20T17:10:01Z) - Textile Analysis for Recycling Automation using Transfer Learning and Zero-Shot Foundation Models [6.932314083921248]
This paper investigates the use of standard RGB imagery, a cost-effective sensing modality, for key pre-processing tasks in an automated system. We present computer vision components designed for a conveyor belt setup to perform (a) classification of four common textile types and (b) segmentation of non-textile features such as buttons and zippers.
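As a sketch of the transfer-learning component, one plausible setup (not necessarily the paper's exact architecture) replaces the head of an ImageNet-pretrained backbone with a four-way textile classifier; the backbone choice and class labels below are illustrative assumptions.

```python
# Transfer-learning sketch: four-class textile classifier on RGB images.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 4  # e.g., cotton, polyester, wool, blend (hypothetical labels)

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for param in model.parameters():
    param.requires_grad = False                               # freeze pretrained features
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)       # new trainable head

optimizer = torch.optim.AdamW(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step on a batch of conveyor-belt RGB crops."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```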
arXiv Detail & Related papers (2025-06-06T22:49:53Z) - TSPulse: Dual Space Tiny Pre-Trained Models for Rapid Time-Series Analysis [12.034816114258803]
TSPulse is an ultra-compact time-series pre-trained model with only 1M parameters. It performs strongly across classification, anomaly detection, imputation, and retrieval tasks. These results are achieved with just 1M parameters (10-100X smaller than existing SOTA models).
arXiv Detail & Related papers (2025-05-19T12:18:53Z) - TWSSenti: A Novel Hybrid Framework for Topic-Wise Sentiment Analysis on Social Media Using Transformer Models [0.0]
This study explores a hybrid framework combining transformer-based models to improve sentiment classification accuracy and robustness. The framework addresses challenges such as noisy data, contextual ambiguity, and generalization across diverse datasets. This research highlights its applicability to real-world tasks such as social media monitoring, customer sentiment analysis, and public opinion tracking.
arXiv Detail & Related papers (2025-04-14T05:44:11Z) - Hybrid-Segmentor: A Hybrid Approach to Automated Fine-Grained Crack Segmentation in Civil Infrastructure [52.2025114590481]
We introduce Hybrid-Segmentor, an encoder-decoder based approach that is capable of extracting both fine-grained local and global crack features.
This allows the model to better generalize when distinguishing cracks of various shapes, surfaces, and sizes.
The proposed model outperforms existing benchmark models across 5 quantitative metrics (accuracy 0.971, precision 0.804, recall 0.744, F1-score 0.770, and IoU score 0.630), achieving state-of-the-art status.
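The encoder-decoder idea is easy to picture with a toy skeleton. The sketch below is a generic downsample-upsample reduction, assumed for illustration only; Hybrid-Segmentor's actual encoder is more elaborate than this.

```python
# Toy encoder-decoder for binary crack segmentation (illustrative only).
import torch
import torch.nn as nn

class TinySegmentor(nn.Module):
    def __init__(self, ch: int = 32):
        super().__init__()
        self.enc = nn.Sequential(                  # downsample: extract local features
            nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.dec = nn.Sequential(                  # upsample back to input resolution
            nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(ch, 1, 4, stride=2, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dec(self.enc(x))               # per-pixel crack logits

mask_logits = TinySegmentor()(torch.randn(1, 3, 256, 256))
print(mask_logits.shape)  # torch.Size([1, 1, 256, 256])
```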
arXiv Detail & Related papers (2024-09-04T16:47:16Z) - Open-Set Deepfake Detection: A Parameter-Efficient Adaptation Method with Forgery Style Mixture [58.60915132222421]
We introduce an approach that is both general and parameter-efficient for face forgery detection.
We design a forgery-style mixture formulation that augments the diversity of forgery source domains.
We show that the designed model achieves state-of-the-art generalizability with significantly reduced trainable parameters.
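One common way to "mix styles" across source domains, which may approximate the formulation here, is to interpolate per-sample feature statistics in the spirit of MixStyle. The sketch below shows that generic mechanism as an assumed stand-in, not this paper's actual forgery-style mixture.

```python
# Generic feature-statistics mixing (MixStyle-like); an assumed stand-in
# for the paper's forgery-style mixture, not its published formulation.
import torch

def mix_styles(feats: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Blend channel-wise mean/std of each sample with a shuffled partner."""
    b = feats.size(0)
    mu = feats.mean(dim=(2, 3), keepdim=True)
    sigma = feats.std(dim=(2, 3), keepdim=True) + 1e-6
    normed = (feats - mu) / sigma

    perm = torch.randperm(b)                      # random partner per sample
    lam = torch.distributions.Beta(alpha, alpha).sample((b, 1, 1, 1))
    mixed_mu = lam * mu + (1 - lam) * mu[perm]
    mixed_sigma = lam * sigma + (1 - lam) * sigma[perm]
    return normed * mixed_sigma + mixed_mu        # re-style with mixed statistics
```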
arXiv Detail & Related papers (2024-08-23T01:53:36Z) - MimiQ: Low-Bit Data-Free Quantization of Vision Transformers with Encouraging Inter-Head Attention Similarity [32.532780329341186]
Data-free quantization (DFQ) is a technique that creates a lightweight network from its full-precision counterpart without the original training data, often through a synthetic dataset. Several DFQ methods have been proposed for vision transformer (ViT) architectures, but they fail to achieve efficacy in low-bit settings. We devise MimiQ, a novel DFQ method designed for ViTs that enhances inter-head attention similarity.
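A simple way to "encourage inter-head attention similarity" is to penalize pairwise dissimilarity between per-head attention maps; the sketch below is one plausible reading of that objective, not MimiQ's published loss.

```python
# Illustrative inter-head attention-alignment penalty; an assumption about
# what encouraging inter-head attention similarity could look like.
import torch
import torch.nn.functional as F

def head_similarity_loss(attn: torch.Tensor) -> torch.Tensor:
    """attn: (batch, heads, query, key) attention maps from one ViT layer.

    Returns a loss that shrinks as heads attend to similar locations.
    """
    b, h, q, k = attn.shape
    flat = F.normalize(attn.reshape(b, h, q * k), dim=-1)   # unit vector per head
    sim = torch.einsum("bik,bjk->bij", flat, flat)          # head-pair cosine sims
    off_diag = sim.sum(dim=(1, 2)) - sim.diagonal(dim1=1, dim2=2).sum(dim=1)
    return 1.0 - off_diag.mean() / (h * (h - 1))            # 0 when all heads agree
```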
arXiv Detail & Related papers (2024-07-29T13:57:40Z) - SINDER: Repairing the Singular Defects of DINOv2 [61.98878352956125]
Vision Transformer models trained on large-scale datasets often exhibit artifacts in the patch tokens they extract.
We propose a novel smooth regularization, applied during fine-tuning, that rectifies structural deficiencies using only a small dataset.
arXiv Detail & Related papers (2024-07-23T20:34:23Z) - Restore Anything Model via Efficient Degradation Adaptation [129.38475243424563]
RAM takes a unified path that leverages inherent similarities across various degradations to enable efficient and comprehensive restoration. Experiments confirm RAM's SOTA performance, reducing model complexity by approximately 82% in trainable parameters and 85% in FLOPs.
arXiv Detail & Related papers (2024-07-18T10:26:53Z) - Prompt-Driven Building Footprint Extraction in Aerial Images with Offset-Building Model [13.09874563780039]
Existing methods based on instance segmentation suffer from poor generalization when extended to large-scale data production. We propose a novel Offset-Building Model (OBM) to transform end-to-end algorithms into promptable methods. Our model reduces offset errors by 16.6% and improves roof Intersection over Union (IoU) by 10.8% compared to other models.
arXiv Detail & Related papers (2023-10-25T15:44:50Z) - Extensive Evaluation of Transformer-based Architectures for Adverse Drug Events Extraction [6.78974856327994]
Adverse Event (ADE) extraction is one of the core tasks in digital pharmacovigilance.
We evaluate 19 Transformer-based models for ADE extraction on informal texts.
At the end of our analyses, we identify a list of take-home messages that can be derived from the experimental data.
arXiv Detail & Related papers (2023-06-08T15:25:24Z) - Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [49.15931834209624]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world. We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique. By further elaborating the robustness metric, a model is judged to be robust only if its performance is consistently accurate across entire cliques.
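The clique-based criterion can be made concrete: a model earns credit on a clique only when it is accurate on every paraphrased member. A small sketch of that aggregation follows; the data layout and correctness predicate are assumptions for illustration.

```python
# Sketch of clique-level robustness: credit only when all members of a
# knowledge-invariant clique are handled correctly. Data layout is assumed.
from typing import Callable, Dict, List

def clique_robustness(cliques: Dict[str, List[str]],
                      is_correct: Callable[[str], bool]) -> float:
    """Fraction of cliques on which the model is correct for every variant."""
    robust = sum(all(is_correct(example) for example in members)
                 for members in cliques.values())
    return robust / max(len(cliques), 1)

# Usage: each clique groups surface variants of the same underlying fact.
cliques = {"fact-1": ["sentence A", "paraphrase A'"],
           "fact-2": ["sentence B", "paraphrase B'"]}
print(clique_robustness(cliques, lambda s: len(s) > 5))  # toy correctness check
```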
arXiv Detail & Related papers (2023-05-23T12:05:09Z) - Condensing Two-stage Detection with Automatic Object Key Part Discovery [87.1034745775229]
Two-stage object detectors generally require excessively large models for their detection heads to achieve high accuracy.
We propose that the model parameters of two-stage detection heads can be condensed and reduced by concentrating on object key parts.
Our proposed technique consistently maintains original performance while waiving around 50% of the model parameters of common two-stage detection heads.
arXiv Detail & Related papers (2020-06-10T01:20:47Z) - End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture.
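The set-based global loss rests on bipartite matching between predictions and ground truth. Below is a minimal sketch of that matching step, with a simplified cost (class probability plus L1 box distance) in the spirit of DETR's; the paper's full cost also includes a generalized IoU term.

```python
# Bipartite matching between predicted and ground-truth boxes, the core of
# DETR's set-based loss. Costs are simplified relative to the paper.
import torch
from scipy.optimize import linear_sum_assignment

def match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """pred_logits: (Q, C), pred_boxes: (Q, 4), gt_labels: (G,), gt_boxes: (G, 4)."""
    prob = pred_logits.softmax(-1)                      # (Q, C)
    cost_class = -prob[:, gt_labels]                    # (Q, G): pay for low class prob
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)  # (Q, G): L1 box distance
    cost = cost_class + 5.0 * cost_bbox                 # weighting follows DETR's spirit
    rows, cols = linear_sum_assignment(cost.detach().numpy())
    return list(zip(rows, cols))                        # unique prediction<->target pairs

pairs = match(torch.randn(100, 92), torch.rand(100, 4),
              torch.tensor([3, 17]), torch.rand(2, 4))
print(pairs)  # e.g. [(12, 0), (57, 1)]
```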
arXiv Detail & Related papers (2020-05-26T17:06:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.