Related papers: Fine-Tuning Vision-Language Model for Automated Engineering Drawing Information Extraction

URL: http://arxiv.org/abs/2411.03707v1
Date: Wed, 06 Nov 2024 07:11:15 GMT
Title: Fine-Tuning Vision-Language Model for Automated Engineering Drawing Information Extraction
Authors: Muhammad Tayyab Khan, Lequn Chen, Ye Han Ng, Wenhe Feng, Nicholas Yew Jin Tan, Seung Ki Moon
Abstract summary: Florence-2, an open-source vision-language model (VLM), is fine-tuned on a dataset of 400 drawings with ground truth annotations provided by domain experts. Compared to the best-performing closed-source model, it achieves a 29.95% increase in precision, a 37.75% increase in recall, a 52.40% improvement in F1-score, and a 43.15% reduction in hallucination rate.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Geometric Dimensioning and Tolerancing (GD&T) plays a critical role in manufacturing by defining acceptable variations in part features to ensure component quality and functionality. However, extracting GD&T information from 2D engineering drawings is a time-consuming and labor-intensive task, often relying on manual efforts or semi-automated tools. To address these challenges, this study proposes an automated and computationally efficient GD&T extraction method by fine-tuning Florence-2, an open-source vision-language model (VLM). The model is trained on a dataset of 400 drawings with ground truth annotations provided by domain experts. For comparison, two state-of-the-art closed-source VLMs, GPT-4o and Claude-3.5-Sonnet, are evaluated on the same dataset. All models are assessed using precision, recall, F1-score, and hallucination metrics. Due to the computational cost and impracticality of fine-tuning large closed-source VLMs for domain-specific tasks, GPT-4o and Claude-3.5-Sonnet are evaluated in a zero-shot setting. In contrast, Florence-2, a smaller model with 0.23 billion parameters, is optimized through full-parameter fine-tuning across three distinct experiments, each utilizing datasets augmented to different levels. The results show that Florence-2 achieves a 29.95% increase in precision, a 37.75% increase in recall, a 52.40% improvement in F1-score, and a 43.15% reduction in hallucination rate compared to the best-performing closed-source model. These findings highlight the effectiveness of fine-tuning smaller, open-source VLMs like Florence-2, offering a practical and efficient solution for automated GD&T extraction to support downstream manufacturing tasks.
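The abstract names precision, recall, F1-score, and a hallucination metric but does not define them here. The following is a minimal sketch of how such scores could be computed for one drawing, assuming extracted GD&T items are compared to expert annotations by exact string match. The function name `extraction_metrics`, the exact-match rule, and the definition of hallucination rate as the share of predicted items absent from the ground truth are all illustrative assumptions, not the paper's stated method.

```python
def extraction_metrics(predicted, ground_truth):
    """Score one drawing's extracted items against expert annotations.

    Assumption: items are plain strings compared by exact match.
    """
    pred = set(predicted)
    truth = set(ground_truth)

    true_pos = len(pred & truth)   # items extracted and annotated
    false_pos = len(pred - truth)  # items extracted but not annotated

    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    # Assumption: hallucination rate is the fraction of predicted items
    # that do not appear in the ground truth (the paper may define it
    # differently).
    hallucination_rate = false_pos / len(pred) if pred else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "hallucination_rate": hallucination_rate,
    }


if __name__ == "__main__":
    # Hypothetical example items, not taken from the paper's dataset.
    predicted = ["dia 10 +/-0.1", "flatness 0.05", "position 0.2 A B"]
    annotated = ["dia 10 +/-0.1", "flatness 0.05", "perpendicularity 0.1 A", "R5"]
    print(extraction_metrics(predicted, annotated))
```

Under this definition the hallucination rate equals one minus precision; a stricter variant might count only items with no plausible counterpart anywhere on the drawing.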

Related papers

From Drawings to Decisions: A Hybrid Vision-Language Framework for Parsing 2D Engineering Drawings into Structured Manufacturing Knowledge [0.0]
Key information from 2D engineering drawings is essential for advancing digital manufacturing. Manual extraction is slow and labor-intensive, while generic OCR models often fail due to complex layouts, engineering symbols, and rotated text. We propose a hybrid vision-language framework that integrates a rotation-aware object detection model (YOLOv11-obb) with a transformer-based vision-language model (VLM).
arXiv Detail & Related papers (2025-06-20T17:10:01Z)
MiniCPM4: Ultra-Efficient LLMs on End Devices [124.73631357883228]
MiniCPM4 is a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems. MiniCPM4 is available in two versions, with 0.5B and 8B parameters, respectively.
arXiv Detail & Related papers (2025-06-09T16:16:50Z)
EfficientLLM: Efficiency in Large Language Models [64.3537131208038]
Large Language Models (LLMs) have driven significant progress, yet their growing parameter counts and context windows incur prohibitive compute, energy, and monetary costs. We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale.
arXiv Detail & Related papers (2025-05-20T02:27:08Z)
You Don't Need All Attentions: Distributed Dynamic Fine-Tuning for Foundation Models [13.234730313131054]
We introduce a novel Distributed Dynamic Fine-Tuning (D2FT) framework that orchestrates operations across attention modules. D2FT significantly reduces the computational workload required for fine-tuning foundation models. Results show that D2FT can be effectively extended to recent LoRA, a state-of-the-art parameter-efficient fine-tuning technique.
arXiv Detail & Related papers (2025-04-16T20:18:15Z)
EffOWT: Transfer Visual Language Models to Open-World Tracking Efficiently and Effectively [60.48750788231384]
Open-World Tracking (OWT) aims to track every object of any category, which requires the model to have strong generalization capabilities. EffOWT achieves an absolute gain of 5.5% on the tracking metric OWTA for unknown categories, while only updating 1.3% of the parameters compared to full fine-tuning.
arXiv Detail & Related papers (2025-04-07T14:47:58Z)
Automated, Unsupervised, and Auto-parameterized Inference of Data Patterns and Anomaly Detection [6.454528834218153]
RIOLU is fully automated, automatically parameterized, and does not need labeled samples. RIOLU can generate precise patterns from datasets in various domains, with a high F1 score of 97.2%. A variant of RIOLU, with user guidance, can further boost its precision, with up to 37.4% improvement in terms of F1.
arXiv Detail & Related papers (2024-12-06T18:18:26Z)
DELIFT: Data Efficient Language model Instruction Fine Tuning [13.538140114667772]
We introduce DELIFT, a novel algorithm that systematically optimizes data selection across the three key stages of fine-tuning. Experiments across various tasks and model scales demonstrate that DELIFT can reduce the fine-tuning data size by up to 70% without compromising performance.
arXiv Detail & Related papers (2024-11-07T04:38:29Z)
Leveraging Vision-Language Models for Manufacturing Feature Recognition in CAD Designs [0.0]
This study investigates vision-language models (VLMs) for automating the recognition of a wide range of manufacturing features in CAD designs. Prompt engineering techniques, such as multi-view query images, few-shot learning, sequential reasoning, and chain-of-thought, are applied to enable recognition.
arXiv Detail & Related papers (2024-11-05T04:57:55Z)
Crafting Efficient Fine-Tuning Strategies for Large Language Models [2.633490094119608]
Fine-tuning large language models (LLMs) with as few as 200 samples can improve model accuracy from 70% to 88% in a product attribute extraction task. A Bayesian hyperparameter optimization method, which evaluates models at 20% of total training time, correlates strongly with final model performance. This approach led to a 2% improvement in accuracy over baseline models when evaluated on an independent test set.
arXiv Detail & Related papers (2024-07-18T21:36:00Z)
AutoFT: Learning an Objective for Robust Fine-Tuning [60.641186718253735]
Foundation models encode rich representations that can be adapted to downstream tasks by fine-tuning. Current approaches to robust fine-tuning use hand-crafted regularization techniques. We propose AutoFT, a data-driven approach for robust fine-tuning.
arXiv Detail & Related papers (2024-01-18T18:58:49Z)
VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness [56.87603097348203]
VeCAF uses labels and natural language annotations to perform parametric data selection for pretrained vision model (PVM) finetuning. VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence. On ImageNet, VeCAF uses up to 3.3x fewer training batches to reach the target performance compared to full finetuning.
arXiv Detail & Related papers (2024-01-15T17:28:37Z)
When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique. Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z)
Defect Analysis of 3D Printed Cylinder Object Using Transfer Learning Approaches [0.51795041186793]
This study explores the effectiveness of machine learning approaches, specifically transfer learning (TL) models, for defect detection in 3D-printed cylinders. Images of cylinders were analyzed using models including VGG16, VGG19, ResNet50, ResNet101, InceptionResNetV2, and MobileNetV2. Results suggest certain TL models can deliver high accuracy for additive manufacturing (AM) defect classification, although performance varies across algorithms.
arXiv Detail & Related papers (2023-10-12T18:10:36Z)
Scaling & Shifting Your Features: A New Baseline for Efficient Model Tuning [126.84770886628833]
Existing finetuning methods either tune all parameters of the pretrained model (full finetuning) or only tune the last linear layer (linear probing). We propose a new parameter-efficient finetuning method termed SSF, in which researchers only need to Scale and Shift the deep Features extracted by a pre-trained model to catch up with the performance of full finetuning.
arXiv Detail & Related papers (2022-10-17T08:14:49Z)
FedDM: Iterative Distribution Matching for Communication-Efficient Federated Learning [87.08902493524556]
Federated learning (FL) has recently attracted increasing attention from academia and industry. We propose FedDM to build the global training objective from multiple local surrogate functions. In detail, we construct synthetic sets of data on each client to locally match the loss landscape from original data.
arXiv Detail & Related papers (2022-07-20T04:55:18Z)
Complementary Ensemble Learning [1.90365714903665]
We derive a technique to improve performance of state-of-the-art deep learning models. Specifically, we train auxiliary models which are able to complement state-of-the-art model uncertainty.
arXiv Detail & Related papers (2021-11-09T03:23:05Z)
DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models [152.29364079385635]
As pre-trained models grow bigger, the fine-tuning process can be time-consuming and computationally expensive. We propose a framework for resource- and parameter-efficient fine-tuning by leveraging the sparsity prior in both weight updates and the final model weights. Our proposed framework, dubbed Dually Sparsity-Embedded Efficient Tuning (DSEE), aims to achieve two key objectives: (i) parameter efficient fine-tuning and (ii) resource-efficient inference.
arXiv Detail & Related papers (2021-10-30T03:29:47Z)

This list is automatically generated from the titles and abstracts of the papers on this site.