Structured Extraction from Business Process Diagrams Using Vision-Language Models
- URL: http://arxiv.org/abs/2511.22448v1
- Date: Thu, 27 Nov 2025 13:35:57 GMT
- Title: Structured Extraction from Business Process Diagrams Using Vision-Language Models
- Authors: Pritam Deka, Barry Devereux
- Abstract summary: We present a pipeline that uses Vision-Language Models (VLMs) to extract structured representations of BPMN diagrams directly from images. We also incorporate optical character recognition (OCR) for textual enrichment and evaluate the generated element lists. Our approach enables robust component extraction in scenarios where original source files are unavailable.
- Score: 0.3007949058551534
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Business Process Model and Notation (BPMN) is a widely adopted standard for representing complex business workflows. While BPMN diagrams are often exchanged as visual images, existing methods primarily rely on XML representations for computational analysis. In this work, we present a pipeline that leverages Vision-Language Models (VLMs) to extract structured JSON representations of BPMN diagrams directly from images, without requiring source model files or textual annotations. We also incorporate optical character recognition (OCR) for textual enrichment and evaluate the generated element lists against ground truth data derived from the source XML files. Our approach enables robust component extraction in scenarios where original source files are unavailable. We benchmark multiple VLMs and observe performance improvements in several models when OCR is used for text enrichment. In addition, we conduct extensive statistical analyses of OCR-based enrichment methods and prompt ablation studies, providing a clearer understanding of their impact on model performance.
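To make the described pipeline concrete, here is a minimal Python sketch of the flow, assuming pytesseract for the OCR step and a placeholder `query_vlm` standing in for whichever VLM is benchmarked; the prompt wording and JSON schema below are illustrative, not the authors' exact ones.

```python
# Minimal sketch of the pipeline: image -> (optional OCR enrichment) ->
# VLM -> JSON element list -> set-based comparison with XML-derived truth.
import json
from PIL import Image
import pytesseract

PROMPT = (
    "List every BPMN element in this diagram as JSON: "
    '{"elements": [{"type": "...", "label": "..."}]}'
)

def query_vlm(image: Image.Image, prompt: str) -> str:
    """Stand-in for a real VLM call (API or local model)."""
    raise NotImplementedError

def extract_elements(image_path: str, use_ocr: bool = True) -> list[dict]:
    image = Image.open(image_path)
    prompt = PROMPT
    if use_ocr:
        # OCR-based textual enrichment: append recognized text to the prompt.
        ocr_text = pytesseract.image_to_string(image)
        prompt += f"\nText found in the diagram:\n{ocr_text}"
    return json.loads(query_vlm(image, prompt))["elements"]

def score(predicted: list[dict], gold: list[dict]) -> tuple[float, float]:
    """Precision/recall of (type, label) pairs against ground truth
    derived from the source BPMN XML files."""
    pred = {(e["type"], e["label"]) for e in predicted}
    true = {(e["type"], e["label"]) for e in gold}
    hits = len(pred & true)
    return hits / max(len(pred), 1), hits / max(len(true), 1)
```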
Related papers
- Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents [99.62178668680578]
We propose Vision-Centric Contrastive Learning (VC2L), a unified framework that models text, images, and their combinations using a single vision transformer. VC2L operates entirely in pixel space by rendering all inputs, whether textual, visual, or combined, as images. To capture complex cross-modal relationships in web documents, VC2L employs a snippet-level contrastive learning objective that aligns consecutive multimodal segments.
arXiv Detail & Related papers (2025-10-21T14:59:29Z)
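A hedged sketch of what VC2L's snippet-level contrastive objective over consecutive segments could look like, written as a standard symmetric InfoNCE loss; the encoder, rendering step, and temperature value are assumptions, not the paper's code.

```python
# Consecutive snippets of the same document, each rendered as an image and
# encoded by one vision transformer, are treated as positive pairs; other
# batch rows serve as negatives.
import torch
import torch.nn.functional as F

def snippet_contrastive_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """z_a[i] and z_b[i] embed consecutive snippets of document i."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.T / temperature                      # (B, B) similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)  # positives on diagonal
    # Symmetric InfoNCE: align a -> b and b -> a.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```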
- Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding [61.36285696607487]
Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations. Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents' multimodal nature, combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG.
arXiv Detail & Related papers (2025-10-17T02:33:16Z)
- Leveraging Machine Learning and Enhanced Parallelism Detection for BPMN Model Generation from Text [75.77648333476776]
This paper introduces an automated pipeline for extracting BPMN models from text. A key contribution of this work is the introduction of a newly annotated dataset. We augment the dataset with 15 newly annotated documents containing 32 parallel gateways for model training.
arXiv Detail & Related papers (2025-07-11T07:25:55Z)
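For readers unfamiliar with the target notation of that paper, a hand-written BPMN 2.0 fragment for the parallel-gateway construct the new annotations cover might look like the following; element IDs and task names are invented for illustration.

```python
# Two tasks split and re-joined by parallel gateways, expressed in standard
# BPMN 2.0 XML (held here as a Python string for convenience).
BPMN_PARALLEL_FRAGMENT = """\
<bpmn:parallelGateway id="split" />
<bpmn:task id="task_a" name="Check inventory" />
<bpmn:task id="task_b" name="Verify payment" />
<bpmn:parallelGateway id="join" />
<bpmn:sequenceFlow id="f1" sourceRef="split" targetRef="task_a" />
<bpmn:sequenceFlow id="f2" sourceRef="split" targetRef="task_b" />
<bpmn:sequenceFlow id="f3" sourceRef="task_a" targetRef="join" />
<bpmn:sequenceFlow id="f4" sourceRef="task_b" targetRef="join" />
"""
```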
- An analysis of vision-language models for fabric retrieval [4.311804611758908]
Cross-modal retrieval is essential for applications like information retrieval and recommendation systems. This paper investigates the use of Vision-Language Models for zero-shot text-to-image retrieval on fabric samples.
arXiv Detail & Related papers (2025-07-07T08:00:18Z)
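A minimal zero-shot text-to-image retrieval setup in that spirit, using Hugging Face's CLIP as one plausible stand-in for the VLMs evaluated; the model choice and scoring scheme here are assumptions.

```python
# Rank candidate images by CLIP text-image similarity; no fine-tuning.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_images(query: str, image_paths: list[str]) -> list[tuple[str, float]]:
    images = [Image.open(p) for p in image_paths]
    inputs = processor(text=[query], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Similarity logits between the single query and each candidate image.
    scores = out.logits_per_text.squeeze(0).tolist()
    return sorted(zip(image_paths, scores), key=lambda x: -x[1])
```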
- SCAN: Semantic Document Layout Analysis for Textual and Visual Retrieval-Augmented Generation [5.458935851230595]
We present SCAN, a novel approach enhancing both textual and visual Retrieval-Augmented Generation (RAG) systems. SCAN uses a coarse-grained semantic approach that divides documents into coherent regions covering continuous components. Our experimental results across English and Japanese datasets demonstrate that applying SCAN improves end-to-end textual RAG performance by up to 9.0% and visual RAG performance by up to 6.4%.
arXiv Detail & Related papers (2025-05-20T14:03:24Z)
- QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding [53.69841526266547]
Fine-tuning a pre-trained Vision-Language Model with new datasets often falls short in optimizing the vision encoder. We introduce QID, a novel, streamlined, architecture-preserving approach that integrates query embeddings into the vision encoder.
arXiv Detail & Related papers (2025-04-03T18:47:16Z)
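One plausible way to realize "query embeddings in the vision encoder" is to prepend projected query tokens to the patch sequence before the transformer blocks, as sketched below; this is a guess at the general idea, not QID's actual architecture.

```python
# Hypothetical wrapper: a pre-trained ViT whose self-attention also sees
# query tokens, so visual features can condition on the query.
import torch
import torch.nn as nn

class QueryInformedViT(nn.Module):
    def __init__(self, vit_blocks: nn.Module, d_model: int):
        super().__init__()
        self.vit_blocks = vit_blocks                # pre-trained transformer blocks
        self.project = nn.Linear(d_model, d_model)  # small trainable adapter

    def forward(self, patch_tokens: torch.Tensor,
                query_embeddings: torch.Tensor) -> torch.Tensor:
        # Join (B, Nq, D) query tokens with (B, Np, D) patch tokens and run
        # the unchanged encoder over the combined sequence.
        q = self.project(query_embeddings)
        return self.vit_blocks(torch.cat([q, patch_tokens], dim=1))
```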
- Overcoming Vision Language Model Challenges in Diagram Understanding: A Proof-of-Concept with XML-Driven Large Language Models Solutions [0.0]
Diagrams play a crucial role in visually conveying complex relationships and processes within business documentation. Despite recent advances in Vision-Language Models (VLMs) for various image understanding tasks, accurately identifying and extracting structures in diagrams continues to pose significant challenges. This study proposes a text-driven approach that bypasses reliance on VLMs' visual recognition capabilities.
arXiv Detail & Related papers (2025-02-05T23:40:26Z)
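That text-driven alternative reduces to prompt construction over the diagram's source markup rather than its pixels; a minimal sketch, with `call_llm` as a placeholder for any chat-completion API.

```python
# Ask an LLM about a diagram via its XML source instead of an image.
def diagram_question_prompt(diagram_xml: str, question: str) -> str:
    return (
        "The following XML describes a business-process diagram.\n"
        f"{diagram_xml}\n\n"
        f"Question: {question}\n"
        "Answer using only the structure encoded in the XML."
    )

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for an actual LLM call
```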
- CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z)
- From Dialogue to Diagram: Task and Relationship Extraction from Natural Language for Accelerated Business Process Prototyping [0.0]
This paper introduces a contemporary solution whose approach centers on dependency parsing and Named Entity Recognition (NER).
We utilize Subject-Verb-Object (SVO) constructs for identifying action relationships and integrate semantic analysis tools, including WordNet, for enriched contextual understanding.
The system adeptly handles data transformation and visualization, converting verbose extracted information into BPMN (Business Process Model and Notation) diagrams.
arXiv Detail & Related papers (2023-12-16T12:35:28Z)
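A simplified version of the SVO step using spaCy's dependency parse; the dependency-label heuristics below are illustrative and far coarser than the paper's.

```python
# Extract (subject, verb, object) triples from a sentence via dependency
# labels on the verb's children.
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline with parser + NER

def extract_svo(sentence: str) -> list[tuple[str, str, str]]:
    doc = nlp(sentence)
    triples = []
    for token in doc:
        if token.pos_ != "VERB":
            continue
        subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
        objects = [c for c in token.children if c.dep_ in ("dobj", "obj")]
        for s in subjects:
            for o in objects:
                triples.append((s.text, token.lemma_, o.text))
    return triples

# extract_svo("The clerk reviews the invoice.") -> [("clerk", "review", "invoice")]
```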
- Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose a novel Multi-modal Retrieval based framework (MoRe).
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve related knowledge for the input text and image, respectively, from the knowledge corpus.
arXiv Detail & Related papers (2022-12-03T13:11:32Z)
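Schematically, MoRe's two retrieval channels could be combined as below; the retriever objects are placeholders for whatever text and image indices the framework actually uses.

```python
# Fetch related knowledge for a multimodal NER/RE input from both channels
# and hand the combined context to the downstream tagger.
def retrieve_multimodal_context(text, image, text_retriever, image_retriever, k=5):
    text_hits = text_retriever.search(text, top_k=k)     # e.g. a textual KB index
    image_hits = image_retriever.search(image, top_k=k)  # e.g. an image-keyed index
    return text_hits + image_hits
```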
- VSR: A Unified Framework for Document Layout Analysis combining Vision, Semantics and Relations [40.721146438291335]
We propose a unified framework VSR for document layout analysis, combining vision, semantics and relations.
On three popular benchmarks, VSR outperforms previous models by large margins.
arXiv Detail & Related papers (2021-05-13T12:20:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.