A Multimodal Pipeline for Clinical Data Extraction: Applying Vision-Language Models to Scans of Transfusion Reaction Reports
- URL: http://arxiv.org/abs/2504.20220v1
- Date: Mon, 28 Apr 2025 19:40:28 GMT
- Title: A Multimodal Pipeline for Clinical Data Extraction: Applying Vision-Language Models to Scans of Transfusion Reaction Reports
- Authors: Henning Schäfer, Cynthia S. Schmidt, Johannes Wutzkowsky, Kamil Lorek, Lea Reinartz, Johannes Rückert, Christian Temme, Britta Böckmann, Peter A. Horn, Christoph M. Friedrich
- Abstract summary: This study presents an open-source pipeline that extracts and categorizes checkbox data from scanned documents. The pipeline achieves high precision and recall compared against annually compiled gold-standards from 2017 to 2024.
- Score: 0.3552186988607578
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the growing adoption of electronic health records, many processes still rely on paper documents, reflecting the heterogeneous real-world conditions in which healthcare is delivered. The manual transcription process is time-consuming and prone to errors when transferring paper-based data to digital formats. To streamline this workflow, this study presents an open-source pipeline that extracts and categorizes checkbox data from scanned documents. Demonstrated on transfusion reaction reports, the design supports adaptation to other checkbox-rich document types. The proposed method integrates checkbox detection, multilingual optical character recognition (OCR) and multilingual vision-language models (VLMs). The pipeline achieves high precision and recall compared against annually compiled gold-standards from 2017 to 2024. The result is a reduction in administrative workload and accurate regulatory reporting. The open-source availability of this pipeline encourages self-hosted parsing of checkbox forms.
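The abstract reports precision and recall against gold standards but this listing does not show how they are computed. The following is a minimal sketch of one common way to score checkbox extraction, assuming each ticked box is represented as a hypothetical `(report_id, checkbox_label)` pair; the function name, the data layout, and the example labels are illustrative assumptions, not taken from the paper or its code.

```python
from typing import Set, Tuple

# Assumption: a ticked checkbox is identified by (report_id, checkbox_label).
Ticked = Tuple[str, str]


def precision_recall(predicted: Set[Ticked], gold: Set[Ticked]) -> Tuple[float, float]:
    """Return (precision, recall) of predicted ticked boxes against a gold set."""
    true_positives = len(predicted & gold)
    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example data, not from the study.
gold = {("r1", "fever"), ("r1", "chills"), ("r2", "urticaria")}
predicted = {("r1", "fever"), ("r2", "urticaria"), ("r2", "dyspnea")}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case two of the three predicted boxes are correct and two of the three gold boxes are found, so both values come out at two thirds. The paper may aggregate per year or per category instead.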
Related papers
- Retrieval or Representation? Reassessing Benchmark Gaps in Multilingual and Visually Rich RAG [1.4425299138308667]
BM25 ranks documents by term overlap with corpus-level weighting. End-to-end multimodal retrievers trained on large query-document datasets claim substantial improvements over these approaches. We demonstrate that better document representation is the primary driver of benchmark improvements.
arXiv Detail & Related papers (2026-03-04T16:21:20Z) - Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding [102.88996030431662]
We propose a training-free and highly efficient acceleration method for document parsing tasks. Inspired by speculative decoding, we employ a lightweight document parsing pipeline as a draft model to predict batches of future tokens. We demonstrate the effectiveness of our approach on the general-purpose OmniDocBench.
arXiv Detail & Related papers (2026-02-13T14:22:10Z) - A Hybrid Architecture for Multi-Stage Claim Document Understanding: Combining Vision-Language Models and Machine Learning for Real-Time Processing [0.0]
Claims documents are fundamental to healthcare and insurance operations, serving as the basis for reimbursement, auditing, and compliance. This paper presents a robust multi-stage pipeline that integrates the multilingual optical character recognition (OCR) engine PaddleOCR, a traditional Logistic Regression, and a compact Vision-Language Model (VLM), Qwen 2.5-VL-7B. The proposed system achieves a document-type classification accuracy of over 95 percent and a field-level extraction accuracy of approximately 87 percent, while maintaining an average processing latency of under 2 seconds per document.
arXiv Detail & Related papers (2026-01-05T08:40:44Z) - Document Data Matching for Blockchain-Supported Real Estate [2.9873162504735133]
This work presents a system that integrates optical character recognition (OCR), natural language processing (NLP), and verifiable credentials (VCs) to automate document extraction, verification, and management. The approach standardizes heterogeneous document formats into VCs and applies automated data matching to detect inconsistencies, while the blockchain provides a decentralized trust layer that reinforces transparency and integrity. The proposed framework demonstrates the potential to streamline real estate transactions, strengthen stakeholder trust, and enable scalable, secure digital processes.
arXiv Detail & Related papers (2025-12-30T20:30:48Z) - MedDCR: Learning to Design Agentic Workflows for Medical Coding [55.51674334874892]
Medical coding converts free-text clinical notes into standardized diagnostic and procedural codes. We present MedDCR, a closed-loop framework that treats design as a learning problem. On benchmark datasets, MedDCR outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2025-11-17T13:30:51Z) - Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing [37.052999707460636]
layoutRL is an end-to-end reinforcement learning framework that trains models to be explicitly layout-aware. We will publicly release our code and dataset to accelerate progress in robust document understanding.
arXiv Detail & Related papers (2025-06-01T15:19:52Z) - DocSpiral: A Platform for Integrated Assistive Document Annotation through Human-in-the-Spiral [11.336757553731639]
Acquiring structured data from domain-specific, image-based documents is crucial for many downstream tasks. Many documents exist as images rather than as machine-readable text, which requires human annotation to train automated extraction systems. We present DocSpiral, the first Human-in-the-Spiral assistive document annotation platform.
arXiv Detail & Related papers (2025-05-06T06:02:42Z) - An Efficient Deep Learning-Based Approach to Automating Invoice Document Validation [0.0]
We propose to automate the validation of machine written invoices using document layout analysis and object detection techniques. We introduce a novel dataset consisting of manually annotated real-world invoices and a multi-criteria validation process.
arXiv Detail & Related papers (2025-03-15T21:33:00Z) - Learning Refined Document Representations for Dense Retrieval via Deliberate Thinking [58.69615583599489]
Deliberate Thinking based Retriever (Debater) is a novel approach that enhances document representations by incorporating a step-by-step thinking process. Debater significantly outperforms existing methods across several retrieval benchmarks.
arXiv Detail & Related papers (2025-02-18T15:56:34Z) - Information Extraction from Heterogeneous Documents without Ground Truth Labels using Synthetic Label Generation and Knowledge Distillation [0.2302001830524133]
We propose Task Aware Instruction-based Labelling (TAIL), a method for synthetic label generation in VRD corpuses without labels.
We fine-tune a multimodal Visually Rich Document Understanding Model (VRDU) on TAIL labels using response-based knowledge distillation.
We show that the resulting model performs at par or better on the internal expense documents of a large multinational organization than state-of-the-art LMM.
arXiv Detail & Related papers (2024-11-22T14:16:09Z) - Give Me More Details: Improving Fact-Checking with Latent Retrieval [58.706972228039604]
Evidence plays a crucial role in automated fact-checking.
Existing fact-checking systems either assume the evidence sentences are given or use the search snippets returned by the search engine.
We propose to incorporate full text from source documents as evidence and introduce two enriched datasets.
arXiv Detail & Related papers (2023-05-25T15:01:19Z) - Document Flattening: Beyond Concatenating Context for Document-Level Neural Machine Translation [45.56189820979461]
The Document Flattening (DocFlat) technique integrates Flat-Batch Attention (FB) and Neural Context Gate (NCG) into the Transformer model.
We conduct comprehensive experiments and analyses on three benchmark datasets for English-German translation.
arXiv Detail & Related papers (2023-02-16T04:38:34Z) - Generate rather than Retrieve: Large Language Models are Strong Context Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators.
We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer.
arXiv Detail & Related papers (2022-09-21T01:30:59Z) - Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z) - GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidences in a generative fashion.
The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z) - DocScanner: Robust Document Image Rectification with Progressive Learning [162.03694280524084]
This work presents DocScanner, a new deep network architecture for document image rectification.
DocScanner maintains a single estimate of the rectified image, which is progressively corrected with a recurrent architecture.
The iterative refinements make DocScanner converge to a robust and superior performance, and the lightweight recurrent architecture ensures the running efficiency.
arXiv Detail & Related papers (2021-10-28T09:15:02Z) - Automated Generation of Accurate & Fluent Medical X-ray Reports [17.927768992248172]
The paper focuses on automating the generation of medical reports from chest X-ray image inputs.
Our approach achieved promising results on commonly-used metrics concerning language fluency and clinical accuracy.
arXiv Detail & Related papers (2021-08-27T05:47:28Z) - Learning Contextualized Document Representations for Healthcare Answer Retrieval [68.02029435111193]
Contextual Discourse Vectors (CDV) is a distributed document representation for efficient answer retrieval from long documents.
Our model leverages a dual encoder architecture with hierarchical LSTM layers and multi-task training to encode the position of clinical entities and aspects alongside the document discourse.
We show that our generalized model significantly outperforms several state-of-the-art baselines for healthcare passage ranking.
arXiv Detail & Related papers (2020-02-03T15:47:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.