Multimodal Document Analytics for Banking Process Automation
- URL: http://arxiv.org/abs/2307.11845v2
- Date: Sun, 26 Nov 2023 08:57:44 GMT
- Title: Multimodal Document Analytics for Banking Process Automation
- Authors: Christopher Gerling, Stefan Lessmann
- Abstract summary: The paper contributes original empirical evidence on the effectiveness and efficiency of multimodal models for document processing in the banking business.
It offers practical guidance on how to unlock this potential in day-to-day operations.
- Score: 4.541582055558865
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Traditional banks face increasing competition from FinTechs in the rapidly
evolving financial ecosystem. Raising operational efficiency is vital to
address this challenge. Our study aims to improve the efficiency of
document-intensive business processes in banking. To that end, we first review
the landscape of business documents in the retail segment. Banking documents
often contain text, layout, and visuals, suggesting that document analytics and
process automation require more than plain natural language processing (NLP).
To verify this and assess the incremental value of visual cues when processing
business documents, we compare a recently proposed multimodal model called
LayoutXLM to powerful text classifiers (e.g., BERT) and large language models
(e.g., GPT) in a case study related to processing company register extracts.
The results confirm that incorporating layout information in a model
substantially increases its performance. Interestingly, we also observed that
more than 75% of the best model performance (in terms of the F1 score) can be
achieved with as little as 30% of the training data. This shows that the demand
for labeled data to set up a multimodal model can be moderate, which
simplifies real-world applications of multimodal document analytics. Our study
also sheds light on more specific practices in the scope of calibrating a
multimodal banking document classifier, including the need for fine-tuning. In
sum, the paper contributes original empirical evidence on the effectiveness and
efficiency of multimodal models for document processing in the banking
business and offers practical guidance on how to unlock this potential in
day-to-day operations.
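To make the comparison concrete, below is a minimal sketch of fine-tuning a LayoutXLM-family classifier for document-type classification, assuming the Hugging Face transformers library. The checkpoint name is real, but the label set, file path, and single-step training loop are illustrative stand-ins, not the authors' released setup.

```python
# Minimal sketch, not the authors' code: fine-tuning LayoutXLM for
# document classification with Hugging Face transformers. LayoutXLM
# reuses the LayoutLMv2 architecture, so the LayoutLMv2 model classes
# apply. Note: LayoutLMv2-family models also require detectron2 for
# the visual backbone. Labels and the file path are hypothetical.
import torch
from PIL import Image
from transformers import LayoutLMv2ForSequenceClassification, LayoutXLMProcessor

labels = ["company_register_extract", "other"]  # illustrative classes

# The processor runs OCR (pytesseract by default) to extract words and
# their bounding boxes, so text AND layout enter the model jointly.
processor = LayoutXLMProcessor.from_pretrained("microsoft/layoutxlm-base")
model = LayoutLMv2ForSequenceClassification.from_pretrained(
    "microsoft/layoutxlm-base", num_labels=len(labels)
)

image = Image.open("register_extract_page.png").convert("RGB")
encoding = processor(image, return_tensors="pt", truncation=True)

# A single illustrative training step; in practice this loops over a
# labeled DataLoader for several epochs.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
outputs = model(**encoding, labels=torch.tensor([0]))
outputs.loss.backward()
optimizer.step()
```

A text-only baseline such as BERT would consume just the OCR token sequence and drop the bounding-box and image inputs; the gap between the two setups is exactly the incremental value of layout that the study measures.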
Related papers
- Memory-Augmented Agent Training for Business Document Understanding [16.143076522786803]
We introduce Matrix (Memory-Augmented agent Training through Reasoning and Iterative eXploration), a novel paradigm that enables LLM agents to progressively build domain expertise.
We collaborate with one of the world's largest logistics companies to create a dataset of Universal Business Language format invoice documents.
Experiments demonstrate that Matrix outperforms prompting a single LLM by 30.3% and a vanilla LLM agent by 35.2%.
arXiv Detail & Related papers (2024-12-17T18:35:04Z)
- VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation [100.06122876025063]
This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings.
We propose VisDoMRAG, a novel multimodal Retrieval Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG.
arXiv Detail & Related papers (2024-12-14T06:24:55Z)
- Multi-modal Retrieval Augmented Multi-modal Generation: Datasets, Evaluation Metrics and Strong Baselines [64.61315565501681]
Multi-modal Retrieval Augmented Multi-modal Generation (M$2$RAG) is a novel task that enables foundation models to process multi-modal web content.
Despite its potential impact, M$2$RAG remains understudied, lacking comprehensive analysis and high-quality data resources.
arXiv Detail & Related papers (2024-11-25T13:20:19Z)
- M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework [75.95430061891828]
We introduce M-LongDoc, a benchmark of 851 samples, and an automated framework to evaluate the performance of large multimodal models.
We propose a retrieval-aware tuning approach for efficient and effective multimodal document reading.
arXiv Detail & Related papers (2024-11-09T13:30:38Z)
- MetaSumPerceiver: Multimodal Multi-Document Evidence Summarization for Fact-Checking [0.283600654802951]
We present a summarization model designed to generate claim-specific summaries useful for fact-checking from multimodal datasets.
We introduce a dynamic perceiver-based model that can handle inputs from multiple modalities of arbitrary lengths.
Our approach outperforms the SOTA approach by 4.6% in the claim verification task on the MOCHEG dataset.
arXiv Detail & Related papers (2024-07-18T01:33:20Z)
- LongFin: A Multimodal Document Understanding Model for Long Financial Domain Documents [4.924255992661131]
We introduce LongFin, a multimodal document AI model capable of encoding up to 4K tokens.
We also propose the LongForms dataset that encapsulates several industrial challenges in financial documents.
arXiv Detail & Related papers (2024-01-26T18:23:45Z)
- On Task-personalized Multimodal Few-shot Learning for Visually-rich Document Entity Retrieval [59.25292920967197]
Few-shot visually-rich document entity retrieval (VDER) is an important topic in industrial NLP applications.
FewVEX is a new dataset to boost future research in the field of entity-level few-shot VDER.
We present a task-aware meta-learning based framework, with a central focus on achieving effective task personalization.
arXiv Detail & Related papers (2023-11-01T17:51:43Z)
- Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering [49.85790367128085]
We pre-train a generic multi-document model with a novel cross-document question-answering pre-training objective.
This novel multi-document QA formulation directs the model to better recover cross-text informational relations.
Unlike prior multi-document models that focus on either classification or summarization tasks, our pre-training objective formulation enables the model to perform tasks that involve both short text generation and long text generation.
arXiv Detail & Related papers (2023-05-24T17:48:40Z)
- FETILDA: An Effective Framework For Fin-tuned Embeddings For Long Financial Text Documents [14.269860621624394]
We propose and implement a deep learning framework that splits long documents into chunks and utilizes pre-trained LMs to process and aggregate the chunks into vector representations (see the chunk-and-aggregate sketch after this entry).
We evaluate our framework on a collection of 10-K public disclosure reports from US banks, and another dataset of reports submitted by US companies.
arXiv Detail & Related papers (2022-06-14T16:14:14Z)
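As a rough illustration of this chunk-and-aggregate pattern, here is a minimal sketch assuming BERT as the pre-trained LM, crude word-level chunking, and mean pooling as the aggregator; FETILDA's actual splitting and aggregation may differ.

```python
# Minimal sketch of chunk-and-aggregate document embedding: split a long
# document, embed each chunk with a pre-trained LM, and pool the chunk
# vectors. BERT, word-level splitting, and mean pooling are assumptions,
# not necessarily FETILDA's exact design.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed_long_document(text: str, words_per_chunk: int = 400) -> torch.Tensor:
    words = text.split()
    chunk_vectors = []
    with torch.no_grad():
        for i in range(0, len(words), words_per_chunk):
            chunk = " ".join(words[i:i + words_per_chunk])
            inputs = tokenizer(chunk, return_tensors="pt",
                               truncation=True, max_length=512)
            out = model(**inputs)
            chunk_vectors.append(out.last_hidden_state[:, 0])  # [CLS] vector
    # mean-pool the per-chunk [CLS] vectors into one document vector
    return torch.cat(chunk_vectors).mean(dim=0)
```

The pooled document vector can then feed any downstream classifier or regressor, e.g. over the 10-K disclosure reports the entry mentions.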
- Data-Efficient Information Extraction from Form-Like Documents [14.567098292973075]
A key challenge is that form-like documents can be laid out in virtually infinitely many ways.
Data efficiency is critical to enable information extraction systems to scale to handle hundreds of different document-types.
arXiv Detail & Related papers (2022-01-07T19:16:49Z)
- Single-Modal Entropy based Active Learning for Visual Question Answering [75.1682163844354]
We address Active Learning in the multi-modal setting of Visual Question Answering (VQA).
In light of the multi-modal inputs, image and question, we propose a novel method for effective sample acquisition.
Our novel idea is simple to implement, cost-efficient, and readily adaptable to other multi-modal tasks (see the entropy-scoring sketch after this entry).
arXiv Detail & Related papers (2021-10-21T05:38:45Z)
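For intuition, a minimal sketch of entropy-based acquisition with a generic classifier follows. The max-entropy scoring shown here is the standard heuristic the title alludes to; the paper's concrete single-modal formulation for VQA may differ.

```python
# Minimal sketch of entropy-based active learning acquisition: score
# each unlabeled example by the entropy of the model's predictive
# distribution and request labels for the most uncertain ones. This is
# the generic max-entropy heuristic, not the paper's exact criterion.
import torch
import torch.nn.functional as F

def select_for_labeling(model: torch.nn.Module,
                        unlabeled: torch.Tensor,
                        budget: int) -> torch.Tensor:
    """Return indices of the `budget` most uncertain unlabeled examples."""
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(unlabeled), dim=-1)            # (N, C)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1)  # (N,)
    return torch.topk(entropy, k=budget).indices
```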
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.