Multimodal Document Analytics for Banking Process Automation
- URL: http://arxiv.org/abs/2307.11845v2
- Date: Sun, 26 Nov 2023 08:57:44 GMT
- Title: Multimodal Document Analytics for Banking Process Automation
- Authors: Christopher Gerling, Stefan Lessmann
- Abstract summary: The paper contributes original empirical evidence on the effectiveness and efficiency of multimodal models for document processing in the banking business.
It offers practical guidance on how to unlock this potential in day-to-day operations.
- Score: 4.541582055558865
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Traditional banks face increasing competition from FinTechs in the rapidly
evolving financial ecosystem. Raising operational efficiency is vital to
address this challenge. Our study aims to improve the efficiency of
document-intensive business processes in banking. To that end, we first review
the landscape of business documents in the retail segment. Banking documents
often contain text, layout, and visuals, suggesting that document analytics and
process automation require more than plain natural language processing (NLP).
To verify this and assess the incremental value of visual cues when processing
business documents, we compare a recently proposed multimodal model called
LayoutXLM to powerful text classifiers (e.g., BERT) and large language models
(e.g., GPT) in a case study related to processing company register extracts.
The results confirm that incorporating layout information in a model
substantially increases its performance. Interestingly, we also observed that
more than 75% of the best model performance (in terms of the F1 score) can be
achieved with as little as 30% of the training data. This shows that the demand
for labeled data to set up a multimodal model can be moderate, which
simplifies real-world applications of multimodal document analytics. Our study
also sheds light on specific practices for calibrating a
multimodal banking document classifier, including the need for fine-tuning. In
sum, the paper contributes original empirical evidence on the effectiveness and
efficiency of multimodal models for document processing in the banking
business and offers practical guidance on how to unlock this potential in
day-to-day operations.
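To make the kind of setup discussed in the abstract concrete, the sketch below shows how a LayoutXLM-style multimodal classifier could be fine-tuned for banking document classification with the Hugging Face Transformers API. This is a minimal illustration, not the authors' implementation: the document classes, file path, OCR words, and bounding boxes are hypothetical placeholders, and the paper's actual labels, data, and hyperparameters are not reproduced here.

```python
# Minimal sketch: fine-tuning LayoutXLM for document classification.
# Classes, file names, words, and boxes below are hypothetical examples.
# Note: LayoutXLM/LayoutLMv2's visual backbone requires detectron2 to be installed.
import torch
from PIL import Image
from transformers import LayoutXLMProcessor, LayoutLMv2ForSequenceClassification

LABELS = ["register_extract", "id_document", "payslip", "other"]  # hypothetical classes

# LayoutXLM checkpoints plug into the LayoutLMv2 model classes.
processor = LayoutXLMProcessor.from_pretrained(
    "microsoft/layoutxlm-base", apply_ocr=False  # we supply OCR words/boxes ourselves
)
model = LayoutLMv2ForSequenceClassification.from_pretrained(
    "microsoft/layoutxlm-base", num_labels=len(LABELS)
)

def encode_page(image_path, words, boxes):
    """Encode one scanned page. `words`/`boxes` come from an external OCR engine;
    pixel boxes (x0, y0, x1, y1) are normalized to the 0-1000 range LayoutXLM expects."""
    image = Image.open(image_path).convert("RGB")
    w, h = image.size
    norm_boxes = [
        [int(1000 * x0 / w), int(1000 * y0 / h), int(1000 * x1 / w), int(1000 * y1 / h)]
        for (x0, y0, x1, y1) in boxes
    ]
    return processor(
        image, words, boxes=norm_boxes,
        truncation=True, padding="max_length", max_length=512, return_tensors="pt",
    )

# Single training step; in practice the labeled pages would go through a DataLoader or Trainer.
encoding = encode_page(
    "register_extract_page1.png",  # hypothetical scan
    words=["Handelsregister", "Abteilung", "B"],
    boxes=[(34, 20, 210, 45), (220, 20, 310, 45), (315, 20, 330, 45)],
)
labels = torch.tensor([LABELS.index("register_extract")])
outputs = model(**encoding, labels=labels)
outputs.loss.backward()  # pair with an optimizer such as AdamW for fine-tuning
```

The same loop can be run with a text-only baseline (e.g., a BERT sequence classifier over the OCR text alone) to estimate the incremental value of the layout and image inputs, which is the comparison the study reports.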
Related papers
- MetaSumPerceiver: Multimodal Multi-Document Evidence Summarization for Fact-Checking [0.283600654802951]
We present a summarization model designed to generate claim-specific summaries useful for fact-checking from multimodal datasets.
We introduce a dynamic perceiver-based model that can handle inputs from multiple modalities of arbitrary lengths.
Our approach outperforms the SOTA approach by 4.6% in the claim verification task on the MOCHEG dataset.
arXiv Detail & Related papers (2024-07-18T01:33:20Z)
- LongFin: A Multimodal Document Understanding Model for Long Financial Domain Documents [4.924255992661131]
We introduce LongFin, a multimodal document AI model capable of encoding up to 4K tokens.
We also propose the LongForms dataset that encapsulates several industrial challenges in financial documents.
arXiv Detail & Related papers (2024-01-26T18:23:45Z)
- On Task-personalized Multimodal Few-shot Learning for Visually-rich Document Entity Retrieval [59.25292920967197]
Few-shot visually-rich document entity retrieval (VDER) is an important topic in industrial NLP applications.
FewVEX is a new dataset to boost future research in the field of entity-level few-shot VDER.
We present a task-aware meta-learning based framework, with a central focus on achieving effective task personalization.
arXiv Detail & Related papers (2023-11-01T17:51:43Z)
- Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering [49.85790367128085]
We pre-train a generic multi-document model with a novel cross-document question answering pre-training objective.
This novel multi-document QA formulation directs the model to better recover cross-text informational relations.
Unlike prior multi-document models that focus on either classification or summarization tasks, our pre-training objective formulation enables the model to perform tasks that involve both short text generation and long text generation.
arXiv Detail & Related papers (2023-05-24T17:48:40Z)
- A Multi-Task BERT Model for Schema-Guided Dialogue State Tracking [78.2700757742992]
Task-oriented dialogue systems often employ a Dialogue State Tracker (DST) to successfully complete conversations.
Recent state-of-the-art DST implementations rely on schemata of diverse services to improve model robustness.
We propose a single multi-task BERT-based model that jointly solves the three DST tasks of intent prediction, requested slot prediction and slot filling.
arXiv Detail & Related papers (2022-07-02T13:27:59Z)
- FETILDA: An Effective Framework For Fin-tuned Embeddings For Long Financial Text Documents [14.269860621624394]
We propose and implement a deep learning framework that splits long documents into chunks and utilizes pre-trained LMs to process and aggregate the chunks into vector representations.
We evaluate our framework on a collection of 10-K public disclosure reports from US banks, and another dataset of reports submitted by US companies.
arXiv Detail & Related papers (2022-06-14T16:14:14Z)
- Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects.
Our framework assumes a hierarchical latent structure of a document where the top-level captures the long range dependency.
We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z)
- Data-Efficient Information Extraction from Form-Like Documents [14.567098292973075]
A key challenge is that form-like documents can be laid out in virtually infinitely many ways.
Data efficiency is critical to enable information extraction systems to scale to handle hundreds of different document-types.
arXiv Detail & Related papers (2022-01-07T19:16:49Z)
- Single-Modal Entropy based Active Learning for Visual Question Answering [75.1682163844354]
We address Active Learning in the multi-modal setting of Visual Question Answering (VQA).
In light of the multi-modal inputs, image and question, we propose a novel method for effective sample acquisition.
Our novel idea is simple to implement, cost-efficient, and readily adaptable to other multi-modal tasks.
arXiv Detail & Related papers (2021-10-21T05:38:45Z)
- An Intelligent Hybrid Model for Identity Document Classification [0.0]
Digitization may provide opportunities (e.g., increase in productivity, disaster recovery, and environmentally friendly solutions) and challenges for businesses.
One of the main challenges would be to accurately classify numerous scanned documents uploaded every day by customers.
Few studies have addressed this challenge as an image classification problem.
The proposed approach has been implemented using Python and experimentally validated on synthetic and real-world datasets.
arXiv Detail & Related papers (2021-06-07T13:08:00Z)
- Explaining Relationships Between Scientific Documents [55.23390424044378]
We address the task of explaining relationships between two scientific documents using natural language text.
In this paper we establish a dataset of 622K examples from 154K documents.
arXiv Detail & Related papers (2020-02-02T03:54:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.