Related papers: SynFinTabs: A Dataset of Synthetic Financial Tables for Information and Table Extraction

URL: http://arxiv.org/abs/2412.04262v1
Date: Thu, 05 Dec 2024 15:42:59 GMT
Title: SynFinTabs: A Dataset of Synthetic Financial Tables for Information and Table Extraction
Authors: Ethan Bradley, Muhammad Roman, Karen Rafferty, Barry Devereux
Abstract summary: Existing datasets often focus on scientific tables due to the vast amount of academic articles. Current datasets often lack the words, and their positions, contained within the tables. We present SynFinTabs, a large-scale, labelled dataset of synthetic financial tables.
Score: 1.0624606551524207
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Table extraction from document images is a challenging AI problem, and labelled data for many content domains is difficult to come by. Existing table extraction datasets often focus on scientific tables due to the vast amount of academic articles that are readily available, along with their source code. However, there are significant layout and typographical differences between tables found across scientific, financial, and other domains. Current datasets often lack the words, and their positions, contained within the tables, instead relying on unreliable OCR to extract these features for training modern machine learning models on natural language processing tasks. Therefore, there is a need for a more general method of obtaining labelled data. We present SynFinTabs, a large-scale, labelled dataset of synthetic financial tables. Our hope is that our method of generating these synthetic tables is transferable to other domains. To demonstrate the effectiveness of our dataset in training models to extract information from table images, we create FinTabQA, a layout large language model trained on an extractive question-answering task. We test our model using real-world financial tables and compare it to a state-of-the-art generative model and discuss the results. We make the dataset, model, and dataset generation code publicly available.

Related papers

Table Understanding and (Multimodal) LLMs: A Cross-Domain Case Study on Scientific vs. Non-Scientific Data [12.56716294438794]
We investigate the effectiveness of both text-based and multimodal LLMs on table understanding tasks. We compare their performance on tables from scientific vs. non-scientific contexts and examine their robustness on tables represented as images vs. text.
arXiv Detail & Related papers (2025-06-30T18:04:36Z)
LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets. LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets. We find that LaTable outperforms baselines on in-distribution generation, and that finetuning LaTable can generate out-of-distribution datasets better with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z)
Synthesizing Realistic Data for Table Recognition [4.500373384879752]
We propose a novel method for synthesizing annotation data specifically designed for table recognition. By leveraging the structure and content of tables from Chinese financial announcements, we have developed the first extensive table annotation dataset. We have established the inaugural benchmark for real-world complex tables in the Chinese financial announcement domain, using it to assess the performance of models trained on our synthetic data.
arXiv Detail & Related papers (2024-04-17T06:36:17Z)
Large Language Model for Table Processing: A Survey [18.32332372134988]
This survey provides a comprehensive overview of table-related tasks. It covers traditional tasks like table question answering as well as emerging fields such as spreadsheet manipulation and table data analysis.
arXiv Detail & Related papers (2024-02-04T00:47:53Z)
TAP4LLM: Table Provider on Sampling, Augmenting, and Packing Semi-structured Data for Large Language Model Reasoning [55.33939289989238]
We propose TAP4LLM as a versatile pre-processor suite for leveraging large language models (LLMs) in table-based tasks effectively. It covers several distinct components: (1) table sampling to decompose large tables into manageable sub-tables based on query semantics, (2) table augmentation to enhance tables with additional knowledge from external sources or models, and (3) table packing & serialization to convert tables into various formats suitable for LLMs' understanding.
arXiv Detail & Related papers (2023-12-14T15:37:04Z)
Relational Deep Learning: Graph Representation Learning on Relational Databases [69.7008152388055]
We introduce an end-to-end representation approach to learn on data laid out across multiple tables. Message Passing Graph Neural Networks can then automatically learn across the graph to extract representations that leverage all data input.
arXiv Detail & Related papers (2023-12-07T18:51:41Z)
Schema-Driven Information Extraction from Heterogeneous Tables [37.50854811537401]
We present a benchmark comprised of tables from four diverse domains: machine learning papers, chemistry literature, material science journals, and webpages. Our experiments demonstrate that surprisingly competitive performance can be achieved without requiring task-specific pipelines or labels.
arXiv Detail & Related papers (2023-05-23T17:58:10Z)
OmniTab: Pretraining with Natural and Synthetic Data for Few-shot Table-based Question Answering [106.73213656603453]
We develop a simple table-based QA model with minimal annotation effort. We propose an omnivorous pretraining approach that consumes both natural and synthetic data.
arXiv Detail & Related papers (2022-07-08T01:23:45Z)
DiSCoMaT: Distantly Supervised Composition Extraction from Tables in Materials Science Articles [25.907266860321727]
We define a novel NLP task of extracting compositions of materials from tables in materials science papers. We release a training dataset comprising 4,408 distantly supervised tables, along with 1,475 manually annotated dev and test tables. We show that DiSCoMaT outperforms recent table processing architectures by significant margins.
arXiv Detail & Related papers (2022-07-03T17:11:17Z)
TabLeX: A Benchmark Dataset for Structure and Content Information Extraction from Scientific Tables [1.4115224153549193]
This paper presents TabLeX, a large-scale benchmark dataset comprising table images generated from scientific articles. To facilitate the development of robust table IE tools, TabLeX contains images in different aspect ratios and in a variety of fonts. Our analysis sheds light on the shortcomings of current state-of-the-art table extraction models and shows that they fail on even simple table images.
arXiv Detail & Related papers (2021-05-12T05:13:38Z)
A Graph Representation of Semi-structured Data for Web Question Answering [96.46484690047491]
We propose a novel graph representation of Web tables and lists based on a systematic categorization of the components in semi-structured data as well as their relations. Our method improves F1 score by 3.90 points over the state-of-the-art baselines.
arXiv Detail & Related papers (2020-10-14T04:01:54Z)
GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing [117.98107557103877]
We present GraPPa, an effective pre-training approach for table semantic parsing. We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar. To maintain the model's ability to represent real-world data, we also include masked language modeling.
arXiv Detail & Related papers (2020-09-29T08:17:58Z)
GFTE: Graph-based Financial Table Extraction [66.26206038522339]
In the financial industry and many other fields, tables are often disclosed in unstructured digital files, e.g. Portable Document Format (PDF) files and images. We publish a standard Chinese dataset named FinTab, which contains more than 1,600 financial tables of diverse kinds. We propose a novel graph-based convolutional network model named GFTE as a baseline for future comparison.
arXiv Detail & Related papers (2020-03-17T07:10:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.