Synthesizing Realistic Data for Table Recognition
- URL: http://arxiv.org/abs/2404.11100v2
- Date: Tue, 9 Jul 2024 12:09:32 GMT
- Title: Synthesizing Realistic Data for Table Recognition
- Authors: Qiyu Hou, Jun Wang, Meixuan Qiao, Lujun Tian
- Abstract summary: We propose a novel method for synthesizing annotation data specifically designed for table recognition.
By leveraging the structure and content of tables from Chinese financial announcements, we have developed the first extensive table annotation dataset.
We have established the inaugural benchmark for real-world complex tables in the Chinese financial announcement domain, using it to assess the performance of models trained on our synthetic data.
- Score: 4.500373384879752
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:   To overcome the limitations and challenges of current automatic table data annotation methods and random table data synthesis approaches, we propose a novel method for synthesizing annotation data specifically designed for table recognition. This method utilizes the structure and content of existing complex tables, facilitating the efficient creation of tables that closely replicate the authentic styles found in the target domain. By leveraging the actual structure and content of tables from Chinese financial announcements, we have developed the first extensive table annotation dataset in this domain. We used this dataset to train several recent deep learning-based end-to-end table recognition models. Additionally, we have established the inaugural benchmark for real-world complex tables in the Chinese financial announcement domain, using it to assess the performance of models trained on our synthetic data, thereby effectively validating our method's practicality and effectiveness. Furthermore, we applied our synthesis method to augment the FinTabNet dataset, extracted from English financial announcements, by increasing the proportion of tables with multiple spanning cells to introduce greater complexity. Our experiments show that models trained on this augmented dataset achieve comprehensive improvements in performance, especially in the recognition of tables with multiple spanning cells. 
 
      
        Related papers
- Generating Synthetic Relational Tabular Data via Structural Causal Models [0.0]
 We develop a novel framework that generates realistic synthetic relational data including causal relationships across tables.
Our experiments confirm that this framework is able to construct relational datasets with complex inter-table dependencies mimicking real-world scenarios.
 arXiv  Detail & Related papers  (2025-07-04T12:27:23Z)
- RelDiff: Relational Data Generative Modeling with Graph-Based Diffusion Models [83.6013616017646]
 RelDiff is a novel diffusion generative model that synthesizes complete relational databases by explicitly modeling their foreign key graph structure.
RelDiff consistently outperforms prior methods in producing realistic and coherent synthetic relational databases.
 arXiv  Detail & Related papers  (2025-05-31T21:01:02Z)
- LLM-TabFlow: Synthetic Tabular Data Generation with Inter-column Logical Relationship Preservation [49.898152180805454]
 This study is the first to explicitly address inter-column relationship preservation in synthetic tabular data generation.
LLM-TabFlow is a novel approach that captures complex inter-column relationships and compresses data, while using Score-based Diffusion to model the distribution of the compressed data in latent space.
Our results show that LLM-TabFlow outperforms all baselines, fully preserving inter-column relationships while achieving the best balance between data fidelity, utility, and privacy.
 arXiv  Detail & Related papers  (2025-03-04T00:47:52Z)
- Better Think with Tables: Tabular Structures Enhance LLM Comprehension for Data-Analytics Requests [33.471112091886894]
 Large Language Models (LLMs) often struggle with data-analytics requests related to information retrieval and data manipulation.
We introduce Thinking with Tables, where we inject tabular structures into LLMs for data-analytics requests.
We show that providing tables yields a 40.29 percent average performance gain along with better manipulation and token efficiency.
 arXiv  Detail & Related papers  (2024-12-22T23:31:03Z)
- SynFinTabs: A Dataset of Synthetic Financial Tables for Information and Table Extraction [1.0624606551524207]
 Existing datasets often focus on scientific tables due to the vast amount of academic articles.
Current datasets often lack the words, and their positions, contained within the tables.
We present SynFinTabs, a large-scale, labelled dataset of synthetic financial tables.
 arXiv  Detail & Related papers  (2024-12-05T15:42:59Z)
- Tree-of-Table: Unleashing the Power of LLMs for Enhanced Large-Scale Table Understanding [42.841205217768106]
 "Tree-of-Table" is a novel approach designed to enhance LLMs' reasoning capabilities over large and complex tables.
We show that Tree-of-Table sets a new benchmark with superior performance, showcasing remarkable efficiency and generalization capabilities in large-scale table reasoning.
 arXiv  Detail & Related papers  (2024-11-13T11:02:04Z)
- Enhancing Table Representations with LLM-powered Synthetic Data Generation [0.565395466029518]
 We formulate a clear definition of table similarity in the context of data transformation activities within data-driven enterprises.
We propose a novel synthetic data generation pipeline that harnesses the code generation and data manipulation capabilities of Large Language Models.
We demonstrate that the synthetic data generated by our pipeline aligns with our proposed definition of table similarity and significantly enhances table representations.
 arXiv  Detail & Related papers  (2024-11-04T19:54:07Z)
- TableRAG: Million-Token Table Understanding with Language Models [53.039560091592215]
 TableRAG is a Retrieval-Augmented Generation (RAG) framework specifically designed for LM-based table understanding.
TableRAG leverages query expansion combined with schema and cell retrieval to pinpoint crucial information before providing it to the LMs.
Our results demonstrate that TableRAG achieves the highest retrieval quality, leading to the new state-of-the-art performance on large-scale table understanding.
 arXiv  Detail & Related papers  (2024-10-07T04:15:02Z)
- UniTabNet: Bridging Vision and Language Models for Enhanced Table Structure Recognition [55.153629718464565]
 We introduce UniTabNet, a novel framework for table structure parsing based on the image-to-text model.
UniTabNet employs a "divide-and-conquer" strategy, utilizing an image-to-text model to decouple table cells and integrating both physical and logical decoders to reconstruct the complete table structure.
 arXiv  Detail & Related papers  (2024-09-20T01:26:32Z)
- Latent Diffusion for Guided Document Table Generation [4.891597567642704]
 This research paper introduces a novel approach for generating annotated images for table structure.
The proposed method aims to enhance the quality of synthetic data used for training object detection models.
 Experimental results demonstrate that the introduced approach significantly improves the quality of synthetic data for training.
 arXiv  Detail & Related papers  (2024-08-19T08:46:16Z)
- Wiki-TabNER: Advancing Table Interpretation Through Named Entity Recognition [19.423556742293762]
 We analyse a widely used benchmark dataset for evaluation of TI tasks.
To overcome this drawback, we construct and annotate a new more challenging dataset.
We propose a prompting framework for evaluating the newly developed large language models.
 arXiv  Detail & Related papers  (2024-03-07T15:22:07Z)
- TAP4LLM: Table Provider on Sampling, Augmenting, and Packing Semi-structured Data for Large Language Model Reasoning [55.33939289989238]
 We propose TAP4LLM as a versatile pre-processor suite for leveraging large language models (LLMs) in table-based tasks effectively.
It covers several distinct components: (1) table sampling to decompose large tables into manageable sub-tables based on query semantics, (2) table augmentation to enhance tables with additional knowledge from external sources or models, and (3) table packing & serialization to convert tables into various formats suitable for LLMs' understanding.
 arXiv  Detail & Related papers  (2023-12-14T15:37:04Z)
- Privately generating tabular data using language models [80.67328256105891]
 Privately generating synthetic data from a table is an important brick of a privacy-first world.
We propose and investigate a simple approach of treating each row in a table as a sentence and training a language model with differential privacy.
 arXiv  Detail & Related papers  (2023-06-07T21:53:14Z)
- TCN: Table Convolutional Network for Web Table Interpretation [52.32515851633981]
 We propose a novel table representation learning approach considering both the intra- and inter-table contextual information.
Our method can outperform competitive baselines by +4.8% of F1 for column type prediction and by +4.1% of F1 for column pairwise relation prediction.
 arXiv  Detail & Related papers  (2021-02-17T02:18:10Z)
- GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing [117.98107557103877]
 We present GraPPa, an effective pre-training approach for table semantic parsing.
We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar.
To maintain the model's ability to represent real-world data, we also include masked language modeling.
 arXiv  Detail & Related papers  (2020-09-29T08:17:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.