Related papers: Towards Automatic Parsing of Structured Visual Content through the Use of Synthetic Data

Towards Automatic Parsing of Structured Visual Content through the Use of Synthetic Data

URL: http://arxiv.org/abs/2204.14136v1
Date: Fri, 29 Apr 2022 14:44:52 GMT
Title: Towards Automatic Parsing of Structured Visual Content through the Use of Synthetic Data
Authors: Lukas Scholch, Jonas Steinhauser, Maximilian Beichter, Constantin Seibold, Kailun Yang, Merlin Kn\"able, Thorsten Schwarz, Alexander M\"adche, and Rainer Stiefelhagen
Abstract summary: We propose a synthetic dataset, containing Structured Visual Content (SVCs) in the form of images and ground truths. We show the usage of this dataset by an application that automatically extracts a graph representation from an SVC image. Our dataset enables the development of strong models for the interpretation of SVCs while skipping the time-consuming dense data annotation.
Score: 65.68384124394699
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Structured Visual Content (SVC) such as graphs, flow charts, or the like are used by authors to illustrate various concepts. While such depictions allow the average reader to better understand the contents, images containing SVCs are typically not machine-readable. This, in turn, not only hinders automated knowledge aggregation, but also the perception of displayed in-formation for visually impaired people. In this work, we propose a synthetic dataset, containing SVCs in the form of images as well as ground truths. We show the usage of this dataset by an application that automatically extracts a graph representation from an SVC image. This is done by training a model via common supervised learning methods. As there currently exist no large-scale public datasets for the detailed analysis of SVC, we propose the Synthetic SVC (SSVC) dataset comprising 12,000 images with respective bounding box annotations and detailed graph representations. Our dataset enables the development of strong models for the interpretation of SVCs while skipping the time-consuming dense data annotation. We evaluate our model on both synthetic and manually annotated data and show the transferability of synthetic to real via various metrics, given the presented application. Here, we evaluate that this proof of concept is possible to some extend and lay down a solid baseline for this task. We discuss the limitations of our approach for further improvements. Our utilized metrics can be used as a tool for future comparisons in this domain. To enable further research on this task, the dataset is publicly available at https://bit.ly/3jN1pJJ

Related papers

RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning [63.599057862999]
RefChartQA is a novel benchmark that integrates Chart Question Answering (ChartQA) with visual grounding. Our experiments demonstrate that incorporating spatial awareness via grounding improves response accuracy by over 15%.
arXiv Detail & Related papers (2025-03-29T15:50:08Z)
Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data. We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation. Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
Diffusion Models as Data Mining Tools [87.77999285241219]
This paper demonstrates how to use generative models trained for image synthesis as tools for visual data mining. We show that after finetuning conditional diffusion models to synthesize images from a specific dataset, we can use these models to define a typicality measure. This measure assesses how typical visual elements are for different data labels, such as geographic location, time stamps, semantic labels, or even the presence of a disease.
arXiv Detail & Related papers (2024-07-20T17:14:31Z)
Wiki Entity Summarization Benchmark [9.25319552487389]
Entity summarization aims to compute concise summaries for entities in knowledge graphs. Existing datasets and benchmarks are often limited to a few hundred entities. We propose WikES, a comprehensive benchmark comprising of entities, their summaries, and their connections.
arXiv Detail & Related papers (2024-06-12T17:22:00Z)
Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models [31.69213233651326]
We introduce the novel task of Visual Data-Type Identification. An extensive zero-shot evaluation of 39 vision-language models (VLMs) shows a nuanced performance landscape.
arXiv Detail & Related papers (2023-10-12T17:59:30Z)
DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models [61.906934570771256]
We present a generic dataset generation model that can produce diverse synthetic images and perception annotations. Our method builds upon the pre-trained diffusion model and extends text-guided image synthesis to perception data generation. We show that the rich latent code of the diffusion model can be effectively decoded as accurate perception annotations using a decoder module.
arXiv Detail & Related papers (2023-08-11T14:38:11Z)
Learnable Graph Matching: A Practical Paradigm for Data Association [74.28753343714858]
We propose a general learnable graph matching method to address these issues. Our method achieves state-of-the-art performance on several MOT datasets. For image matching, our method outperforms state-of-the-art methods on a popular indoor dataset, ScanNet.
arXiv Detail & Related papers (2023-03-27T17:39:00Z)
Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images. We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities. The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
Explaining Deep Convolutional Neural Networks via Latent Visual-Semantic Filter Attention [7.237370981736913]
We propose a framework to teach any existing convolutional neural network to generate text descriptions about its own latent representations at the filter level. We show that our method can generate novel descriptions for learned filters beyond the set of categories defined in the training dataset. We also demonstrate a novel application of our method for unsupervised dataset bias analysis.
arXiv Detail & Related papers (2022-04-10T04:57:56Z)
Beyond Accuracy: A Consolidated Tool for Visual Question Answering Benchmarking [30.155625852894797]
We propose a browser-based benchmarking tool for researchers and challenge organizers. Our tool helps test generalization capabilities of models across multiple datasets. Interactive filtering facilitates discovery of problematic behavior.
arXiv Detail & Related papers (2021-10-11T11:08:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.