Towards Automatic Parsing of Structured Visual Content through the Use
of Synthetic Data
- URL: http://arxiv.org/abs/2204.14136v1
- Date: Fri, 29 Apr 2022 14:44:52 GMT
- Title: Towards Automatic Parsing of Structured Visual Content through the Use
of Synthetic Data
- Authors: Lukas Scholch, Jonas Steinhauser, Maximilian Beichter, Constantin
Seibold, Kailun Yang, Merlin Knäble, Thorsten Schwarz, Alexander Mädche,
and Rainer Stiefelhagen
- Abstract summary: We propose a synthetic dataset, containing Structured Visual Content (SVCs) in the form of images and ground truths.
We show the usage of this dataset by an application that automatically extracts a graph representation from an SVC image.
Our dataset enables the development of strong models for the interpretation of SVCs while skipping the time-consuming dense data annotation.
- Score: 65.68384124394699
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Structured Visual Content (SVC) such as graphs, flow charts, or the like is
used by authors to illustrate various concepts. While such depictions allow the
average reader to better understand the contents, images containing SVCs are
typically not machine-readable. This, in turn, not only hinders automated
knowledge aggregation, but also the perception of displayed information for
visually impaired people. In this work, we propose a synthetic dataset,
containing SVCs in the form of images as well as ground truths. We show the
usage of this dataset by an application that automatically extracts a graph
representation from an SVC image. This is done by training a model via common
supervised learning methods. As there currently exist no large-scale public
datasets for the detailed analysis of SVC, we propose the Synthetic SVC (SSVC)
dataset comprising 12,000 images with respective bounding box annotations and
detailed graph representations. Our dataset enables the development of strong
models for the interpretation of SVCs while skipping the time-consuming dense
data annotation. We evaluate our model on both synthetic and manually annotated
data and show the transferability of synthetic to real via various metrics,
given the presented application. Here, we find that this proof of concept
is feasible to some extent and lay down a solid baseline for this task. We
discuss the limitations of our approach for further improvements. Our utilized
metrics can be used as a tool for future comparisons in this domain. To enable
further research on this task, the dataset is publicly available at
https://bit.ly/3jN1pJJ
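The abstract mentions bounding box annotations, graph representations, and evaluation metrics without giving their exact form. The following Python sketch illustrates one way such data could be represented and compared. Everything in it is an assumption made for illustration: the names Node, SvcGraph, iou, match_nodes, and edge_precision_recall, the (x_min, y_min, x_max, y_max) box layout, the 0.5 IoU threshold, and the greedy matching are not taken from the paper or from the SSVC dataset files.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)


@dataclass(frozen=True)
class Node:
    node_id: int
    label: str
    box: Box


@dataclass
class SvcGraph:
    """Hypothetical graph annotation: nodes with boxes plus directed edges."""
    nodes: List[Node] = field(default_factory=list)
    edges: Set[Tuple[int, int]] = field(default_factory=set)  # (source_id, target_id)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_nodes(pred: SvcGraph, truth: SvcGraph, threshold: float = 0.5) -> Dict[int, int]:
    """Greedily map each predicted node to the best unused ground-truth node."""
    mapping: Dict[int, int] = {}
    used: Set[int] = set()
    for p in pred.nodes:
        best_id, best_score = None, threshold
        for t in truth.nodes:
            score = iou(p.box, t.box)
            if t.node_id not in used and score >= best_score:
                best_id, best_score = t.node_id, score
        if best_id is not None:
            mapping[p.node_id] = best_id
            used.add(best_id)
    return mapping


def edge_precision_recall(pred: SvcGraph, truth: SvcGraph,
                          threshold: float = 0.5) -> Tuple[float, float]:
    """Edge-level precision and recall after mapping predicted nodes to ground truth."""
    mapping = match_nodes(pred, truth, threshold)
    mapped = {(mapping[s], mapping[t]) for s, t in pred.edges
              if s in mapping and t in mapping}
    correct = len(mapped & truth.edges)
    precision = correct / len(pred.edges) if pred.edges else 0.0
    recall = correct / len(truth.edges) if truth.edges else 0.0
    return precision, recall


# Toy flow chart with three boxes; the prediction misplaces the last node.
truth = SvcGraph(
    nodes=[Node(0, "Start", (10, 10, 110, 60)),
           Node(1, "Process", (10, 100, 110, 150)),
           Node(2, "End", (10, 190, 110, 240))],
    edges={(0, 1), (1, 2)},
)
pred = SvcGraph(
    nodes=[Node(10, "Start", (12, 12, 108, 58)),
           Node(11, "Process", (10, 100, 110, 150)),
           Node(12, "End", (200, 200, 300, 250))],
    edges={(10, 11), (11, 12)},
)

print(edge_precision_recall(pred, truth))  # (0.5, 0.5)
```

In this toy case two of the three predicted nodes overlap a ground-truth box well enough to be matched, so only one of the two predicted edges can be confirmed. A real evaluation would follow the metrics defined in the paper itself.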
Related papers
- Visual Data-Type Understanding does not emerge from Scaling
Vision-Language Models [31.69213233651326]
We introduce the novel task of Visual Data-Type Identification.
An extensive zero-shot evaluation of 39 vision-language models (VLMs) shows a nuanced performance landscape.
arXiv Detail & Related papers (2023-10-12T17:59:30Z)
- DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion
Models [61.906934570771256]
We present a generic dataset generation model that can produce diverse synthetic images and perception annotations.
Our method builds upon the pre-trained diffusion model and extends text-guided image synthesis to perception data generation.
We show that the rich latent code of the diffusion model can be effectively decoded as accurate perception annotations using a decoder module.
arXiv Detail & Related papers (2023-08-11T14:38:11Z)
- Learnable Graph Matching: A Practical Paradigm for Data Association [74.28753343714858]
We propose a general learnable graph matching method to address these issues.
Our method achieves state-of-the-art performance on several MOT datasets.
For image matching, our method outperforms state-of-the-art methods on a popular indoor dataset, ScanNet.
arXiv Detail & Related papers (2023-03-27T17:39:00Z)
- Modeling Entities as Semantic Points for Visual Information Extraction
in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images.
We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities.
The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
- Understanding ME? Multimodal Evaluation for Fine-grained Visual
Commonsense [98.70218717851665]
It is unclear whether the models really understand the visual scene and underlying commonsense knowledge due to limited evaluation data resources.
We present a Multimodal Evaluation (ME) pipeline to automatically generate question-answer pairs to test models' understanding of the visual scene, text, and related knowledge.
We then take a step further to show that training with the ME data boosts the model's performance in standard VCR evaluation.
arXiv Detail & Related papers (2022-11-10T21:44:33Z)
- Explaining Deep Convolutional Neural Networks via Latent Visual-Semantic
Filter Attention [7.237370981736913]
We propose a framework to teach any existing convolutional neural network to generate text descriptions about its own latent representations at the filter level.
We show that our method can generate novel descriptions for learned filters beyond the set of categories defined in the training dataset.
We also demonstrate a novel application of our method for unsupervised dataset bias analysis.
arXiv Detail & Related papers (2022-04-10T04:57:56Z)
- MSeg: A Composite Dataset for Multi-domain Semantic Segmentation [100.17755160696939]
We present MSeg, a composite dataset that unifies semantic segmentation datasets from different domains.
We reconcile the generalization and bring the pixel-level annotations into alignment by relabeling more than 220,000 object masks in more than 80,000 images.
A model trained on MSeg ranks first on the WildDash-v1 leaderboard for robust semantic segmentation, with no exposure to WildDash data during training.
arXiv Detail & Related papers (2021-12-27T16:16:35Z)
- Beyond Accuracy: A Consolidated Tool for Visual Question Answering
Benchmarking [30.155625852894797]
We propose a browser-based benchmarking tool for researchers and challenge organizers.
Our tool helps test generalization capabilities of models across multiple datasets.
Interactive filtering facilitates discovery of problematic behavior.
arXiv Detail & Related papers (2021-10-11T11:08:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.