OCR-VQGAN: Taming Text-within-Image Generation
- URL: http://arxiv.org/abs/2210.11248v1
- Date: Wed, 19 Oct 2022 16:37:48 GMT
- Title: OCR-VQGAN: Taming Text-within-Image Generation
- Authors: Juan A. Rodriguez, David Vazquez, Issam Laradji, Marco Pedersoli, Pau Rodriguez
- Abstract summary: We present OCR-VQGAN, an image encoder and decoder that leverages OCR pre-trained features to optimize a text perceptual loss.
We demonstrate the effectiveness of OCR-VQGAN by conducting several experiments on the task of figure reconstruction.
- Score: 4.5718306968064635
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Synthetic image generation has recently experienced significant improvements
in domains such as natural image or art generation. However, the problem of
figure and diagram generation remains unexplored. A challenging aspect of
generating figures and diagrams is effectively rendering readable texts within
the images. To alleviate this problem, we present OCR-VQGAN, an image encoder
and decoder that leverages OCR pre-trained features to optimize a text
perceptual loss, encouraging the architecture to preserve high-fidelity text
and diagram structure. To explore our approach, we introduce the Paper2Fig100k
dataset, with over 100k images of figures and texts from research papers. The
figures show architecture diagrams and methodologies of articles available at
arXiv.org from fields like artificial intelligence and computer vision. Figures
usually include text and discrete objects, e.g., boxes in a diagram, with lines
and arrows that connect them. We demonstrate the effectiveness of OCR-VQGAN by
conducting several experiments on the task of figure reconstruction.
Additionally, we explore the qualitative and quantitative impact of weighting
different perceptual metrics in the overall loss function. We release code,
models, and dataset at https://github.com/joanrod/ocr-vqgan.
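To make the mechanism concrete, the sketch below shows one plausible way to implement a text perceptual loss on top of a frozen, pre-trained OCR feature extractor and to weight it against pixel and perceptual terms in an overall reconstruction objective. This is a minimal PyTorch illustration under stated assumptions (an OCR backbone that returns a list of intermediate feature maps, placeholder names such as `TextPerceptualLoss`, and arbitrary weights); it is not the authors' released implementation, which is available in the repository linked above.

```python
# Hedged sketch: a "text perceptual" feature-matching loss computed with a frozen
# OCR backbone, combined with pixel and perceptual terms via tunable weights.
# Layer indices, module names, and weights are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextPerceptualLoss(nn.Module):
    def __init__(self, ocr_backbone: nn.Module, feature_layers=(1, 2, 3)):
        super().__init__()
        # Assumed: the backbone returns a list of intermediate feature maps.
        self.backbone = ocr_backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad_(False)          # keep the OCR features fixed
        self.feature_layers = feature_layers

    def forward(self, recon: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        feats_r = self.backbone(recon)
        feats_t = self.backbone(target)
        loss = recon.new_zeros(())
        for i in self.feature_layers:
            # Match OCR features of the reconstruction to those of the target,
            # encouraging legible text and sharp diagram strokes.
            loss = loss + F.l1_loss(feats_r[i], feats_t[i])
        return loss / len(self.feature_layers)

def reconstruction_objective(recon, target, perceptual_loss, text_loss,
                             w_pixel=1.0, w_perceptual=1.0, w_text=1.0):
    """Weighted sum of pixel, perceptual, and text-perceptual terms.
    perceptual_loss is any callable image perceptual metric (e.g. an LPIPS-style
    loss); the weights are hyperparameters."""
    return (w_pixel * F.l1_loss(recon, target)
            + w_perceptual * perceptual_loss(recon, target).mean()
            + w_text * text_loss(recon, target))
```

Varying the relative weights (e.g., increasing `w_text`) is one way to realize the weighting of different perceptual metrics whose qualitative and quantitative impact the abstract says the paper explores.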
Related papers
- See then Tell: Enhancing Key Information Extraction with Vision Grounding [54.061203106565706]
We introduce STNet (See then Tell Net), a novel end-to-end model designed to deliver precise answers with relevant vision grounding.
To enhance the model's seeing capabilities, we collect extensive structured table recognition datasets.
arXiv Detail & Related papers (2024-09-29T06:21:05Z)
- Textual Inversion and Self-supervised Refinement for Radiology Report Generation [25.779160968864435]
We propose Textual Inversion and Self-supervised Refinement (TISR) for generating radiology reports.
TISR projects text and image into the same space by representing images as pseudo words to eliminate the cross-modeling gap.
We conduct experiments on two widely-used public datasets and achieve significant improvements on various baselines.
arXiv Detail & Related papers (2024-05-31T03:47:44Z)
- Text Image Inpainting via Global Structure-Guided Diffusion Models [22.859984320894135]
Real-world text can be damaged by corrosion issues caused by environmental or human factors.
Current inpainting techniques often fail to adequately address this problem.
We develop a novel neural framework, Global Structure-guided Diffusion Model (GSDM), as a potential solution.
arXiv Detail & Related papers (2024-01-26T13:01:28Z)
- Benchmarking Robustness of Text-Image Composed Retrieval [46.98557472744255]
Text-image composed retrieval aims to retrieve the target image through the composed query.
It has recently attracted attention due to its ability to leverage both information-rich images and concise language.
However, the robustness of these approaches against real-world corruptions or further text understanding has never been studied.
arXiv Detail & Related papers (2023-11-24T20:16:38Z)
- Re-Imagen: Retrieval-Augmented Text-to-Image Generator [58.60472701831404]
Retrieval-Augmented Text-to-Image Generator (Re-Imagen)
arXiv Detail & Related papers (2022-09-29T00:57:28Z)
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
arXiv Detail & Related papers (2022-07-26T19:35:49Z)
- Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization [81.26077816854449]
We first explore the use of constituency parse trees for encoding structured input.
Second, we augment the structured input with commonsense information and study the impact of this external knowledge on visual story generation.
Third, we incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images.
arXiv Detail & Related papers (2021-10-21T00:16:02Z)
- DAE-GAN: Dynamic Aspect-aware GAN for Text-to-Image Synthesis [55.788772366325105]
We propose a Dynamic Aspect-awarE GAN (DAE-GAN) that represents text information comprehensively from multiple granularities, including sentence-level, word-level, and aspect-level.
Inspired by human learning behaviors, we develop a novel Aspect-aware Dynamic Re-drawer (ADR) for image refinement, in which an Attended Global Refinement (AGR) module and an Aspect-aware Local Refinement (ALR) module are alternately employed.
arXiv Detail & Related papers (2021-08-27T07:20:34Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
- RTIC: Residual Learning for Text and Image Composition using Graph Convolutional Network [19.017377597937617]
We study the compositional learning of images and texts for image retrieval.
We introduce a novel method that combines the graph convolutional network (GCN) with existing composition methods.
arXiv Detail & Related papers (2021-04-07T09:41:52Z)
- PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks [5.210482046387142]
Key Information Extraction from documents remains a challenge.
We introduce PICK, a framework that is effective and robust in handling complex documents layout for KIE.
Our method outperforms baseline methods by significant margins.
arXiv Detail & Related papers (2020-04-16T05:20:16Z)