Visual Text Meets Low-level Vision: A Comprehensive Survey on Visual
Text Processing
- URL: http://arxiv.org/abs/2402.03082v1
- Date: Mon, 5 Feb 2024 15:13:20 GMT
- Title: Visual Text Meets Low-level Vision: A Comprehensive Survey on Visual
Text Processing
- Authors: Yan Shu, Weichao Zeng, Zhenhang Li, Fangmin Zhao, Yu Zhou
- Abstract summary: The field of visual text processing has experienced a surge in research, driven by the advent of fundamental generative models.
We present a comprehensive, multi-perspective analysis of recent advancements in this field.
- Score: 4.057550183467041
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual text, a pivotal element in both document and scene images, speaks
volumes and attracts significant attention in the computer vision domain.
Beyond visual text detection and recognition, the field of visual text
processing has experienced a surge in research, driven by the advent of
fundamental generative models. However, challenges persist due to the unique
properties and features that distinguish text from general objects. Effectively
leveraging these unique textual characteristics is crucial in visual text
processing, as observed in our study. In this survey, we present a
comprehensive, multi-perspective analysis of recent advancements in this field.
Initially, we introduce a hierarchical taxonomy encompassing areas ranging from
text image enhancement and restoration to text image manipulation, followed by
different learning paradigms. Subsequently, we conduct an in-depth discussion
of how specific textual features such as structure, stroke, semantics, style,
and spatial context are seamlessly integrated into various tasks. Furthermore,
we explore available public datasets and benchmark the reviewed methods on
several widely-used datasets. Finally, we identify principal challenges and
potential avenues for future research. Our aim is to establish this survey as a
fundamental resource, fostering continued exploration and innovation in the
dynamic area of visual text processing.
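Where the survey benchmarks reviewed methods, text image enhancement and restoration quality is commonly reported with PSNR. Below is a minimal sketch of that metric in Python; the metric choice and toy data are illustrative, not the survey's exact protocol:

```python
import numpy as np

def psnr(reference: np.ndarray, restored: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between a ground-truth text image and a restored one."""
    mse = np.mean((reference.astype(np.float64) - restored.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10((max_val ** 2) / mse)

# Toy example: compare a degraded text image against the clean original.
clean = np.random.randint(0, 256, (64, 256), dtype=np.uint8)   # stand-in for a text image
noisy = np.clip(clean + np.random.normal(0, 10, clean.shape), 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(clean, noisy):.2f} dB")
```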
Related papers
- See then Tell: Enhancing Key Information Extraction with Vision Grounding [54.061203106565706]
We introduce STNet (See then Tell Net), a novel end-to-end model designed to deliver precise answers with relevant vision grounding.
To enhance the model's seeing capabilities, we collect extensive structured table recognition datasets.
arXiv Detail & Related papers (2024-09-29T06:21:05Z)
- Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown an exceptional capability to produce high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, using open-vocabulary text to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z)
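The entry above mentions learning multi-scale textual-visual features on top of a diffusion backbone. The paper's exact architecture is not reproduced here; this is a hedged, generic sketch of text-queried cross-attention over feature maps at several scales, where the module layout and dimensions are assumptions:

```python
import torch
import torch.nn as nn

class MultiScaleTextVisualFusion(nn.Module):
    """Illustrative fusion of a text embedding with visual feature maps at several scales.

    NOT the cited paper's architecture; it only sketches the general idea of
    querying multi-scale visual features with a text representation.
    """
    def __init__(self, text_dim: int = 512, visual_dims=(256, 512, 1024), heads: int = 8):
        super().__init__()
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(text_dim, heads, kdim=d, vdim=d, batch_first=True)
            for d in visual_dims
        )

    def forward(self, text_emb: torch.Tensor, feature_maps) -> torch.Tensor:
        # text_emb: (B, 1, text_dim); feature_maps[i]: (B, C_i, H_i, W_i)
        fused = []
        for attn, fmap in zip(self.attn, feature_maps):
            tokens = fmap.flatten(2).transpose(1, 2)   # (B, H*W, C_i) visual tokens
            out, _ = attn(text_emb, tokens, tokens)    # text queries visual tokens
            fused.append(out)
        return torch.cat(fused, dim=1).mean(dim=1)     # (B, text_dim)

model = MultiScaleTextVisualFusion()
text = torch.randn(2, 1, 512)
maps = [torch.randn(2, 256, 32, 32), torch.randn(2, 512, 16, 16), torch.randn(2, 1024, 8, 8)]
print(model(text, maps).shape)  # torch.Size([2, 512])
```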
- Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs [96.54224331778195]
We present a text-grounding document understanding model, termed TGDoc, which enhances MLLMs with the ability to discern the spatial positioning of text within images.
We formulate instruction tuning tasks including text detection, recognition, and spotting to facilitate the cohesive alignment between the visual encoder and large language model.
Our method achieves state-of-the-art performance across multiple text-rich benchmarks, validating its effectiveness.
arXiv Detail & Related papers (2023-11-22T06:46:37Z)
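TGDoc formulates text detection, recognition, and spotting as instruction-tuning tasks. The exact prompt templates are not given in the summary above; the sketch below shows one plausible way to assemble such samples, where the prompt wording and JSON schema are assumptions, not TGDoc's actual templates:

```python
from dataclasses import dataclass

@dataclass
class OcrAnnotation:
    text: str
    box: tuple  # (x1, y1, x2, y2) in image pixels

# Hypothetical templates; not TGDoc's published prompts.
TEMPLATES = {
    "detection":   "Locate all text regions in the image. Answer with bounding boxes.",
    "recognition": "Read the text inside the region {box}.",
    "spotting":    "Find every piece of text and give both its content and location.",
}

def build_sample(task: str, image_path: str, anns: list) -> dict:
    """Turn OCR annotations into one instruction-tuning sample."""
    if task == "detection":
        prompt = TEMPLATES[task]
        answer = "; ".join(str(a.box) for a in anns)
    elif task == "recognition":
        prompt = TEMPLATES[task].format(box=anns[0].box)
        answer = anns[0].text
    else:  # spotting: content plus location
        prompt = TEMPLATES[task]
        answer = "; ".join(f"{a.text} @ {a.box}" for a in anns)
    return {"image": image_path, "instruction": prompt, "response": answer}

anns = [OcrAnnotation("EXIT", (10, 20, 90, 60))]
print(build_sample("spotting", "doc_001.png", anns))
```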
- Unleashing the Imagination of Text: A Novel Framework for Text-to-image Person Retrieval via Exploring the Power of Words [0.951828574518325]
We propose a novel framework for text-to-image person retrieval that explores the power of individual words in sentences.
The framework employs the full pre-trained CLIP model as a dual encoder for images and texts.
We introduce a cross-modal triplet loss tailored for handling hard samples, enhancing the model's ability to distinguish minor differences.
arXiv Detail & Related papers (2023-07-18T08:23:46Z)
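The entry above pairs a CLIP dual encoder with a cross-modal triplet loss for hard samples. Here is a minimal sketch of such a loss on pre-computed image and text embeddings; the margin value and hardest-negative mining heuristic are common choices, not necessarily the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def cross_modal_triplet_loss(img_emb: torch.Tensor,
                             txt_emb: torch.Tensor,
                             margin: float = 0.2) -> torch.Tensor:
    """Triplet loss over matched image/text pairs with in-batch hardest negatives.

    img_emb, txt_emb: (B, D) embeddings where row i of each is a matched pair.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t()                       # (B, B) cosine similarities
    pos = sim.diag()                          # matched-pair similarities
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg_i2t = sim.masked_fill(mask, -2).max(dim=1).values  # hardest text per image
    neg_t2i = sim.masked_fill(mask, -2).max(dim=0).values  # hardest image per text
    loss = F.relu(margin + neg_i2t - pos) + F.relu(margin + neg_t2i - pos)
    return loss.mean()

print(cross_modal_triplet_loss(torch.randn(8, 512), torch.randn(8, 512)).item())
```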
- From Show to Tell: A Survey on Image Captioning [48.98681267347662]
Connecting Vision and Language plays an essential role in Generative Intelligence.
Research in image captioning has not yet reached a conclusive answer.
This work aims at providing a comprehensive overview and categorization of image captioning approaches.
arXiv Detail & Related papers (2021-07-14T18:00:54Z)
- Positioning yourself in the maze of Neural Text Generation: A Task-Agnostic Survey [54.34370423151014]
This paper surveys the components of modeling approaches and how task demands carry across various generation tasks such as storytelling, summarization, and translation.
We abstract the key techniques with respect to learning paradigms, pretraining, modeling approaches, and decoding, and highlight the outstanding challenges in each.
arXiv Detail & Related papers (2020-10-14T17:54:42Z)
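Of the components that survey enumerates, decoding is the most compact to illustrate. A minimal top-k sampling step over next-token logits follows; top-k is one standard decoding strategy, and this particular implementation is illustrative:

```python
import torch

def top_k_sample(logits: torch.Tensor, k: int = 50, temperature: float = 1.0) -> int:
    """Sample one token id from the top-k next-token distribution."""
    logits = logits / temperature
    topk_vals, topk_idx = torch.topk(logits, k)      # keep only the k best tokens
    probs = torch.softmax(topk_vals, dim=-1)         # renormalize over the top k
    choice = torch.multinomial(probs, num_samples=1)
    return topk_idx[choice].item()

vocab_size = 32000
logits = torch.randn(vocab_size)  # stand-in for a language model's output
print(top_k_sample(logits, k=10))
```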
- Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval [8.317191999275536]
This paper focuses on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval.
We employ a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between salient objects and text found in an image.
arXiv Detail & Related papers (2020-09-21T12:31:42Z)
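A hedged sketch of the kind of graph-convolution update such multi-modal reasoning uses: visual-object and scene-text nodes projected into a shared space and smoothed over an adjacency matrix. Node construction, dimensions, and normalization follow common practice, not necessarily the cited paper's exact design:

```python
import torch
import torch.nn as nn

class MultiModalGCNLayer(nn.Module):
    """One GCN layer over a graph whose nodes mix salient objects and scene text."""
    def __init__(self, visual_dim: int = 2048, text_dim: int = 300, hidden: int = 512):
        super().__init__()
        self.proj_v = nn.Linear(visual_dim, hidden)  # object features -> common space
        self.proj_t = nn.Linear(text_dim, hidden)    # word embeddings -> common space
        self.gcn = nn.Linear(hidden, hidden)

    def forward(self, obj_feats, txt_feats, adj):
        # obj_feats: (N_obj, visual_dim); txt_feats: (N_txt, text_dim)
        # adj: (N, N) adjacency over all N = N_obj + N_txt nodes
        nodes = torch.cat([self.proj_v(obj_feats), self.proj_t(txt_feats)], dim=0)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        return torch.relu(self.gcn((adj / deg) @ nodes))  # row-normalized propagation

layer = MultiModalGCNLayer()
objs, words = torch.randn(5, 2048), torch.randn(3, 300)
adj = torch.ones(8, 8)  # fully connected toy graph
print(layer(objs, words, adj).shape)  # torch.Size([8, 512])
```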
- Text Recognition in the Wild: A Survey [33.22076515689926]
This literature review attempts to present the entire picture of the field of scene text recognition.
It provides a comprehensive reference for newcomers to the field and may help inspire future research.
arXiv Detail & Related papers (2020-05-07T13:57:04Z)
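Many recognizers covered by such surveys decode with CTC. Below is a minimal greedy CTC decode (argmax per frame, collapse repeats, drop blanks), offered as an illustration of the technique rather than any specific paper's pipeline:

```python
import torch

def greedy_ctc_decode(log_probs: torch.Tensor, charset: str, blank: int = 0) -> str:
    """Greedy CTC decoding: argmax per frame, collapse repeats, remove blanks.

    log_probs: (T, C) per-frame class scores; charset maps indices 1..C-1 to characters.
    """
    best = log_probs.argmax(dim=-1).tolist()
    out, prev = [], blank
    for idx in best:
        if idx != blank and idx != prev:
            out.append(charset[idx - 1])  # index 0 is reserved for the blank token
        prev = idx
    return "".join(out)

charset = "abcdefghijklmnopqrstuvwxyz"
frames = torch.randn(20, len(charset) + 1)  # stand-in recognizer output
print(greedy_ctc_decode(frames, charset))
```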
- TextCaps: a Dataset for Image Captioning with Reading Comprehension [56.89608505010651]
Text is omnipresent in human environments and frequently critical to understanding our surroundings.
To study how to comprehend text in the context of an image we collect a novel dataset, TextCaps, with 145k captions for 28k images.
Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase.
arXiv Detail & Related papers (2020-03-24T02:38:35Z)
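The copy-or-paraphrase decision TextCaps probes is often realized with a pointer-style mechanism: at each step, the decoder scores both fixed vocabulary words and the OCR tokens detected in the image. A hedged sketch of one such head; the scoring scheme is a common design, not a model prescribed by the dataset:

```python
import torch
import torch.nn as nn

class CopyOrGenerateHead(nn.Module):
    """Score fixed vocabulary words and dynamic OCR tokens in one softmax space.

    Illustrative pointer-style head; not a specific published model.
    """
    def __init__(self, hidden: int = 512, vocab_size: int = 10000):
        super().__init__()
        self.vocab_head = nn.Linear(hidden, vocab_size)  # generate from vocabulary
        self.ocr_proj = nn.Linear(hidden, hidden)        # project OCR token features

    def forward(self, dec_state: torch.Tensor, ocr_feats: torch.Tensor) -> torch.Tensor:
        # dec_state: (B, hidden); ocr_feats: (B, N_ocr, hidden)
        vocab_logits = self.vocab_head(dec_state)                               # (B, V)
        copy_logits = torch.einsum("bh,bnh->bn", dec_state, self.ocr_proj(ocr_feats))
        return torch.cat([vocab_logits, copy_logits], dim=-1)  # joint (B, V + N_ocr)

head = CopyOrGenerateHead()
state, ocr = torch.randn(2, 512), torch.randn(2, 6, 512)
print(head(state, ocr).shape)  # torch.Size([2, 10006])
```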
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.