Pho(SC)Net: An Approach Towards Zero-shot Word Image Recognition in
Historical Documents
- URL: http://arxiv.org/abs/2105.15093v1
- Date: Mon, 31 May 2021 16:22:33 GMT
- Title: Pho(SC)Net: An Approach Towards Zero-shot Word Image Recognition in
Historical Documents
- Authors: Anuj Rai, Narayanan C. Krishnan, and Sukalpa Chanda
- Abstract summary: Zero-shot learning methods could aptly be used to recognize unseen/out-of-lexicon words in historical document images.
We propose a hybrid representation that considers the shape appearance of characters to differentiate between two different words.
Experiments were conducted to examine the effectiveness of an embedding that has properties of both PHOS and PHOC.
- Score: 2.502407331311937
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Annotating words in a historical document image archive for word image
recognition purposes demands time and skilled human resources (such as
historians and paleographers). In a real-life scenario, obtaining sample images
for all possible words is also not feasible. However, zero-shot learning
methods could aptly be used to recognize unseen/out-of-lexicon words in such
historical document images. Building on previous state-of-the-art methods for
word spotting and recognition, we propose a hybrid representation that
considers the shape appearance of characters to differentiate between two
different words and has been shown to be more effective in recognizing unseen
words. This representation, termed the Pyramidal Histogram of Shapes (PHOS), is
derived from PHOC, which embeds information about the occurrence and position
of characters in the word. The two representations are then combined, and
experiments were conducted to examine the effectiveness of an embedding that
has properties of both PHOS and PHOC. Encouraging results were obtained on two
publicly available historical document datasets and one synthetic handwritten
dataset, which justifies the efficacy of "PHOS" and the combined "Pho(SC)" representation.
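To make the representations concrete, below is a minimal sketch (not the authors' code) of how a PHOC-style binary vector can be computed and concatenated with a PHOS-style shape vector to form a combined Pho(SC)-like embedding. The alphabet, pyramid levels, and the 50%-overlap assignment rule follow the common PHOC formulation and are illustrative assumptions, not details taken from this abstract.

```python
# Sketch of a PHOC-style embedding and its concatenation with a PHOS-style
# shape vector. Alphabet, pyramid levels, and overlap rule are assumptions.
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"   # assumed character set
LEVELS = (2, 3, 4, 5)                                # assumed pyramid splits


def phoc(word, alphabet=ALPHABET, levels=LEVELS):
    """For each pyramid region, mark which characters occur in that region."""
    word = word.lower()
    regions = []
    for level in levels:
        for r in range(level):
            lo, hi = r / level, (r + 1) / level      # region span in [0, 1]
            slot = np.zeros(len(alphabet))
            for i, ch in enumerate(word):
                if ch not in alphabet:
                    continue
                c_lo, c_hi = i / len(word), (i + 1) / len(word)
                overlap = min(hi, c_hi) - max(lo, c_lo)
                # assign the character if the region covers >= half of its span
                if overlap >= 0.5 * (c_hi - c_lo):
                    slot[alphabet.index(ch)] = 1.0
            regions.append(slot)
    return np.concatenate(regions)


def pho_sc(phos_vec, phoc_vec):
    """Combined embedding: shape histogram and character histogram side by side."""
    return np.concatenate([phos_vec, phoc_vec])


# Zero-shot recognition then reduces to nearest-neighbour search: the predicted
# Pho(SC) vector for a word image is matched against the embeddings of all
# candidate (possibly unseen) lexicon words, e.g. by cosine similarity.
print(phoc("word").shape)   # (2+3+4+5) regions x 36 characters -> (504,)
```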
Related papers
- Visual Analytics for Efficient Image Exploration and User-Guided Image Captioning [35.47078178526536]
Recent advancements in pre-trained large-scale language-image models have ushered in a new era of visual comprehension.
This paper tackles two well-known issues within the realm of visual analytics: (1) the efficient exploration of large-scale image datasets and identification of potential data biases within them; (2) the evaluation of image captions and steering of their generation process.
arXiv Detail & Related papers (2023-11-02T06:21:35Z)
- WordStylist: Styled Verbatim Handwritten Text Generation with Latent Diffusion Models [8.334487584550185]
We present a latent diffusion-based method for styled text-to-text-content image generation at the word level.
Our proposed method is able to generate realistic word image samples from different writer styles.
We show that the proposed model produces samples that are aesthetically pleasing, help boost text recognition performance, and achieve a writer retrieval score similar to that of real data.
arXiv Detail & Related papers (2023-03-29T10:19:26Z)
- I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification [123.90912800376039]
Online textual documents, e.g., Wikipedia, contain rich visual descriptions about object classes.
We propose I2DFormer, a novel transformer-based ZSL framework that jointly learns to encode images and documents.
Our method leads to highly interpretable results where document words can be grounded in the image regions.
arXiv Detail & Related papers (2022-09-21T12:18:31Z)
- Reading and Writing: Discriminative and Generative Modeling for Self-Supervised Text Recognition [101.60244147302197]
We introduce contrastive learning and masked image modeling to learn discrimination and generation of text images.
Our method outperforms previous self-supervised text recognition methods by 10.2%-20.2% on irregular scene text recognition datasets.
Our proposed text recognizer exceeds previous state-of-the-art text recognition methods by an average of 5.3% on 11 benchmarks, with a similar model size.
arXiv Detail & Related papers (2022-07-01T03:50:26Z)
- Knowledge Mining with Scene Text for Fine-Grained Recognition [53.74297368412834]
We propose an end-to-end trainable network that mines implicit contextual knowledge behind scene text images.
We employ KnowBert to retrieve relevant knowledge for semantic representation and combine it with image features for fine-grained classification.
Our method outperforms the state of the art by 3.72% mAP and 5.39% mAP on the respective benchmarks.
arXiv Detail & Related papers (2022-03-27T05:54:00Z)
- Evaluating language-biased image classification based on semantic representations [13.508894957080777]
Humans show language-biased image recognition for a word-embedded image, known as picture-word interference.
Similar to humans, recent artificial models jointly trained on texts and images, e.g., OpenAI CLIP, show language-biased image classification.
arXiv Detail & Related papers (2022-01-26T15:46:36Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that our proposed method maintains robust performance and gives more flexible scores to candidate captions when encountering semantically similar expressions or less aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
- Consensus-Aware Visual-Semantic Embedding for Image-Text Matching [69.34076386926984]
Image-text matching plays a central role in bridging vision and language.
Most existing approaches only rely on the image-text instance pair to learn their representations.
We propose a Consensus-aware Visual-Semantic Embedding model to incorporate the consensus information.
arXiv Detail & Related papers (2020-07-17T10:22:57Z) - Learning to Recognise Words using Visually Grounded Speech [15.972015648122914]
The model has been trained on pairs of images and spoken captions to create visually grounded embeddings.
We investigate whether such a model can be used to recognise words by embedding isolated words and using them to retrieve images of their visual referents.
arXiv Detail & Related papers (2020-05-31T12:48:37Z) - Grounded and Controllable Image Completion by Incorporating Lexical
Semantics [111.47374576372813]
Lexical Semantic Image Completion (LSIC) may have potential applications in art, design, and heritage conservation.
We advocate generating results faithful to both visual and lexical semantic context.
One major challenge for LSIC comes from modeling and aligning the structure of visual-semantic context.
arXiv Detail & Related papers (2020-02-29T16:54:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.