Combining Visual and Textual Features for Semantic Segmentation of
Historical Newspapers
- URL: http://arxiv.org/abs/2002.06144v4
- Date: Mon, 14 Dec 2020 16:56:29 GMT
- Title: Combining Visual and Textual Features for Semantic Segmentation of
Historical Newspapers
- Authors: Raphaël Barman, Maud Ehrmann, Simon Clematide, Sofia Ares Oliveira,
Frédéric Kaplan
- Abstract summary: We introduce a multimodal approach for the semantic segmentation of historical newspapers.
Based on experiments on diachronic Swiss and Luxembourgish newspapers, we investigate the predictive power of visual and textual features.
Results show consistent improvement of multimodal models in comparison to a strong visual baseline.
- Score: 2.5899040911480187
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The massive amounts of digitized historical documents acquired over the last
decades naturally lend themselves to automatic processing and exploration.
Research efforts seeking to automatically process facsimiles and extract
information from them are multiplying, with document layout analysis as a
first, essential step. While the identification and categorization of segments
of interest in document images have seen significant progress in recent years
thanks to deep learning techniques, many challenges remain, among them the use
of finer-grained segmentation typologies and the handling of complex,
heterogeneous documents such as historical newspapers. Moreover, most
approaches consider visual features only, ignoring the textual signal. In this
context, we introduce a multimodal approach for the semantic segmentation of
historical newspapers that combines visual and textual features. Based on a
series of experiments on diachronic Swiss and Luxembourgish newspapers, we
investigate, among others, the predictive power of visual and textual features
and their capacity to generalize across time and sources. Results show
consistent improvement of multimodal models in comparison to a strong visual
baseline, as well as better robustness to high material variance.
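To make the combination of visual and textual features concrete, here is a minimal sketch (assuming PyTorch) of one common fusion strategy: rasterize OCR token embeddings onto a 2D map aligned with the page image and concatenate it with the RGB channels before a standard segmentation network. The embedding dimension, the toy backbone, and the early-fusion point are illustrative assumptions, not the paper's exact architecture.

```python
# Hypothetical sketch of visual-textual fusion for page segmentation.
# Shapes, embedding size, and the early-fusion choice are assumptions.
import torch
import torch.nn as nn


class TextEmbeddingMap(nn.Module):
    """Rasterizes per-token embeddings onto a 2D grid aligned with the page."""

    def __init__(self, embedding_dim: int = 32):
        super().__init__()
        self.embedding_dim = embedding_dim

    def forward(self, tokens, page_hw):
        # tokens: list of (embedding, (x0, y0, x1, y1)) pairs in pixel coords
        h, w = page_hw
        grid = torch.zeros(self.embedding_dim, h, w)
        for emb, (x0, y0, x1, y1) in tokens:
            # broadcast the token embedding over its bounding box
            grid[:, y0:y1, x0:x1] = emb.view(-1, 1, 1)
        return grid


class MultimodalSegmenter(nn.Module):
    """Early-fusion baseline: concatenate RGB and text-embedding channels."""

    def __init__(self, embedding_dim: int = 32, num_classes: int = 4):
        super().__init__()
        # stand-in for a full U-Net-style encoder-decoder
        self.backbone = nn.Sequential(
            nn.Conv2d(3 + embedding_dim, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, num_classes, kernel_size=1),
        )

    def forward(self, image, text_map):
        # image: (B, 3, H, W), text_map: (B, embedding_dim, H, W)
        x = torch.cat([image, text_map], dim=1)
        return self.backbone(x)  # per-pixel class logits (B, num_classes, H, W)
```

Injecting the text map deeper in the network (late fusion) is an equally plausible variant; the abstract states that the paper investigates the predictive power of the visual and textual signals rather than prescribing a single fusion point.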
Related papers
- Unified Multi-Modal Interleaved Document Representation for Information Retrieval [57.65409208879344]
We produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities.
Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation.
arXiv Detail & Related papers (2024-10-03T17:49:09Z) - Leveraging Open-Vocabulary Diffusion to Camouflaged Instance
Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by an open vocabulary to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z) - Prompt me a Dataset: An investigation of text-image prompting for
historical image dataset creation using foundation models [0.9065034043031668]
We present a pipeline for image extraction from historical documents using foundation models.
We evaluate text-image prompts and their effectiveness on humanities datasets of varying levels of complexity.
arXiv Detail & Related papers (2023-09-04T15:37:03Z) - The Learnable Typewriter: A Generative Approach to Text Analysis [17.355857281085164]
We present a generative document-specific approach to character analysis and recognition in text lines.
Taking as input a set of text lines with similar font or handwriting, our approach can learn a large number of different characters.
arXiv Detail & Related papers (2023-02-03T11:17:59Z) - Holistic Visual-Textual Sentiment Analysis with Prior Models [64.48229009396186]
We propose a holistic method that achieves robust visual-textual sentiment analysis.
The proposed method consists of four parts: (1) a visual-textual branch to learn features directly from data for sentiment analysis, (2) a visual expert branch with a set of pre-trained "expert" encoders to extract selected semantic visual features, (3) a CLIP branch to implicitly model visual-textual correspondence, and (4) a multimodal feature fusion network based on BERT to fuse multimodal features and make sentiment predictions.
arXiv Detail & Related papers (2022-11-23T14:40:51Z) - TRIE++: Towards End-to-End Information Extraction from Visually Rich
Documents [51.744527199305445]
This paper proposes a unified end-to-end information extraction framework from visually rich documents.
Text reading and information extraction can reinforce each other via a well-designed multi-modal context block.
The framework can be trained in an end-to-end manner, achieving global optimization.
arXiv Detail & Related papers (2022-07-14T08:52:07Z) - Where Does the Performance Improvement Come From? - A Reproducibility
Concern about Image-Text Retrieval [85.03655458677295]
Image-text retrieval has gradually become a major research direction in the field of information retrieval.
We first examine the related concerns and why the focus is on image-text retrieval tasks.
We analyze various aspects of the reproduction of pretrained and non-pretrained retrieval models.
arXiv Detail & Related papers (2022-03-08T05:01:43Z) - Including Keyword Position in Image-based Models for Act Segmentation of
Historical Registers [2.064923532131528]
We focus on the use of both visual and textual information for segmenting historical registers into structured and meaningful units such as acts.
An act is a text recording containing valuable knowledge such as demographic information (baptism, marriage or death) or royal decisions (donation or pardon).
arXiv Detail & Related papers (2021-09-17T11:38:34Z) - From Show to Tell: A Survey on Image Captioning [48.98681267347662]
Connecting Vision and Language plays an essential role in Generative Intelligence.
Research in image captioning has not reached a conclusive answer yet.
This work aims at providing a comprehensive overview and categorization of image captioning approaches.
arXiv Detail & Related papers (2021-07-14T18:00:54Z) - The Devil is in the Details: Evaluating Limitations of Transformer-based
Methods for Granular Tasks [19.099852869845495]
Contextual embeddings derived from transformer-based neural language models have shown state-of-the-art performance for various tasks.
We focus on the problem of textual similarity from two perspectives: matching documents on a granular level, and an abstract level.
We empirically demonstrate, across two datasets from different domains, that despite the expected high performance on abstract document matching, contextual embeddings are consistently (and at times vastly) outperformed by simple baselines such as TF-IDF on more granular tasks; a minimal TF-IDF sketch follows this entry.
arXiv Detail & Related papers (2020-11-02T18:41:32Z)
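As a minimal illustration of the kind of TF-IDF baseline referred to in the previous entry, the following sketch (assuming scikit-learn; the example documents are invented) ranks documents by cosine similarity of their TF-IDF vectors.

```python
# Minimal TF-IDF baseline for document similarity.
# The example documents are invented; scikit-learn is assumed to be available.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Semantic segmentation of historical newspaper pages.",
    "Layout analysis of digitized newspapers with deep learning.",
    "Sentiment analysis of product reviews.",
]

tfidf = TfidfVectorizer(lowercase=True, stop_words="english")
matrix = tfidf.fit_transform(docs)             # sparse (n_docs, vocab) matrix
scores = cosine_similarity(matrix[0], matrix)  # similarity of doc 0 to all docs
print(scores.round(2))  # doc 0 is closer to doc 1 than to doc 2
```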