Segmenting Messy Text: Detecting Boundaries in Text Derived from
Historical Newspaper Images
- URL: http://arxiv.org/abs/2312.12773v1
- Date: Wed, 20 Dec 2023 05:17:06 GMT
- Title: Segmenting Messy Text: Detecting Boundaries in Text Derived from
Historical Newspaper Images
- Authors: Carol Anderson and Phil Crone (Ancestry.com)
- Abstract summary: We consider a challenging text segmentation task: dividing newspaper marriage announcement lists into units of one announcement each.
In many cases the information is not structured into sentences, and adjacent segments are not topically distinct from each other.
We present a novel deep learning-based model for segmenting such text and show that it significantly outperforms an existing state-of-the-art method on our task.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Text segmentation, the task of dividing a document into sections, is often a
prerequisite for performing additional natural language processing tasks.
Existing text segmentation methods have typically been developed and tested
using clean, narrative-style text with segments containing distinct topics.
Here we consider a challenging text segmentation task: dividing newspaper
marriage announcement lists into units of one announcement each. In many cases
the information is not structured into sentences, and adjacent segments are not
topically distinct from each other. In addition, the text of the announcements,
which is derived from images of historical newspapers via optical character
recognition, contains many typographical errors. As a result, these
announcements are not amenable to segmentation with existing techniques. We
present a novel deep learning-based model for segmenting such text and show
that it significantly outperforms an existing state-of-the-art method on our
task.
Related papers
- WAS: Dataset and Methods for Artistic Text Segmentation [57.61335995536524]
This paper focuses on the more challenging task of artistic text segmentation and constructs a real artistic text segmentation dataset.
We propose a decoder with the layer-wise momentum query to prevent the model from ignoring stroke regions of special shapes.
We also propose a skeleton-assisted head to guide the model to focus on the global structure.
arXiv Detail & Related papers (2024-07-31T18:29:36Z) - Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation [28.24883865053459]
This paper aims to learn a model capable of segmenting arbitrary visual concepts within images by using only image-text pairs without dense annotations.
Existing methods have demonstrated that contrastive learning on image-text pairs effectively aligns visual segments with the meanings of texts.
A text often consists of multiple semantic concepts, whereas semantic segmentation strives to create semantically homogeneous segments.
arXiv Detail & Related papers (2024-04-05T17:25:17Z) - From Text Segmentation to Smart Chaptering: A Novel Benchmark for
Structuring Video Transcriptions [63.11097464396147]
We introduce a novel benchmark YTSeg focusing on spoken content that is inherently more unstructured and both topically and structurally diverse.
We also introduce an efficient hierarchical segmentation model MiniSeg, that outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2024-02-27T15:59:37Z) - Leveraging Open-Vocabulary Diffusion to Camouflaged Instance
Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z) - Shatter and Gather: Learning Referring Image Segmentation with Text
Supervision [52.46081425504072]
We present a new model that discovers semantic entities in input image and then combines such entities relevant to text query to predict the mask of the referent.
Our method was evaluated on four public benchmarks for referring image segmentation, where it clearly outperformed the existing method for the same task and recent open-vocabulary segmentation models on all the benchmarks.
arXiv Detail & Related papers (2023-08-29T15:39:15Z) - SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-conditional-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z) - Structured Summarization: Unified Text Segmentation and Segment Labeling
as a Generation Task [16.155438404910043]
We propose a single encoder-decoder neural network that can handle long documents and conversations.
We successfully show a way to solve the combined task as a pure generation task.
Our results establish a strong case for considering text segmentation and segment labeling as a whole.
arXiv Detail & Related papers (2022-09-28T01:08:50Z) - Unsupervised learning of text line segmentation by differentiating
coarse patterns [0.0]
We present an unsupervised deep learning method that embeds document image patches to a compact Euclidean space where distances correspond to a coarse text line pattern similarity.
Text line segmentation can be easily implemented using standard techniques with the embedded feature vectors.
We evaluate the method qualitatively and quantitatively on several variants of text line segmentation datasets to demonstrate its effectivity.
arXiv Detail & Related papers (2021-05-19T21:21:30Z) - Rethinking Text Segmentation: A Novel Dataset and A Text-Specific
Refinement Approach [34.63444886780274]
Text segmentation is a prerequisite in real-world text-related tasks.
We introduce Text Refinement Network (TexRNet), a novel text segmentation approach.
TexRNet consistently improves text segmentation performance by nearly 2% compared to other state-of-the-art segmentation methods.
arXiv Detail & Related papers (2020-11-27T22:50:09Z) - TextScanner: Reading Characters in Order for Robust Scene Text
Recognition [60.04267660533966]
TextScanner is an alternative approach for scene text recognition.
It generates pixel-wise, multi-channel segmentation maps for character class, position and order.
It also adopts RNN for context modeling and performs paralleled prediction for character position and class.
arXiv Detail & Related papers (2019-12-28T07:52:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.