ClapperText: A Benchmark for Text Recognition in Low-Resource Archival Documents
- URL: http://arxiv.org/abs/2510.15557v1
- Date: Fri, 17 Oct 2025 11:44:08 GMT
- Title: ClapperText: A Benchmark for Text Recognition in Low-Resource Archival Documents
- Authors: Tingyu Lin, Marco Peer, Florian Kleber, Robert Sablatnig
- Abstract summary: ClapperText is a benchmark dataset for handwritten and printed text recognition in visually degraded and low-resource settings. The dataset is derived from 127 World War II-era archival video segments containing clapperboards. Recognizing clapperboard text poses significant challenges, including motion blur, handwriting variation, exposure fluctuations, and cluttered backgrounds.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper presents ClapperText, a benchmark dataset for handwritten and printed text recognition in visually degraded and low-resource settings. The dataset is derived from 127 World War II-era archival video segments containing clapperboards that record structured production metadata such as date, location, and camera-operator identity. ClapperText includes 9,813 annotated frames and 94,573 word-level text instances, 67% of which are handwritten and 1,566 of which are partially occluded. Each instance includes transcription, semantic category, text type, and occlusion status, with annotations available as rotated bounding boxes represented as 4-point polygons to support spatially precise OCR applications. Recognizing clapperboard text poses significant challenges, including motion blur, handwriting variation, exposure fluctuations, and cluttered backgrounds, mirroring broader challenges in historical document analysis where structured content appears in degraded, non-standard forms. We provide both full-frame annotations and cropped word images to support downstream tasks. Using a consistent per-video evaluation protocol, we benchmark six representative recognition and seven detection models under zero-shot and fine-tuned conditions. Despite the small training set (18 videos), fine-tuning leads to substantial performance gains, highlighting ClapperText's suitability for few-shot learning scenarios. The dataset offers a realistic and culturally grounded resource for advancing robust OCR and document understanding in low-resource archival contexts. The dataset and evaluation code are available at https://github.com/linty5/ClapperText.
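The abstract describes word-level annotations carrying a transcription, semantic category, text type, occlusion status, and a rotated bounding box stored as a 4-point polygon. A minimal sketch of consuming such a record is shown below; the field names are illustrative assumptions, not the dataset's actual schema, so consult the ClapperText repository for the real format.

```python
# Minimal sketch of working with ClapperText-style word annotations.
# The dict keys below are assumed for illustration; the actual schema
# is defined in the ClapperText repository.
from typing import List, Tuple

Polygon = List[Tuple[float, float]]  # 4-point rotated box: [(x, y), ...]

def polygon_to_axis_aligned_box(poly: Polygon) -> Tuple[int, int, int, int]:
    """Collapse a 4-point rotated polygon into an axis-aligned
    (x_min, y_min, x_max, y_max) box, e.g. for cropping word images."""
    xs = [p[0] for p in poly]
    ys = [p[1] for p in poly]
    return (int(min(xs)), int(min(ys)), int(max(xs)), int(max(ys)))

# Illustrative record mirroring the fields the paper lists:
# transcription, semantic category, text type, and occlusion status.
annotation = {
    "transcription": "CAMERA",
    "category": "camera-operator",   # assumed field name
    "text_type": "handwritten",      # handwritten or printed
    "occluded": False,
    "polygon": [(10.0, 20.0), (110.0, 25.0), (108.0, 55.0), (8.0, 50.0)],
}

box = polygon_to_axis_aligned_box(annotation["polygon"])
print(box)  # -> (8, 20, 110, 55)
```

Keeping the full polygon preserves the rotation information needed for spatially precise OCR; the axis-aligned box is only a convenience for cropping.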
Related papers
- TextBite: A Historical Czech Document Dataset for Logical Page Segmentation
Previous approaches have relied on OCR or precise geometry to define logical segmentation. To avoid the need for OCR, we define the task purely as segmentation in the image domain. We introduce TextBite, a dataset of historical Czech documents spanning the 18th to 20th centuries. The dataset comprises 8,449 page images with 78,863 annotated segments of logically and thematically coherent text.
arXiv Detail & Related papers (2025-03-20T19:19:12Z)
- DELINE8K: A Synthetic Data Pipeline for the Semantic Segmentation of Historical Documents
Document semantic segmentation can facilitate document analysis tasks, including OCR, form classification, and document editing.
Several synthetic datasets have been developed to distinguish handwriting from printed text, but they fall short in class variety and document diversity.
We propose the most comprehensive document semantic segmentation pipeline to date, incorporating preprinted text, handwriting, and document backgrounds from over 10 sources.
Our customized dataset exhibits superior performance on the NAFSS benchmark, demonstrating its promise as a tool for further research.
arXiv Detail & Related papers (2024-04-30T04:53:10Z)
- LOCR: Location-Guided Transformer for Optical Character Recognition
We propose LOCR, a model that integrates location guiding into the transformer architecture during autoregression.
We train the model on a dataset comprising over 77M text-location pairs from 125K academic document pages, including bounding boxes for words, tables and mathematical symbols.
It outperforms all existing methods in our test set constructed from arXiv, as measured by edit distance, BLEU, METEOR and F-measure.
arXiv Detail & Related papers (2024-03-04T15:34:12Z)
- COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
Recent autoregressive vision-language models have excelled in few-shot text generation tasks but face challenges in alignment tasks.
We introduce the contrastive loss into text generation models, partitioning the language model into dedicated unimodal text processing and adept multimodal data handling components.
To bridge this gap, this work introduces VideoDatasetName, an inaugural interleaved video-text dataset featuring comprehensive captions.
arXiv Detail & Related papers (2024-01-01T18:58:42Z)
- Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs
We present a text-grounding document understanding model, termed TGDoc, which enhances MLLMs with the ability to discern the spatial positioning of text within images.
We formulate instruction tuning tasks including text detection, recognition, and spotting to facilitate the cohesive alignment between the visual encoder and large language model.
Our method achieves state-of-the-art performance across multiple text-rich benchmarks, validating its effectiveness.
arXiv Detail & Related papers (2023-11-22T06:46:37Z)
- Handwritten and Printed Text Segmentation: A Signature Case Study
We develop novel approaches to address the challenges of handwritten and printed text segmentation.
Our objective is to recover text from different classes in their entirety, especially enhancing the segmentation performance on overlapping sections.
Our best configuration outperforms prior work on two different datasets by 17.9% and 7.3% on IoU scores.
arXiv Detail & Related papers (2023-07-15T21:49:22Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- ContextCLIP: Contextual Alignment of Image-Text pairs on CLIP visual representations
We propose ContextCLIP, a contextual and contrastive learning framework for the contextual alignment of image-text pairs.
Our framework improves image-text alignment by contextually aligning text and image representations in the joint embedding space.
ContextCLIP showed good qualitative performance for text-to-image retrieval tasks and enhanced classification accuracy.
arXiv Detail & Related papers (2022-11-14T05:17:51Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text
We propose TextOCR, an arbitrary-shaped scene text detection and recognition dataset with 900k annotated words collected on real images.
We show that current state-of-the-art text-recognition (OCR) models fail to perform well on TextOCR.
We use a TextOCR-trained OCR model to create the PixelM4C model, which can perform scene-text-based reasoning on an image in an end-to-end fashion.
arXiv Detail & Related papers (2021-05-12T07:50:42Z)
- Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset in the same writing style as the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.