Related papers: Seeing Straight: Document Orientation Detection for Efficient OCR

Seeing Straight: Document Orientation Detection for Efficient OCR

URL: http://arxiv.org/abs/2511.04161v1
Date: Thu, 06 Nov 2025 08:04:57 GMT
Title: Seeing Straight: Document Orientation Detection for Efficient OCR
Authors: Suranjan Goswami, Abhinav Ravi, Raja Kolla, Ali Faraz, Shaharukh Khan, Akash, Chandra Khatri, Shubham Agarwal,
Abstract summary: OCR-Rotation-Bench (ORB) is a new benchmark for evaluating OCR to image rotations.<n>We present a fast, robust and lightweight rotation classification pipeline built on the vision encoder of Phi-3.5-Vision model.<n>Our method achieves near-perfect 96% and 92% accuracy on identifying the rotations respectively on both the datasets.
Score: 2.7873355152549344
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite significant advances in document understanding, determining the correct orientation of scanned or photographed documents remains a critical pre-processing step in the real world settings. Accurate rotation correction is essential for enhancing the performance of downstream tasks such as Optical Character Recognition (OCR) where misalignment commonly arises due to user errors, particularly incorrect base orientations of the camera during capture. In this study, we first introduce OCR-Rotation-Bench (ORB), a new benchmark for evaluating OCR robustness to image rotations, comprising (i) ORB-En, built from rotation-transformed structured and free-form English OCR datasets, and (ii) ORB-Indic, a novel multilingual set spanning 11 Indic mid to low-resource languages. We also present a fast, robust and lightweight rotation classification pipeline built on the vision encoder of Phi-3.5-Vision model with dynamic image cropping, fine-tuned specifically for 4-class rotation task in a standalone fashion. Our method achieves near-perfect 96% and 92% accuracy on identifying the rotations respectively on both the datasets. Beyond classification, we demonstrate the critical role of our module in boosting OCR performance: closed-source (up to 14%) and open-weights models (up to 4x) in the simulated real-world setting.

Related papers

Rotation Equivariant Arbitrary-scale Image Super-Resolution [62.41329042683779]
The arbitrary-scale image super-resolution (ASISR) aims to achieve arbitrary-scale high-resolution recoveries from a low-resolution input image.<n>We make efforts to construct a rotation equivariant ASISR method in this study.
arXiv Detail & Related papers (2025-08-07T08:51:03Z)
Feature Alignment with Equivariant Convolutions for Burst Image Super-Resolution [52.55429225242423]
We propose a novel framework for Burst Image Super-Resolution (BISR), featuring an equivariant convolution-based alignment.<n>This enables the alignment transformation to be learned via explicit supervision in the image domain and easily applied in the feature domain.<n>Experiments on BISR benchmarks show the superior performance of our approach in both quantitative metrics and visual quality.
arXiv Detail & Related papers (2025-03-11T11:13:10Z)
Mixed Text Recognition with Efficient Parameter Fine-Tuning and Transformer [12.966765239586994]
This paper proposes DLoRA-TrOCR, a parameter-efficient hybrid text spotting method based on a pre-trained OCR Transformer.<n>By embedding a weight-decomposed DoRA module in the image encoder and a LoRA module in the text decoder, this method can be efficiently fine-tuned on various downstream tasks.<n> Experiments show that our proposed DLoRA-TrOCR outperforms other parameter-efficient fine-tuning methods in recognizing complex scenes with mixed handwritten, printed, and street text.
arXiv Detail & Related papers (2024-04-19T09:28:16Z)
SPE-Net: Boosting Point Cloud Analysis via Rotation Robustness Enhancement [118.20816888815658]
We propose a novel deep architecture tailored for 3D point cloud applications, named as SPE-Net. The embedded Selective Position variant' procedure relies on an attention mechanism that can effectively attend to the underlying rotation condition of the input. We demonstrate the merits of the SPE-Net and the associated hypothesis on four benchmarks, showing evident improvements on both rotated and unrotated test data over SOTA methods.
arXiv Detail & Related papers (2022-11-15T15:59:09Z)
Rolling Shutter Inversion: Bring Rolling Shutter Images to High Framerate Global Shutter Video [111.08121952640766]
This paper presents a novel deep-learning based solution to the RS temporal super-resolution problem. By leveraging the multi-view geometry relationship of the RS imaging process, our framework successfully achieves high framerate GS generation. Our method can produce high-quality GS image sequences with rich details, outperforming the state-of-the-art methods.
arXiv Detail & Related papers (2022-10-06T16:47:12Z)
3D Rendering Framework for Data Augmentation in Optical Character Recognition [8.641647607173864]
We propose a data augmentation framework for Optical Character Recognition (OCR) The proposed framework is able to synthesize new viewing angles and illumination scenarios. We demonstrate the performance of our framework by augmenting a 15% subset of the common Brno Mobile OCR dataset.
arXiv Detail & Related papers (2022-09-27T19:31:23Z)
An Evaluation of OCR on Egocentric Data [30.637021477342035]
In this paper, we evaluate state-of-the-art OCR methods on Egocentric data. We demonstrate that existing OCR methods struggle with rotated text, which is frequently observed on objects being handled. We introduce a simple rotate-and-merge procedure which can be applied to pre-trained OCR models that halves the normalized edit distance error.
arXiv Detail & Related papers (2022-06-11T10:37:20Z)
Donut: Document Understanding Transformer without OCR [17.397447819420695]
We propose a novel VDU model that is end-to-end trainable without underpinning OCR framework. Our approach achieves state-of-the-art performance on various document understanding tasks in public benchmark datasets and private industrial service datasets.
arXiv Detail & Related papers (2021-11-30T18:55:19Z)
Lights, Camera, Action! A Framework to Improve NLP Accuracy over OCR documents [2.6201102730518606]
We demonstrate an effective framework for mitigating OCR errors for any downstream NLP task. We first address the data scarcity problem for model training by constructing a document synthesis pipeline. For the benefit of the community, we have made the document synthesis pipeline available as an open-source project.
arXiv Detail & Related papers (2021-08-06T00:32:54Z)
Robust Reference-based Super-Resolution via C2-Matching [77.51610726936657]
Super-Resolution (Ref-SR) has recently emerged as a promising paradigm to enhance a low-resolution (LR) input image by introducing an additional high-resolution (HR) reference image. Existing Ref-SR methods mostly rely on implicit correspondence matching to borrow HR textures from reference images to compensate for the information loss in input images. We propose C2-Matching, which produces explicit robust matching crossing transformation and resolution.
arXiv Detail & Related papers (2021-06-03T16:40:36Z)
Neural BRDF Representation and Importance Sampling [79.84316447473873]
We present a compact neural network-based representation of reflectance BRDF data. We encode BRDFs as lightweight networks, and propose a training scheme with adaptive angular sampling. We evaluate encoding results on isotropic and anisotropic BRDFs from multiple real-world datasets.
arXiv Detail & Related papers (2021-02-11T12:00:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.