Related papers: Seeing Justice Clearly: Handwritten Legal Document Translation with OCR and Vision-Language Models

Seeing Justice Clearly: Handwritten Legal Document Translation with OCR and Vision-Language Models

URL: http://arxiv.org/abs/2512.18004v1
Date: Fri, 19 Dec 2025 19:06:14 GMT
Title: Seeing Justice Clearly: Handwritten Legal Document Translation with OCR and Vision-Language Models
Authors: Shubham Kumar Nigam, Parjanya Aditya Shukla, Noel Shallum, Arnab Bhattacharya,
Abstract summary: Handwritten text recognition (HTR) and machine translation continue to pose significant challenges.<n>Traditional OCR system extracts text from handwritten images, which is then translated into the target language using a machine translation model.<n>In this work, we compare the performance of traditional OCR-MT pipelines with Vision Large Language Models that aim to unify these stages.<n>Our motivation is grounded in the urgent need for scalable, accurate translation systems to digitize legal records in India's district and high courts.
Score: 8.62418063092899
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Handwritten text recognition (HTR) and machine translation continue to pose significant challenges, particularly for low-resource languages like Marathi, which lack large digitized corpora and exhibit high variability in handwriting styles. The conventional approach to address this involves a two-stage pipeline: an OCR system extracts text from handwritten images, which is then translated into the target language using a machine translation model. In this work, we explore and compare the performance of traditional OCR-MT pipelines with Vision Large Language Models that aim to unify these stages and directly translate handwritten text images in a single, end-to-end step. Our motivation is grounded in the urgent need for scalable, accurate translation systems to digitize legal records such as FIRs, charge sheets, and witness statements in India's district and high courts. We evaluate both approaches on a curated dataset of handwritten Marathi legal documents, with the goal of enabling efficient legal document processing, even in low-resource environments. Our findings offer actionable insights toward building robust, edge-deployable solutions that enhance access to legal information for non-native speakers and legal professionals alike.

Related papers

LEMUR: A Corpus for Robust Fine-Tuning of Multilingual Law Embedding Models for Retrieval [18.46710400838861]
Large language models (LLMs) are increasingly used to access legal information.<n>Yet, their deployment in multilingual legal settings is constrained by unreliable retrieval and the lack of domain-adapted, open-embedding models.<n>We introduce LEMUR, a large-scale multilingual corpus of EU environmental legislation constructed from 24,953 official EUR-Lex PDF documents covering 25 languages.
arXiv Detail & Related papers (2026-02-10T09:20:24Z)
Handwritten Text Recognition for Low Resource Languages [4.4322265742680305]
This paper introduces BharatOCR, a novel segmentation-free paragraph-level handwritten Hindi and Urdu text recognition.<n>We propose a ViT-Transformer Decoder-LM architecture for handwritten text recognition, where a Vision Transformer (ViT) extracts visual features, a Transformer decoder generates text sequences, and a pre-trained language model (LM) refines the output to improve accuracy, fluency, and coherence.<n>The proposed model was evaluated using our custom dataset ('Parimal Urdu') and ('Parimal Hindi'), introduced in this research work, as well as two public datasets.
arXiv Detail & Related papers (2025-12-01T07:01:52Z)
SERVAL: Surprisingly Effective Zero-Shot Visual Document Retrieval Powered by Large Vision and Language Models [17.85605201420847]
Visual Document Retrieval (VDR) typically operates as text-to-image retrieval using specialized bi-encoders trained to directly embed document images.<n>We revisit a zero-shot generate-and-encode pipeline: a vision-language model first produces a detailed textual description of each document image.<n>On the ViDoRe-v2 benchmark, the method reaches 63.4% nDCG@5, surpassing the strongest specialised multi-vector visual document encoder.
arXiv Detail & Related papers (2025-09-18T21:11:13Z)
UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model [108.85584502396182]
We propose UReader, a first exploration of universal OCR-free visually-situated language understanding based on the Multimodal Large Language Model (MLLM) By leveraging the shallow text recognition ability of the MLLM, we only finetuned 1.2% parameters. Our single model achieves state-of-the-art ocr-free performance in 8 out of 10 visually-situated language understanding tasks.
arXiv Detail & Related papers (2023-10-08T11:33:09Z)
Improving Joint Speech-Text Representations Without Alignment [92.60384956736536]
We show that joint speech-text encoders naturally achieve consistent representations across modalities by disregarding sequence length. We argue that consistency losses could forgive length differences and simply assume the best alignment.
arXiv Detail & Related papers (2023-08-11T13:28:48Z)
Cross-modality Data Augmentation for End-to-End Sign Language Translation [66.46877279084083]
End-to-end sign language translation (SLT) aims to convert sign language videos into spoken language texts directly without intermediate representations. It has been a challenging task due to the modality gap between sign videos and texts and the data scarcity of labeled data. We propose a novel Cross-modality Data Augmentation (XmDA) framework to transfer the powerful gloss-to-text translation capabilities to end-to-end sign language translation.
arXiv Detail & Related papers (2023-05-18T16:34:18Z)
OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks. OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
LegalRelectra: Mixed-domain Language Modeling for Long-range Legal Text Comprehension [6.442209435258797]
LegalRelectra is a legal-domain language model trained on mixed-domain legal and medical corpora. Our training architecture implements the Electra framework, but utilizes Reformer instead of BERT for its generator and discriminator.
arXiv Detail & Related papers (2022-12-16T00:15:14Z)
Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection. We propose to learn contextualized, joint representations through vision-language pre-training. The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
Lacuna Reconstruction: Self-supervised Pre-training for Low-Resource Historical Document Transcription [25.76860672652937]
We show a meaningful improvement in recognition accuracy over the same supervised model trained from scratch with as few as 30 line image transcriptions for training. Our masked language model-style pre-training strategy, where the model is trained to be able to identify the true masked visual representation from distractors sampled from within the same line, encourages learning robust contextualized language representations.
arXiv Detail & Related papers (2021-12-16T08:28:26Z)
LAWDR: Language-Agnostic Weighted Document Representations from Pre-trained Models [8.745407715423992]
Cross-lingual document representations enable language understanding in multilingual contexts. Large pre-trained language models such as BERT, XLM and XLM-RoBERTa have achieved great success when fine-tuned on sentence-level downstream tasks.
arXiv Detail & Related papers (2021-06-07T07:14:00Z)
UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning. We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM) Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.