Optimizing Nepali PDF Extraction: A Comparative Study of Parser and OCR Technologies
- URL: http://arxiv.org/abs/2407.04577v2
- Date: Tue, 09 Jul 2024 16:58:07 GMT
- Title: Optimizing Nepali PDF Extraction: A Comparative Study of Parser and OCR Technologies
- Authors: Prabin Paudel, Supriya Khadka, Ranju G. C., Rahul Shah
- Abstract summary: This research compares PDF parsing and Optical Character Recognition (OCR) methods for extracting Nepali content from PDFs.
OCR, specifically PyTesseract, overcomes challenges with non-Unicode Nepali fonts.
Considering the project's emphasis on Nepali PDFs, PyTesseract emerges as the most suitable library.
- Abstract: This research compares PDF parsing and Optical Character Recognition (OCR) methods for extracting Nepali content from PDFs. PDF parsing offers fast and accurate extraction but faces challenges with non-Unicode Nepali fonts. OCR, specifically PyTesseract, overcomes these challenges, providing versatility for both digital and scanned PDFs. The study reveals that while PDF parsers are faster, their accuracy fluctuates based on PDF types. In contrast, OCRs, with a focus on PyTesseract, demonstrate consistent accuracy at the expense of slightly longer extraction times. Considering the project's emphasis on Nepali PDFs, PyTesseract emerges as the most suitable library, balancing extraction speed and accuracy.
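The parser-vs-OCR trade-off described above can be sketched as a simple fallback heuristic: if a parser's output contains mostly Devanagari codepoints, the PDF likely uses Unicode Nepali fonts and the fast parser result can be kept; otherwise (e.g. Preeti-style legacy fonts that parse as ASCII mojibake) the page should be routed to OCR. This is an illustrative sketch, not the authors' implementation; the function names and the 0.5 threshold are assumptions.

```python
def devanagari_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the
    Devanagari Unicode block (U+0900-U+097F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    dev = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return dev / len(chars)


def choose_extractor(parsed_text: str, threshold: float = 0.5) -> str:
    """Keep the fast parser output when it looks like real Unicode
    Nepali; otherwise fall back to OCR (e.g. PyTesseract, lang='nep').
    The threshold is a hypothetical tuning parameter."""
    return "parser" if devanagari_ratio(parsed_text) >= threshold else "ocr"
```

For example, genuine Unicode Nepali such as "नेपाली पाठ" keeps the parser path, while legacy-font garbage like "g]kfnL kf7" (all ASCII) triggers the OCR fallback.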
Related papers
- A Comparative Study of PDF Parsing Tools Across Diverse Document Categories [0.0]
We compare 10 popular PDF parsing tools across 6 document categories using the DocLayNet dataset.
For text extraction, PyMuPDF and pypdfium generally outperformed others, but all text extractions struggled with Scientific and Patent documents.
In table detection, TATR excelled in the Financial, Patent, Law & Regulations, and Scientific categories.
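A tool comparison like the one above boils down to timing each extractor on the same documents. A minimal, hedged harness might look like the following; the callables are placeholders for real tool entry points (for instance PyMuPDF's `page.get_text()`), not the study's actual benchmark code.

```python
import time


def benchmark_extractors(extractors, document, runs=3):
    """Return {tool_name: best wall-clock seconds over `runs`} for each
    extraction callable applied to `document`. Taking the minimum of
    several runs reduces noise from the OS scheduler and caches."""
    timings = {}
    for name, extract in extractors.items():
        best = float("inf")
        for _ in range(runs):
            t0 = time.perf_counter()
            extract(document)
            best = min(best, time.perf_counter() - t0)
        timings[name] = best
    return timings
```

Accuracy would be measured separately, e.g. by edit distance against ground-truth text per document category.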
arXiv Detail & Related papers (2024-10-13T15:11:31Z)
- PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling [63.93112754821312]
Document understanding is a challenging task that requires processing and comprehending large amounts of textual and visual information.
Recent advances in Large Language Models (LLMs) have significantly improved the performance of this task.
We introduce PDF-WuKong, a multimodal large language model (MLLM) which is designed to enhance multimodal question-answering (QA) for long PDF documents.
arXiv Detail & Related papers (2024-10-08T12:17:42Z)
- Mero Nagarikta: Advanced Nepali Citizenship Data Extractor with Deep Learning-Powered Text Detection and OCR [0.0]
This work proposes a robust system using YOLOv8 for accurate text object detection and an OCR algorithm based on Optimized PyTesseract.
The system, implemented within the context of a mobile application, allows for the automated extraction of important textual information.
In testing, PyTesseract optimized for Nepali characters outperformed standard OCR in both flexibility and accuracy.
arXiv Detail & Related papers (2024-10-08T06:29:08Z)
- Hiding Sensitive Information Using PDF Steganography [3.6533698604619587]
We present a novel PDF steganography algorithm based upon least-significant bit insertion into the real-valued operands of PDF stream operators.
We also provide a case study which embeds malware into a given cover PDF document.
arXiv Detail & Related papers (2024-05-01T20:54:12Z)
- Training LLMs over Neurally Compressed Text [55.11828645767342]
This paper explores the idea of training large language models (LLMs) over highly compressed text.
We propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length.
We demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks.
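The Equal-Info Windows idea above can be approximated with a standard compressor: greedily grow each window until compressing one more character would exceed a fixed size budget, then start a new window. This sketch substitutes zlib for the paper's neural (language-model-based) compressor, so it only illustrates the segmentation scheme, not the actual method.

```python
import zlib


def equal_info_windows(text: str, budget: int = 40) -> list[str]:
    """Greedily split `text` into windows whose zlib-compressed size
    stays within `budget` bytes (a stand-in for equal bit length).
    The compressor is restarted at every window boundary."""
    windows, start = [], 0
    while start < len(text):
        end = start + 1
        # Grow the window while one more character still fits the budget.
        while (end < len(text)
               and len(zlib.compress(text[start:end + 1].encode())) <= budget):
            end += 1
        windows.append(text[start:end])
        start = end
    return windows
```

Concatenating the windows recovers the original text, and each window (except degenerate single-character ones) compresses to at most the budget.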
arXiv Detail & Related papers (2024-04-04T17:48:28Z)
- LOCR: Location-Guided Transformer for Optical Character Recognition [55.195165959662795]
We propose LOCR, a model that integrates location guiding into the transformer architecture during autoregression.
We train the model on a dataset comprising over 77M text-location pairs from 125K academic document pages, including bounding boxes for words, tables and mathematical symbols.
It outperforms all existing methods in our test set constructed from arXiv, as measured by edit distance, BLEU, METEOR and F-measure.
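Edit distance, the first metric cited above, is the Levenshtein distance between the OCR hypothesis and the ground-truth text. A minimal reference implementation (not the paper's evaluation code) using the standard dynamic-programming recurrence:

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning `ref` into `hyp`.
    Uses a rolling row, so memory is O(len(hyp))."""
    prev = list(range(len(hyp) + 1))
    for i, rc in enumerate(ref, 1):
        curr = [i]
        for j, hc in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (rc != hc)))   # substitution
        prev = curr
    return prev[-1]
```

For instance, `edit_distance("kitten", "sitting")` is 3 (two substitutions and one insertion); a perfect OCR transcription scores 0.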
arXiv Detail & Related papers (2024-03-04T15:34:12Z) - User-Centric Evaluation of OCR Systems for Kwak'wala [92.73847703011353]
We show that utilizing OCR reduces the time spent in the manual transcription of culturally valuable documents by over 50%.
Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
arXiv Detail & Related papers (2023-02-26T21:41:15Z)
- PGNet: Real-time Arbitrarily-Shaped Text Spotting with Point Gathering Network [54.03560668182197]
We propose a novel fully convolutional Point Gathering Network (PGNet) for reading arbitrarily-shaped text in real-time.
With a PG-CTC decoder, we gather high-level character classification vectors from two-dimensional space and decode them into text symbols without NMS and RoI operations.
Experiments show that the proposed method achieves competitive accuracy while significantly improving running speed.
arXiv Detail & Related papers (2021-04-12T13:27:34Z)
- PDFFlow: hardware accelerating parton density access [0.0]
We present PDFFlow, a new software package for fast evaluation of parton distribution functions (PDFs).
PDFFlow is designed for platforms with hardware accelerators.
We benchmark the performance of this library on multiple scenarios for the particle physics community.
arXiv Detail & Related papers (2020-12-15T11:22:12Z)
- Scene Text Image Super-Resolution in the Wild [112.90416737357141]
Low-resolution text images are often seen in natural scenes such as documents captured by mobile phones.
Previous single image super-resolution (SISR) methods are trained on synthetic low-resolution images.
We propose a real scene text SR dataset, termed TextZoom.
It contains paired real low-resolution and high-resolution images captured by cameras with different focal lengths in the wild.
arXiv Detail & Related papers (2020-05-07T09:18:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.