Optimizing Nepali PDF Extraction: A Comparative Study of Parser and OCR Technologies
- URL: http://arxiv.org/abs/2407.04577v2
- Date: Tue, 09 Jul 2024 16:58:07 GMT
- Title: Optimizing Nepali PDF Extraction: A Comparative Study of Parser and OCR Technologies
- Authors: Prabin Paudel, Supriya Khadka, Ranju G. C., Rahul Shah
- Abstract summary: This research compares PDF parsing and Optical Character Recognition (OCR) methods for extracting Nepali content from PDFs.
OCR, specifically PyTesseract, overcomes challenges with non-Unicode Nepali fonts.
Considering the project's emphasis on Nepali PDFs, PyTesseract emerges as the most suitable library.
- Abstract: This research compares PDF parsing and Optical Character Recognition (OCR) methods for extracting Nepali content from PDFs. PDF parsing offers fast and accurate extraction but faces challenges with non-Unicode Nepali fonts. OCR, specifically PyTesseract, overcomes these challenges, providing versatility for both digital and scanned PDFs. The study reveals that while PDF parsers are faster, their accuracy fluctuates based on PDF types. In contrast, OCRs, with a focus on PyTesseract, demonstrate consistent accuracy at the expense of slightly longer extraction times. Considering the project's emphasis on Nepali PDFs, PyTesseract emerges as the most suitable library, balancing extraction speed and accuracy.
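The parser-vs-OCR trade-off described above can be sketched as a simple fallback heuristic: if a parser's output contains mostly Devanagari codepoints, the PDF likely uses Unicode Nepali fonts and the fast parser result can be kept; otherwise (e.g. Preeti-style legacy fonts that parse as ASCII mojibake) the page should be routed to OCR. This is an illustrative sketch, not the authors' implementation; the function names and the 0.5 threshold are assumptions.

```python
def devanagari_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the
    Devanagari Unicode block (U+0900-U+097F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    dev = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return dev / len(chars)


def choose_extractor(parsed_text: str, threshold: float = 0.5) -> str:
    """Keep the fast parser output when it looks like real Unicode
    Nepali; otherwise fall back to OCR (e.g. PyTesseract, lang='nep').
    The threshold is a hypothetical tuning parameter."""
    return "parser" if devanagari_ratio(parsed_text) >= threshold else "ocr"
```

For example, genuine Unicode Nepali such as "नेपाली पाठ" keeps the parser path, while legacy-font garbage like "g]kfnL kf7" (all ASCII) triggers the OCR fallback.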
Related papers
- A Comparative Study of PDF Parsing Tools Across Diverse Document Categories [0.0]
We compare 10 popular PDF parsing tools across 6 document categories using the DocLayNet dataset.
For text extraction, PyMuPDF and pypdfium generally outperformed others, but all text extractions struggled with Scientific and Patent documents.
In table detection, TATR excelled in the Financial, Patent, Law & Regulations, and Scientific categories.
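A tool comparison like the one above boils down to timing each extractor on the same documents. A minimal, hedged harness might look like the following; the callables are placeholders for real tool entry points (for instance PyMuPDF's `page.get_text()`), not the study's actual benchmark code.

```python
import time


def benchmark_extractors(extractors, document, runs=3):
    """Return {tool_name: best wall-clock seconds over `runs`} for each
    extraction callable applied to `document`. Taking the minimum of
    several runs reduces noise from the OS scheduler and caches."""
    timings = {}
    for name, extract in extractors.items():
        best = float("inf")
        for _ in range(runs):
            t0 = time.perf_counter()
            extract(document)
            best = min(best, time.perf_counter() - t0)
        timings[name] = best
    return timings
```

Accuracy would be measured separately, e.g. by edit distance against ground-truth text per document category.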
arXiv Detail & Related papers (2024-10-13T15:11:31Z)
- PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling [63.93112754821312]
Document understanding is a challenging task that requires processing and comprehending large amounts of textual and visual information.
Recent advances in Large Language Models (LLMs) have significantly improved the performance of this task.
We introduce PDF-WuKong, a multimodal large language model (MLLM) which is designed to enhance multimodal question-answering (QA) for long PDF documents.
arXiv Detail & Related papers (2024-10-08T12:17:42Z)
- Mero Nagarikta: Advanced Nepali Citizenship Data Extractor with Deep Learning-Powered Text Detection and OCR [0.0]
This work proposes a robust system using YOLOv8 for accurate text object detection and an OCR algorithm based on Optimized PyTesseract.
The system, implemented within the context of a mobile application, allows for the automated extraction of important textual information.
In testing, PyTesseract optimized for Nepali characters outperformed standard OCR in both flexibility and accuracy.
arXiv Detail & Related papers (2024-10-08T06:29:08Z)
- Hiding Sensitive Information Using PDF Steganography [3.6533698604619587]
We present a novel PDF steganography algorithm based upon least-significant bit insertion into the real-valued operands of PDF stream operators.
We also provide a case study which embeds malware into a given cover PDF document.
arXiv Detail & Related papers (2024-05-01T20:54:12Z)
- Training LLMs over Neurally Compressed Text [55.11828645767342]
This paper explores the idea of training large language models (LLMs) over highly compressed text.
We propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length.
We demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks.
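The Equal-Info Windows idea above can be approximated with a standard compressor: greedily grow each window until compressing one more character would exceed a fixed size budget, then start a new window. This sketch substitutes zlib for the paper's neural (language-model-based) compressor, so it only illustrates the segmentation scheme, not the actual method.

```python
import zlib


def equal_info_windows(text: str, budget: int = 40) -> list[str]:
    """Greedily split `text` into windows whose zlib-compressed size
    stays within `budget` bytes (a stand-in for equal bit length).
    The compressor is restarted at every window boundary."""
    windows, start = [], 0
    while start < len(text):
        end = start + 1
        # Grow the window while one more character still fits the budget.
        while (end < len(text)
               and len(zlib.compress(text[start:end + 1].encode())) <= budget):
            end += 1
        windows.append(text[start:end])
        start = end
    return windows
```

Concatenating the windows recovers the original text, and each window (except degenerate single-character ones) compresses to at most the budget.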
arXiv Detail & Related papers (2024-04-04T17:48:28Z)
- LOCR: Location-Guided Transformer for Optical Character Recognition [55.195165959662795]
We propose LOCR, a model that integrates location guiding into the transformer architecture during autoregression.
We train the model on a dataset comprising over 77M text-location pairs from 125K academic document pages, including bounding boxes for words, tables and mathematical symbols.
It outperforms all existing methods in our test set constructed from arXiv, as measured by edit distance, BLEU, METEOR and F-measure.
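Edit distance, the first metric cited above, is the Levenshtein distance between the OCR hypothesis and the ground-truth text. A minimal reference implementation (not the paper's evaluation code) using the standard dynamic-programming recurrence:

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning `ref` into `hyp`.
    Uses a rolling row, so memory is O(len(hyp))."""
    prev = list(range(len(hyp) + 1))
    for i, rc in enumerate(ref, 1):
        curr = [i]
        for j, hc in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (rc != hc)))   # substitution
        prev = curr
    return prev[-1]
```

For instance, `edit_distance("kitten", "sitting")` is 3 (two substitutions and one insertion); a perfect OCR transcription scores 0.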
arXiv Detail & Related papers (2024-03-04T15:34:12Z) - User-Centric Evaluation of OCR Systems for Kwak'wala [92.73847703011353]
We show that utilizing OCR reduces the time spent in the manual transcription of culturally valuable documents by over 50%.
Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
arXiv Detail & Related papers (2023-02-26T21:41:15Z)
- PGNet: Real-time Arbitrarily-Shaped Text Spotting with Point Gathering Network [54.03560668182197]
We propose a novel fully convolutional Point Gathering Network (PGNet) for reading arbitrarily-shaped text in real-time.
With a PG-CTC decoder, we gather high-level character classification vectors from two-dimensional space and decode them into text symbols without NMS and RoI operations.
Experiments show that the proposed method achieves competitive accuracy while significantly improving running speed.
arXiv Detail & Related papers (2021-04-12T13:27:34Z)
- PDFFlow: hardware accelerating parton density access [0.0]
We present PDFFlow, a new software package for fast evaluation of parton distribution functions (PDFs).
PDFFlow is designed for platforms with hardware accelerators.
We benchmark the performance of this library on multiple scenarios for the particle physics community.
arXiv Detail & Related papers (2020-12-15T11:22:12Z)
- Scene Text Image Super-Resolution in the Wild [112.90416737357141]
Low-resolution text images are often seen in natural scenes such as documents captured by mobile phones.
Previous single image super-resolution (SISR) methods are trained on synthetic low-resolution images.
We propose a real scene text SR dataset, termed TextZoom.
It contains paired real low-resolution and high-resolution images captured by cameras with different focal lengths in the wild.
arXiv Detail & Related papers (2020-05-07T09:18:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.