MMOCR: A Comprehensive Toolbox for Text Detection, Recognition and
Understanding
- URL: http://arxiv.org/abs/2108.06543v1
- Date: Sat, 14 Aug 2021 14:10:23 GMT
- Title: MMOCR: A Comprehensive Toolbox for Text Detection, Recognition and
Understanding
- Authors: Zhanghui Kuang, Hongbin Sun, Zhizhong Li, Xiaoyu Yue, Tsui Hin Lin,
Jianyong Chen, Huaqiang Wei, Yiqin Zhu, Tong Gao, Wenwei Zhang, Kai Chen,
Wayne Zhang, Dahua Lin
- Abstract summary: MMOCR is an open-source toolbox for text detection and recognition.
It implements 14 state-of-the-art algorithms, more than any other open-source OCR project we are aware of to date.
- Score: 70.16678926775475
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present MMOCR, an open-source toolbox that provides a comprehensive
pipeline for text detection and recognition, as well as their downstream tasks
such as named entity recognition and key information extraction. MMOCR
implements 14 state-of-the-art algorithms, significantly more than any other
open-source OCR project we are aware of to date. To facilitate
future research and industrial applications of text recognition-related
problems, we also provide a large number of trained models and detailed
benchmarks to give insights into the performance of text detection, recognition
and understanding. MMOCR is publicly released at
https://github.com/open-mmlab/mmocr.
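For readers who want to try the toolbox, here is a minimal usage sketch based on MMOCR's documented inferencer API (MMOCRInferencer, available in MMOCR 1.x); the model names ('DBNet', 'CRNN') are two of the implemented algorithms, the input path 'demo.jpg' is a placeholder, and exact names may differ between releases.

```python
# A minimal sketch, assuming MMOCR >= 1.0 (pip install mmocr) and its
# documented MMOCRInferencer API; details may vary across versions.
from mmocr.apis import MMOCRInferencer

# Compose an end-to-end pipeline from two of the implemented algorithms:
# DBNet for text detection and CRNN for text recognition. Pretrained
# weights are fetched from the model zoo on first use.
ocr = MMOCRInferencer(det='DBNet', rec='CRNN')

# 'demo.jpg' is a placeholder path to any image containing text.
result = ocr('demo.jpg')

# Each prediction holds detected polygons plus recognized strings and scores.
pred = result['predictions'][0]
for text, score in zip(pred['rec_texts'], pred['rec_scores']):
    print(f'{text} ({score:.2f})')
```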
Related papers
- See then Tell: Enhancing Key Information Extraction with Vision Grounding [54.061203106565706]
We introduce STNet (See then Tell Net), a novel end-to-end model designed to deliver precise answers with relevant vision grounding.
To enhance the model's seeing capabilities, we collect extensive structured table recognition datasets.
arXiv Detail & Related papers (2024-09-29T06:21:05Z)
- Exploring the Capabilities of Large Multimodal Models on Dense Text [58.82262549456294]
We propose the DT-VQA dataset, with 170k question-answer pairs.
In this paper, we conduct a comprehensive evaluation of GPT4V, Gemini, and various open-source LMMs.
We find that even with automatically labeled training datasets, significant improvements in model performance can be achieved.
arXiv Detail & Related papers (2024-05-09T07:47:25Z)
- DLoRA-TrOCR: Mixed Text Mode Optical Character Recognition Based On Transformer [12.966765239586994]
Multiple fonts, mixed scenes, and complex layouts severely degrade the recognition accuracy of traditional OCR models.
We propose DLoRA-TrOCR, a parameter-efficient mixed-text recognition method based on a pre-trained OCR Transformer.
arXiv Detail & Related papers (2024-04-19T09:28:16Z)
- Exploring OCR Capabilities of GPT-4V(ision): A Quantitative and In-depth Evaluation [33.66939971907121]
The evaluation reveals that GPT-4V performs well in recognizing and understanding Latin content, but struggles with multilingual scenarios and complex tasks.
In general, despite its versatility in handling diverse OCR tasks, GPT-4V does not outperform existing state-of-the-art OCR models.
arXiv Detail & Related papers (2023-10-25T17:38:55Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
- Synergistic Interplay between Search and Large Language Models for Information Retrieval [141.18083677333848]
InteR allows retrieval models (RMs) to expand the knowledge in queries using LLM-generated knowledge collections.
InteR achieves overall superior zero-shot retrieval performance compared to state-of-the-art methods.
arXiv Detail & Related papers (2023-05-12T11:58:15Z)
- PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks [5.210482046387142]
Key Information Extraction from documents remains a challenge.
We introduce PICK, a framework that is effective and robust in handling complex document layouts for KIE.
Our method outperforms baseline methods by significant margins.
arXiv Detail & Related papers (2020-04-16T05:20:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.