Visual Information Extraction in the Wild: Practical Dataset and
End-to-end Solution
- URL: http://arxiv.org/abs/2305.07498v2
- Date: Thu, 15 Jun 2023 03:31:12 GMT
- Title: Visual Information Extraction in the Wild: Practical Dataset and
End-to-end Solution
- Authors: Jianfeng Kuang, Wei Hua, Dingkang Liang, Mingkun Yang, Deqiang Jiang,
Bo Ren, and Xiang Bai
- Abstract summary: We propose a large-scale dataset consisting of camera images for visual information extraction (VIE).
We also propose a novel end-to-end framework for VIE that jointly learns the OCR and information extraction stages.
Evaluating existing end-to-end VIE methods on the proposed dataset, we observe that their performance drops noticeably from SROIE to our dataset due to the greater variance of layouts and entities.
- Score: 48.693941280097974
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual information extraction (VIE), which aims to simultaneously perform OCR
and information extraction in a unified framework, has drawn increasing
attention due to its essential role in various applications like understanding
receipts, goods, and traffic signs. However, existing benchmark datasets for
VIE consist mainly of document images that lack adequate diversity in layout
structures, background clutter, and entity categories, so they cannot fully
reveal the challenges of real-world applications. In this paper, we propose a
large-scale dataset of camera images for VIE, which exhibits not only greater
variance in layouts, backgrounds, and fonts but also many more entity types.
In addition, we propose a novel framework for VIE that combines the OCR and
information extraction stages in an end-to-end learning fashion. Different
from previous end-to-end approaches that
directly adopt OCR features as the input of an information extraction module,
we propose to use contrastive learning to narrow the semantic gap caused by the
difference between the tasks of OCR and information extraction. We evaluate the
existing end-to-end methods for VIE on the proposed dataset and observe that
the performance of these methods drops noticeably from SROIE (a widely used
English dataset) to our proposed dataset, owing to the greater variance of
layouts and entities. These results demonstrate that our dataset is better
suited to promoting advanced VIE algorithms. In addition, experiments show
that the proposed VIE method consistently achieves significant performance
gains on both the proposed and SROIE datasets.
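The abstract states that contrastive learning is used to narrow the semantic gap between OCR features and the information extraction module, but gives no implementation details. Below is a minimal sketch of one common way to realize such an alignment, an InfoNCE-style loss over paired region features; the ContrastiveAligner module, its projection heads, and all dimensions are assumptions for illustration, not the authors' actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveAligner(nn.Module):
    """Minimal InfoNCE-style alignment between OCR and IE feature spaces.

    Illustrative sketch only: `ocr_dim`, `ie_dim`, and the projection
    heads are assumptions, not the paper's exact formulation.
    """

    def __init__(self, ocr_dim: int, ie_dim: int, proj_dim: int = 256,
                 temperature: float = 0.07):
        super().__init__()
        self.ocr_proj = nn.Linear(ocr_dim, proj_dim)  # hypothetical head
        self.ie_proj = nn.Linear(ie_dim, proj_dim)    # hypothetical head
        self.temperature = temperature

    def forward(self, ocr_feats: torch.Tensor, ie_feats: torch.Tensor):
        # ocr_feats, ie_feats: (N, dim) features of the same N text regions.
        z_ocr = F.normalize(self.ocr_proj(ocr_feats), dim=-1)
        z_ie = F.normalize(self.ie_proj(ie_feats), dim=-1)
        logits = z_ocr @ z_ie.t() / self.temperature  # (N, N) similarities
        targets = torch.arange(len(logits), device=logits.device)
        # Symmetric InfoNCE: region i's OCR feature should match its own
        # IE feature (the diagonal), and vice versa.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

# Usage: this auxiliary loss would be added to the detection, recognition,
# and extraction objectives during joint end-to-end training.
aligner = ContrastiveAligner(ocr_dim=512, ie_dim=512)
loss = aligner(torch.randn(8, 512), torch.randn(8, 512))
```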
Related papers
- Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z)
- Vision-Enhanced Semantic Entity Recognition in Document Images via Visually-Asymmetric Consistency Learning [19.28860833813788]
Existing models commonly train a visual encoder with weak cross-modal supervision signals.
We propose a novel Visually-Asymmetric Consistency Learning (VANCL) approach to capture fine-grained visual and layout features.
arXiv Detail & Related papers (2023-10-23T10:37:22Z)
- RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
- infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
- Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images.
We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities (a minimal sketch of this point-based formulation appears after this list).
The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
- Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets.
This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets.
We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z)
- Towards Robust Visual Information Extraction in Real World: New Dataset and Novel Solution [30.438041837029875]
We propose a robust visual information extraction system (VIES) towards real-world scenarios.
VIES is a unified end-to-end trainable framework for simultaneous text detection, recognition and information extraction.
We construct a fully-annotated dataset called EPHOIE, which is the first Chinese benchmark for both text spotting and visual information extraction.
arXiv Detail & Related papers (2021-01-24T11:05:24Z)
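As referenced in the "Modeling Entities as Semantic Points" entry above, entities can be represented by their center points carrying semantic information. The sketch below is a minimal, assumed PyTorch illustration of such a point-based decoder (CenterNet-style category heatmaps plus a semantic embedding branch); the module name, shapes, and heads are hypothetical, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class SemanticPointHead(nn.Module):
    """Sketch of an entity-as-point decoder (assumed, CenterNet-style).

    Each entity is represented by its center point on a per-category
    heatmap; a parallel branch attaches a semantic embedding to each
    point for attribute and linking prediction. All names and shapes
    here are illustrative assumptions.
    """

    def __init__(self, feat_dim: int, num_entity_types: int,
                 embed_dim: int = 128):
        super().__init__()
        # One heatmap channel per entity category; peaks mark centers.
        self.heatmap = nn.Conv2d(feat_dim, num_entity_types, kernel_size=1)
        # Per-pixel semantic embedding, sampled at detected center points.
        self.embedding = nn.Conv2d(feat_dim, embed_dim, kernel_size=1)

    def forward(self, feats: torch.Tensor):
        # feats: (B, feat_dim, H, W) backbone features of a document image.
        heat = self.heatmap(feats).sigmoid()  # (B, K, H, W) center scores
        embed = self.embedding(feats)         # (B, D, H, W) point semantics
        return heat, embed

# Usage: local maxima of `heat` give candidate entity centers; the
# embeddings at those locations feed entity labeling and linking heads.
head = SemanticPointHead(feat_dim=256, num_entity_types=10)
heat, embed = head(torch.randn(1, 256, 64, 64))
```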
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.