Vision-Enhanced Semantic Entity Recognition in Document Images via
Visually-Asymmetric Consistency Learning
- URL: http://arxiv.org/abs/2310.14785v1
- Date: Mon, 23 Oct 2023 10:37:22 GMT
- Title: Vision-Enhanced Semantic Entity Recognition in Document Images via
Visually-Asymmetric Consistency Learning
- Authors: Hao Wang, Xiahua Chen, Rui Wang, Chenhui Chu
- Abstract summary: Existing models commonly train a visual encoder with weak cross-modal supervision signals.
We propose a novel Visually-Asymmetric coNsistenCy Learning (VANCL) approach to capture fine-grained visual and layout features.
- Score: 19.28860833813788
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Extracting meaningful entities belonging to predefined categories from
Visually-rich Form-like Documents (VFDs) is a challenging task. Visual and
layout features such as font, background, color, and bounding box location and
size provide important cues for identifying entities of the same type. However,
existing models commonly train a visual encoder with weak cross-modal
supervision signals, resulting in a limited capacity to capture these
non-textual features and suboptimal performance. In this paper, we propose a
novel \textbf{V}isually-\textbf{A}symmetric co\textbf{N}sisten\textbf{C}y
\textbf{L}earning (\textsc{Vancl}) approach that addresses the above limitation
by enhancing the model's ability to capture fine-grained visual and layout
features through the incorporation of color priors. Experimental results on
benchmark datasets show that our approach substantially outperforms the strong
LayoutLM series baseline, demonstrating its effectiveness.
Additionally, we investigate the effects of different color schemes on our
approach, providing insights for optimizing model performance. We believe our
work will inspire future research on multimodal information extraction.
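
The abstract describes the approach only at a high level. As a rough illustration of what visually-asymmetric consistency learning with color priors could look like, the minimal PyTorch sketch below pairs a standard visual encoder with an auxiliary branch that sees a copy of the document image overlaid with prior colors, and ties the two feature sets together with a consistency loss. All module names, the teacher-style stop-gradient, and the cosine loss are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn

class AsymmetricConsistency(nn.Module):
    """Illustrative sketch (not the authors' code): enforce agreement between
    features of the plain document image and a color-prior-enhanced copy."""

    def __init__(self, visual_encoder: nn.Module, prior_encoder: nn.Module):
        super().__init__()
        self.visual_encoder = visual_encoder  # branch kept at inference time
        self.prior_encoder = prior_encoder    # extra branch that sees color priors (training only)

    def forward(self, image: torch.Tensor, image_with_color_prior: torch.Tensor) -> torch.Tensor:
        feat_plain = self.visual_encoder(image)                     # (B, N, D) region features
        with torch.no_grad():                                       # assumption: prior branch acts as a teacher
            feat_prior = self.prior_encoder(image_with_color_prior)
        # Consistency loss: pull plain-image features toward the color-prior features.
        return 1.0 - F.cosine_similarity(feat_plain, feat_prior, dim=-1).mean()
```

In such a setup the consistency term would presumably be added to the usual entity-tagging objective during training, while inference uses only the plain branch, so no color annotation is needed at test time.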
Related papers
- A Simple Background Augmentation Method for Object Detection with Diffusion Model [53.32935683257045] (2024-08-01)
In computer vision, it is well-known that a lack of data diversity will impair model performance.
We propose a simple yet effective data augmentation approach by leveraging advancements in generative models.
Background augmentation, in particular, significantly improves the models' robustness and generalization capabilities.
- Calibrated Self-Rewarding Vision Language Models [27.686545023186852] (2024-05-23)
Large Vision-Language Models (LVLMs) have made substantial progress by integrating pre-trained large language models (LLMs) and vision models through instruction tuning.
LVLMs often exhibit the hallucination phenomenon, where generated text responses appear linguistically plausible but contradict the input image.
We propose the Calibrated Self-Rewarding (CSR) approach, which enables the model to self-improve by iteratively generating candidate responses, evaluating the reward for each response, and curating preference data for fine-tuning.
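
The summary above outlines an iterative generate-score-curate loop; the following is a schematic, heavily simplified sketch of one such round. The function names and the best-vs-worst pairing are assumptions for illustration, not the paper's actual procedure.

```python
from typing import Callable, List, Tuple

def curate_preferences(
    generate_fn: Callable[[str], List[str]],  # returns candidate responses for a prompt (assumed interface)
    reward_fn: Callable[[str, str], float],   # scores a (prompt, response) pair with the model's own reward
    prompts: List[str],
) -> List[Tuple[str, str, str]]:
    """One self-rewarding round: sample candidates, score them, and keep
    (prompt, chosen, rejected) triples as preference data for fine-tuning."""
    pairs = []
    for prompt in prompts:
        candidates = generate_fn(prompt)
        ranked = sorted(candidates, key=lambda c: reward_fn(prompt, c))
        pairs.append((prompt, ranked[-1], ranked[0]))  # best vs. worst candidate
    return pairs
```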
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053] (2024-03-12)
We propose the concept of visual tokens, which map visual features to probability distributions over the Large Multi-modal Model's vocabulary.
We further explore the distribution of visual features in the semantic space within the LMM and the possibility of using text embeddings to represent visual information.
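
One plausible reading of "mapping visual features to probability distributions over the vocabulary" is a projection into the text embedding space followed by a softmax over similarities to the vocabulary table. The sketch below illustrates that reading; it is an assumption, not the paper's code.

```python
import torch
import torch.nn.functional as F
from torch import nn

class VisualWordHead(nn.Module):
    """Illustrative sketch: turn each visual patch feature into a probability
    distribution over the language model's vocabulary."""

    def __init__(self, visual_dim: int, text_dim: int, vocab_embeddings: torch.Tensor):
        super().__init__()
        self.proj = nn.Linear(visual_dim, text_dim)       # image space -> text embedding space
        self.register_buffer("vocab", vocab_embeddings)   # (V, text_dim) frozen vocabulary table

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (B, N, visual_dim) patch features from the vision encoder
        queries = self.proj(visual_feats)                 # (B, N, text_dim)
        logits = queries @ self.vocab.t()                 # similarity to every vocabulary token
        return F.softmax(logits, dim=-1)                  # (B, N, V) "visual word" distributions
```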
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767] (2023-12-22)
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on Cityscapes.
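
To make "learnable embeddings introduced to a pre-trained diffusion model" concrete, here is a minimal sketch in which a frozen denoising backbone is conditioned on trainable meta-prompt vectors instead of text, and its features are handed to a perception head. The backbone interface and all names are assumptions, not the paper's implementation.

```python
import torch
from torch import nn

class MetaPromptFeatureExtractor(nn.Module):
    """Illustrative sketch: trainable 'meta prompt' embeddings condition a frozen
    diffusion backbone so its features can be reused for perception tasks."""

    def __init__(self, diffusion_backbone: nn.Module, num_prompts: int = 16, dim: int = 768):
        super().__init__()
        self.backbone = diffusion_backbone.eval()
        for p in self.backbone.parameters():              # keep the pre-trained weights frozen
            p.requires_grad_(False)
        self.meta_prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, image_latents: torch.Tensor) -> torch.Tensor:
        b = image_latents.size(0)
        prompts = self.meta_prompts.unsqueeze(0).expand(b, -1, -1)
        # Assumed interface: backbone(latents, timestep, conditioning) -> feature maps
        timestep = torch.zeros(b, dtype=torch.long, device=image_latents.device)
        return self.backbone(image_latents, timestep, prompts)
```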
- Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143] (2023-07-31)
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection(VLFFD), which uses fine-grained sentence-level prompts as the annotation.
- Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638] (2023-03-16)
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
- Good Visual Guidance Makes A Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction [88.6585431949086] (2022-05-07)
We propose a novel Hierarchical Visual Prefix fusion NeTwork (HVPNeT) for visual-enhanced entity and relation extraction.
We regard the visual representation as a pluggable visual prefix that guides the textual representation toward error-insensitive predictions.
Experiments on three benchmark datasets demonstrate the effectiveness of our method, and achieve state-of-the-art performance.
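
As a rough illustration of the "pluggable visual prefix" idea, the sketch below projects a pooled image feature into a few prefix vectors and prepends them to the text token embeddings so the text encoder can attend to them. The prefix length, pooling, and fusion point are assumptions rather than the paper's exact design.

```python
import torch
from torch import nn

class VisualPrefixFusion(nn.Module):
    """Illustrative sketch: prepend projected visual features to the token
    sequence as a 'visual prefix' that guides the textual representation."""

    def __init__(self, visual_dim: int, hidden_dim: int, prefix_len: int = 4):
        super().__init__()
        self.prefix_len = prefix_len
        self.proj = nn.Linear(visual_dim, prefix_len * hidden_dim)

    def forward(self, visual_feat: torch.Tensor, token_embeds: torch.Tensor) -> torch.Tensor:
        # visual_feat: (B, visual_dim) pooled image feature
        # token_embeds: (B, T, hidden_dim) text token embeddings
        b, _, h = token_embeds.shape
        prefix = self.proj(visual_feat).view(b, self.prefix_len, h)
        return torch.cat([prefix, token_embeds], dim=1)   # (B, prefix_len + T, hidden_dim)
```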
This list is automatically generated from the titles and abstracts of the papers on this site.