Good Visual Guidance Makes A Better Extractor: Hierarchical Visual
Prefix for Multimodal Entity and Relation Extraction
- URL: http://arxiv.org/abs/2205.03521v1
- Date: Sat, 7 May 2022 02:10:55 GMT
- Title: Good Visual Guidance Makes A Better Extractor: Hierarchical Visual
Prefix for Multimodal Entity and Relation Extraction
- Authors: Xiang Chen, Ningyu Zhang, Lei Li, Yunzhi Yao, Shumin Deng, Chuanqi
Tan, Fei Huang, Luo Si, Huajun Chen
- Abstract summary: We propose a novel Hierarchical Visual Prefix fusion NeTwork (HVPNeT) for visual-enhanced entity and relation extraction.
We regard the visual representation as a pluggable visual prefix that guides the textual representation toward error-insensitive forecasting decisions.
Experiments on three benchmark datasets demonstrate the effectiveness of our method, which achieves state-of-the-art performance.
- Score: 88.6585431949086
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal named entity recognition and relation extraction (MNER and MRE)
form a fundamental and crucial branch of information extraction. However, existing
approaches for MNER and MRE usually suffer from error sensitivity when
irrelevant object images are incorporated into texts. To deal with these issues, we
propose a novel Hierarchical Visual Prefix fusion NeTwork (HVPNeT) for
visual-enhanced entity and relation extraction, aiming to achieve more
effective and robust performance. Specifically, we regard the visual representation
as a pluggable visual prefix that guides the textual representation toward
error-insensitive forecasting decisions. We further propose a dynamic gated
aggregation strategy to obtain hierarchical multi-scaled visual features as
the visual prefix for fusion. Extensive experiments on three benchmark datasets
demonstrate the effectiveness of our method, which achieves state-of-the-art
performance. Code is available at https://github.com/zjunlp/HVPNeT.
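To make the fusion idea concrete, the following is a minimal, illustrative sketch rather than the released implementation (which is linked above). It assumes multi-scale visual features from a CNN backbone and a BERT-style text encoder, and all module and variable names (`DynamicGatedAggregation`, `VisualPrefixAttention`, etc.) are invented for this example. A text-conditioned gate softly mixes the feature scales into one set of visual tokens, which are then prepended as extra keys and values so that textual queries can attend to them, approximating the prefix-style fusion described in the abstract.

```python
import torch
import torch.nn as nn

class DynamicGatedAggregation(nn.Module):
    """Illustrative: softly weight multi-scale visual features with a text-conditioned gate."""
    def __init__(self, num_scales: int, txt_dim: int):
        super().__init__()
        self.gate = nn.Linear(txt_dim, num_scales)

    def forward(self, text_state, vis_feats):
        # text_state: (B, txt_dim) pooled text representation
        # vis_feats: list of length S, each (B, N_v, vis_dim)
        weights = torch.softmax(self.gate(text_state), dim=-1)      # (B, S)
        stacked = torch.stack(vis_feats, dim=1)                     # (B, S, N_v, vis_dim)
        return (weights[:, :, None, None] * stacked).sum(dim=1)     # (B, N_v, vis_dim)

class VisualPrefixAttention(nn.Module):
    """Illustrative: prepend projected visual tokens as extra key/value pairs."""
    def __init__(self, vis_dim: int, txt_dim: int, num_heads: int = 8):
        super().__init__()
        self.key_proj = nn.Linear(vis_dim, txt_dim)
        self.value_proj = nn.Linear(vis_dim, txt_dim)
        self.attn = nn.MultiheadAttention(txt_dim, num_heads, batch_first=True)

    def forward(self, text_hidden, vis_prefix):
        # text_hidden: (B, N_t, txt_dim); vis_prefix: (B, N_v, vis_dim)
        keys = torch.cat([self.key_proj(vis_prefix), text_hidden], dim=1)
        values = torch.cat([self.value_proj(vis_prefix), text_hidden], dim=1)
        fused, _ = self.attn(text_hidden, keys, values)  # queries remain purely textual
        return fused

# Toy usage: 2 samples, 3 visual scales of 49 tokens each, 32 text tokens.
vis_feats = [torch.randn(2, 49, 256) for _ in range(3)]
text_hidden = torch.randn(2, 32, 768)
aggregate = DynamicGatedAggregation(num_scales=3, txt_dim=768)
fuse = VisualPrefixAttention(vis_dim=256, txt_dim=768)
prefix = aggregate(text_hidden[:, 0], vis_feats)  # gate on a [CLS]-like state
fused = fuse(text_hidden, prefix)                 # (2, 32, 768)
```

In a full model, a block like this would plausibly sit inside each Transformer layer so that every layer receives its own gated mixture of scales; it is shown once here for brevity.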
Related papers
- SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion [20.016192628108158]
Visual grounding is a common vision task that involves grounding descriptive sentences to the corresponding regions of an image.
Most existing methods use independent image-text encoding and apply complex hand-crafted modules or encoder-decoder architectures for modal interaction and query reasoning.
This is because the independent-encoding paradigm only utilizes limited downstream data to fit the multi-modal feature fusion.
In this paper, we present a simple yet robust transformer-based framework, SimVG, for visual grounding.
arXiv Detail & Related papers (2024-09-26T04:36:19Z) - Vision-Enhanced Semantic Entity Recognition in Document Images via
Visually-Asymmetric Consistency Learning [19.28860833813788]
Existing models commonly train a visual encoder with weak cross-modal supervision signals.
We propose a novel Visually-Asymmetric coNsistenCy Learning (VANCL) approach to capture fine-grained visual and layout features.
arXiv Detail & Related papers (2023-10-23T10:37:22Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation
Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - Dual-Gated Fusion with Prefix-Tuning for Multi-Modal Relation Extraction [13.454953507205278]
Multi-Modal Relation Extraction (MMRE) aims at identifying the relation between two entities in texts that contain visual clues.
We propose a novel MMRE framework to better capture the deeper correlations of text, entity pair, and image/objects.
Our approach achieves excellent performance compared to strong competitors, even in the few-shot situation.
arXiv Detail & Related papers (2023-06-19T15:31:34Z) - Information Screening whilst Exploiting! Multimodal Relation Extraction
with Feature Denoising and Multimodal Topic Modeling [96.75821232222201]
Existing research on multimodal relation extraction (MRE) faces two co-existing challenges, internal-information over-utilization and external-information under-exploitation.
We propose a novel framework that simultaneously implements the idea of internal-information screening and external-information exploiting.
arXiv Detail & Related papers (2023-05-19T14:56:57Z) - Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z) - On Analyzing the Role of Image for Visual-enhanced Relation Extraction [36.84650189600189]
In this paper, we present an in-depth empirical analysis indicating that inaccurate information in the visual scene graph leads to poor modal alignment weights.
We propose a strong baseline with an implicit fine-grained multimodal alignment based on Transformer for multimodal relation extraction.
arXiv Detail & Related papers (2022-11-14T16:39:24Z) - Bear the Query in Mind: Visual Grounding with Query-conditioned
Convolution [26.523051615516742]
We propose a Query-conditioned Convolution Module (QCM) that extracts query-aware visual features by incorporating query information into the generation of convolutional kernels.
Our method achieves state-of-the-art performance on three popular visual grounding datasets (a minimal sketch of the query-conditioned kernel idea appears after this list).
arXiv Detail & Related papers (2022-06-18T04:26:39Z) - Two-stage Visual Cues Enhancement Network for Referring Image
Segmentation [89.49412325699537]
Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred to by a given natural language expression.
In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net).
Through the two-stage enhancement, our proposed TV-Net achieves better performance in learning fine-grained matching behaviors between the natural language expression and the image.
arXiv Detail & Related papers (2021-10-09T02:53:39Z) - Language and Visual Entity Relationship Graph for Agent Navigation [54.059606864535304]
Vision-and-Language Navigation (VLN) requires an agent to navigate in a real-world environment following natural language instructions.
We propose a novel Language and Visual Entity Relationship Graph for modelling the inter-modal relationships between text and vision.
Experiments show that by taking advantage of these relationships, we are able to improve over the state of the art.
arXiv Detail & Related papers (2020-10-19T08:25:55Z)
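Among the related papers above, the query-conditioned convolution idea is mechanistic enough to illustrate with a short sketch. The code below is an assumption-laden illustration rather than that paper's implementation: a pooled query embedding is mapped to per-sample depthwise convolutional kernels, which are applied to the visual feature map so that the extracted visual features depend on the query. Module and variable names (`QueryConditionedConv`, `kernel_gen`) are invented for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryConditionedConv(nn.Module):
    """Illustrative sketch: generate depthwise conv kernels from a query embedding."""
    def __init__(self, channels: int, query_dim: int, kernel_size: int = 3):
        super().__init__()
        self.channels = channels
        self.kernel_size = kernel_size
        # One k x k filter per channel, predicted from the pooled query vector.
        self.kernel_gen = nn.Linear(query_dim, channels * kernel_size * kernel_size)

    def forward(self, visual_feat, query_emb):
        # visual_feat: (B, C, H, W); query_emb: (B, query_dim)
        B, C, H, W = visual_feat.shape
        kernels = self.kernel_gen(query_emb).view(B * C, 1, self.kernel_size, self.kernel_size)
        # Grouped conv applies each sample's own kernels (depthwise, per batch element).
        out = F.conv2d(
            visual_feat.reshape(1, B * C, H, W), kernels,
            padding=self.kernel_size // 2, groups=B * C,
        )
        return out.view(B, C, H, W)

# Toy usage: 2 images, 256-channel feature maps, 768-d pooled query embeddings.
feat = torch.randn(2, 256, 14, 14)
query = torch.randn(2, 768)
qcm = QueryConditionedConv(channels=256, query_dim=768)
print(qcm(feat, query).shape)  # torch.Size([2, 256, 14, 14])
```

Keeping the predicted kernels depthwise limits the number of generated parameters to C * k * k per sample; predicting full dense kernels would be another plausible design but is considerably heavier.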