On Analyzing the Role of Image for Visual-enhanced Relation Extraction
- URL: http://arxiv.org/abs/2211.07504v1
- Date: Mon, 14 Nov 2022 16:39:24 GMT
- Title: On Analyzing the Role of Image for Visual-enhanced Relation Extraction
- Authors: Lei Li, Xiang Chen, Shuofei Qiao, Feiyu Xiong, Huajun Chen, Ningyu
Zhang
- Abstract summary: In this paper, we conduct an in-depth empirical analysis indicating that inaccurate information in the visual scene graph leads to poor modal alignment weights.
We propose a strong baseline with an implicit fine-grained multimodal alignment based on Transformer for multimodal relation extraction.
- Score: 36.84650189600189
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal relation extraction is an essential task for knowledge graph
construction. In this paper, we conduct an in-depth empirical analysis which
indicates that inaccurate information in the visual scene graph leads to poor
modal alignment weights, further degrading performance. Moreover, the visual
shuffle experiments illustrate that the current approaches may not take full
advantage of visual information. Based on the above observation, we further
propose a strong baseline with an implicit fine-grained multimodal alignment
based on Transformer for multimodal relation extraction. Experimental results
demonstrate the superior performance of our method. Code is available at
https://github.com/zjunlp/DeepKE/tree/main/example/re/multimodal.
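The released code is not reproduced here; as a rough illustration of the idea, the sketch below shows one common way to realize implicit fine-grained multimodal alignment: text tokens attend to detected visual object features through multi-head cross-attention, so token-object alignment weights are learned end to end rather than taken from a noisy scene graph. Module names and dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ImplicitCrossModalAlignment(nn.Module):
    """Illustrative sketch: text tokens attend to visual object features.

    The attention weights act as soft, implicitly learned token-object
    alignments, so no explicit scene-graph alignment is required.
    """

    def __init__(self, text_dim=768, visual_dim=2048, hidden_dim=768, num_heads=8):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)  # map region features into text space
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, text_states, visual_feats):
        # text_states:  (batch, num_tokens, text_dim), e.g. BERT hidden states
        # visual_feats: (batch, num_regions, visual_dim), e.g. detector region features
        v = self.visual_proj(visual_feats)
        attended, align_weights = self.cross_attn(query=text_states, key=v, value=v)
        fused = self.norm(text_states + attended)  # residual fusion
        return fused, align_weights                # weights ~ token-object alignment

if __name__ == "__main__":
    model = ImplicitCrossModalAlignment()
    text = torch.randn(2, 32, 768)     # 32 subword tokens
    vision = torch.randn(2, 10, 2048)  # 10 detected regions
    fused, weights = model(text, vision)
    print(fused.shape, weights.shape)  # (2, 32, 768) (2, 32, 10)
```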
Related papers
- Multimodal Information Bottleneck for Deep Reinforcement Learning with Multiple Sensors [10.454194186065195]
Reinforcement learning has achieved promising results on robotic control tasks but struggles to leverage information from multiple sensors effectively.
Recent works construct auxiliary losses based on reconstruction or mutual information to extract joint representations from multiple sensory inputs.
We argue that it is helpful to compress the information that the learned joint representations retain about the raw multimodal observations.
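The summary does not give the exact objective; as background, a common way to "compress" a representation is a variational information-bottleneck term that penalizes the KL divergence between the encoded posterior and a fixed prior. The sketch below is a generic VIB-style auxiliary loss under that assumption, not the paper's formulation.

```python
import torch
import torch.nn as nn

class VIBEncoder(nn.Module):
    """Generic variational information-bottleneck encoder (illustrative only).

    The KL term encourages the joint representation z to discard information
    about the raw multimodal observation x beyond what the task needs.
    """

    def __init__(self, in_dim, z_dim):
        super().__init__()
        self.mu = nn.Linear(in_dim, z_dim)
        self.logvar = nn.Linear(in_dim, z_dim)

    def forward(self, x):
        mu, logvar = self.mu(x), self.logvar(x)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        # KL( N(mu, sigma^2) || N(0, I) ), averaged over the batch
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1).mean()
        return z, kl

# usage: total_loss = task_loss + beta * kl  (beta trades off compression vs. task signal)
```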
arXiv Detail & Related papers (2024-10-23T04:32:37Z) - Towards Robust and Accurate Visual Prompting [11.918195429308035]
We study whether a visual prompt derived from a robust model can inherit its robustness while suffering a decline in generalization performance.
We introduce a novel technique named Prompt Boundary Loose (PBL) to effectively mitigate the suboptimal standard accuracy of visual prompts.
Our findings are universal and demonstrate the significant benefits of our proposed method.
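For readers unfamiliar with visual prompting, the sketch below shows the standard padding-style setup: a learnable border added to every input of a frozen backbone. It illustrates the general technique only; the paper's PBL method is not reproduced, and the sizes are assumptions.

```python
import torch
import torch.nn as nn

class PaddingVisualPrompt(nn.Module):
    """Standard padding-style visual prompt: a learnable frame added to images.

    The backbone stays frozen and only the prompt is trained.
    (Generic setup only; the paper's PBL technique is not implemented here.)
    """

    def __init__(self, image_size=224, pad=30):
        super().__init__()
        # full-size learnable perturbation, masked so only the border is active
        self.delta = nn.Parameter(torch.zeros(3, image_size, image_size))
        mask = torch.ones(1, image_size, image_size)
        mask[:, pad:image_size - pad, pad:image_size - pad] = 0.0
        self.register_buffer("mask", mask)

    def forward(self, images):
        # images: (batch, 3, image_size, image_size), normalized as for the backbone
        return images + self.delta * self.mask

# usage: logits = frozen_model(prompt(images)); optimize only prompt.delta
```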
arXiv Detail & Related papers (2023-11-18T07:00:56Z) - Vision-Enhanced Semantic Entity Recognition in Document Images via
Visually-Asymmetric Consistency Learning [19.28860833813788]
Existing models commonly train a visual encoder with weak cross-modal supervision signals.
We propose a novel Visually-Asymmetric Consistency Learning (VANCL) approach to capture fine-grained visual and layout features.
arXiv Detail & Related papers (2023-10-23T10:37:22Z) - Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining [25.11384964373604]
We propose two pretraining approaches to contextualise visual entities in a multimodal setup.
With verbalised scene graphs, we transform visual relation triplets into structured captions, and treat them as additional image descriptions.
With masked relation prediction, we further encourage relating entities from image regions with visually masked contexts.
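As a concrete illustration of these two strategies, the snippet below verbalizes (subject, predicate, object) triplets into caption-like strings and, in the spirit of masked relation prediction, hides the predicate as a prediction target. The templates and mask token are assumptions for illustration, not the paper's exact preprocessing.

```python
# Illustrative verbalization of scene-graph triplets (templates are assumptions).
MASK_TOKEN = "[MASK]"

def verbalise_triplets(triplets):
    """Turn (subject, predicate, object) triplets into caption-like strings."""
    return [f"the {s} is {p} the {o}" for s, p, o in triplets]

def mask_relations(triplets):
    """Masked relation prediction target: hide the predicate, keep the entities."""
    inputs = [f"the {s} is {MASK_TOKEN} the {o}" for s, _, o in triplets]
    labels = [p for _, p, _ in triplets]
    return inputs, labels

triplets = [("dog", "sitting on", "couch"), ("woman", "holding", "umbrella")]
print(verbalise_triplets(triplets))
print(mask_relations(triplets))
```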
arXiv Detail & Related papers (2023-05-23T17:27:12Z) - Learnable Pillar-based Re-ranking for Image-Text Retrieval [119.9979224297237]
Image-text retrieval aims to bridge the modality gap and retrieve cross-modal content based on semantic similarities.
Re-ranking, a popular post-processing practice, has revealed the superiority of capturing neighbor relations in single-modality retrieval tasks.
We propose a novel learnable pillar-based re-ranking paradigm for image-text retrieval.
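The pillar-based formulation itself is not spelled out in this summary; as background, the sketch below shows the generic idea behind neighbor-aware re-ranking: refine the first-stage query-gallery scores with similarity patterns computed against a small set of anchor ("pillar") items. The anchor selection and score combination here are simplified assumptions, not the paper's learnable method.

```python
import torch
import torch.nn.functional as F

def rerank_with_anchors(query, gallery, num_anchors=16, alpha=0.5):
    """Generic anchor/neighbour-based re-ranking sketch (not the paper's exact method).

    query:   (d,)   embedding of the query item
    gallery: (n, d) embeddings of candidate items from the other modality
    """
    query = F.normalize(query, dim=-1)
    gallery = F.normalize(gallery, dim=-1)

    first_stage = gallery @ query                       # (n,) cosine similarities
    anchor_idx = first_stage.topk(num_anchors).indices  # use top neighbours as anchors
    anchors = gallery[anchor_idx]                       # (k, d)

    # represent query and candidates by their similarity patterns over the anchors
    q_ctx = anchors @ query                             # (k,)
    g_ctx = gallery @ anchors.T                         # (n, k)
    second_stage = F.cosine_similarity(g_ctx, q_ctx.unsqueeze(0), dim=-1)

    return alpha * first_stage + (1 - alpha) * second_stage

scores = rerank_with_anchors(torch.randn(256), torch.randn(1000, 256))
print(scores.topk(5).indices)
```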
arXiv Detail & Related papers (2023-04-25T04:33:27Z) - Good Visual Guidance Makes A Better Extractor: Hierarchical Visual
Prefix for Multimodal Entity and Relation Extraction [88.6585431949086]
We propose a novel Hierarchical Visual Prefix fusion NeTwork (HVPNeT) for visual-enhanced entity and relation extraction.
We regard the visual representation as a pluggable visual prefix that guides the textual representation toward error-insensitive forecasting decisions.
Experiments on three benchmark datasets demonstrate the effectiveness of our method, which achieves state-of-the-art performance.
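The "pluggable visual prefix" idea can be pictured as prepending projected visual features to the text token sequence so that self-attention can consult them at every position. The sketch below is a simplified single-layer version with assumed dimensions, not HVPNeT's hierarchical design.

```python
import torch
import torch.nn as nn

class VisualPrefixFusion(nn.Module):
    """Simplified visual-prefix idea: projected image features are prepended to
    the text tokens, and a Transformer layer lets the text attend to the prefix.
    (Single-layer sketch; HVPNeT's hierarchical/pyramidal prefixes are omitted.)
    """

    def __init__(self, visual_dim=2048, hidden_dim=768, prefix_len=4, num_heads=8):
        super().__init__()
        self.prefix_len = prefix_len
        self.to_prefix = nn.Linear(visual_dim, prefix_len * hidden_dim)
        self.layer = nn.TransformerEncoderLayer(hidden_dim, num_heads, batch_first=True)

    def forward(self, text_states, visual_feat):
        # text_states: (batch, seq, hidden_dim); visual_feat: (batch, visual_dim)
        b = visual_feat.size(0)
        prefix = self.to_prefix(visual_feat).view(b, self.prefix_len, -1)
        fused = self.layer(torch.cat([prefix, text_states], dim=1))
        return fused[:, self.prefix_len:]  # drop the prefix positions, keep the text
```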
arXiv Detail & Related papers (2022-05-07T02:10:55Z) - MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase
Grounding [74.33171794972688]
We present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations.
Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods.
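Without restating MAF's exact objective, the sketch below shows the basic shape of grounding by relevance scoring: normalized phrase and object embeddings are compared by dot product and each phrase is softly assigned to image regions. The projection layers and temperature are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhraseObjectRelevance(nn.Module):
    """Illustrative phrase-object relevance scoring for weakly supervised grounding."""

    def __init__(self, phrase_dim=768, object_dim=2048, joint_dim=512, temperature=0.1):
        super().__init__()
        self.phrase_proj = nn.Linear(phrase_dim, joint_dim)
        self.object_proj = nn.Linear(object_dim, joint_dim)
        self.temperature = temperature

    def forward(self, phrase_feats, object_feats):
        # phrase_feats: (batch, num_phrases, phrase_dim)
        # object_feats: (batch, num_objects, object_dim)
        p = F.normalize(self.phrase_proj(phrase_feats), dim=-1)
        o = F.normalize(self.object_proj(object_feats), dim=-1)
        relevance = p @ o.transpose(1, 2) / self.temperature  # (batch, phrases, objects)
        return relevance.softmax(dim=-1)                      # soft phrase-to-region assignment
```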
arXiv Detail & Related papers (2020-10-12T00:43:52Z) - Adaptive Context-Aware Multi-Modal Network for Depth Completion [107.15344488719322]
We propose to adopt graph propagation to capture the observed spatial contexts.
We then apply an attention mechanism to the propagation, which encourages the network to model the contextual information adaptively.
Finally, we introduce the symmetric gated fusion strategy to exploit the extracted multi-modal features effectively.
Our model, named Adaptive Context-Aware Multi-Modal Network (ACMNet), achieves the state-of-the-art performance on two benchmarks.
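The symmetric gated fusion step can be illustrated with a small module in which each modality is re-weighted by a gate computed from both branches before the features are combined. The sketch below is a generic two-branch gate with assumed dimensions, not ACMNet's exact design; the graph propagation and attention stages are omitted.

```python
import torch
import torch.nn as nn

class SymmetricGatedFusion(nn.Module):
    """Generic symmetric gated fusion of two modality feature maps (illustrative)."""

    def __init__(self, channels=64):
        super().__init__()
        # each gate looks at both modalities and re-weights one of them
        self.gate_a = nn.Sequential(nn.Conv2d(2 * channels, channels, 3, padding=1), nn.Sigmoid())
        self.gate_b = nn.Sequential(nn.Conv2d(2 * channels, channels, 3, padding=1), nn.Sigmoid())
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: (batch, channels, H, W), e.g. image and sparse-depth branches
        joint = torch.cat([feat_a, feat_b], dim=1)
        a = feat_a * self.gate_a(joint)  # modality A modulated by the joint context
        b = feat_b * self.gate_b(joint)  # modality B modulated symmetrically
        return self.fuse(torch.cat([a, b], dim=1))
```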
arXiv Detail & Related papers (2020-08-25T06:00:06Z) - Relation-Guided Representation Learning [53.60351496449232]
We propose a new representation learning method that explicitly models and leverages sample relations.
Our framework preserves the relations between samples well.
By seeking to embed samples into a subspace, we show that our method can address the large-scale and out-of-sample problems.
arXiv Detail & Related papers (2020-07-11T10:57:45Z)