Enhancing Multimodal Entity and Relation Extraction with Variational
Information Bottleneck
- URL: http://arxiv.org/abs/2304.02328v1
- Date: Wed, 5 Apr 2023 09:32:25 GMT
- Title: Enhancing Multimodal Entity and Relation Extraction with Variational
Information Bottleneck
- Authors: Shiyao Cui, Jiangxia Cao, Xin Cong, Jiawei Sheng, Quangang Li, Tingwen
Liu, Jinqiao Shi
- Abstract summary: We study multimodal named entity recognition (MNER) and multimodal relation extraction (MRE). The core of MNER and MRE lies in incorporating evident visual information to enhance textual semantics. We propose a novel method for MNER and MRE based on Multi-Modal representation learning with an Information Bottleneck (MMIB).
- Score: 12.957002659910456
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies multimodal named entity recognition (MNER) and multimodal relation extraction (MRE), which are important for multimedia social platform analysis. The core of MNER and MRE lies in incorporating evident visual information to enhance textual semantics, where two issues inherently demand investigation. The first issue is modality-noise: task-irrelevant information in each modality may act as noise that misleads the task prediction. The second issue is modality-gap: representations from different modalities are inconsistent, which prevents building semantic alignment between text and image. To address these issues, we propose a novel method for MNER and MRE based on Multi-Modal representation learning with an Information Bottleneck (MMIB). For the first issue, a refinement-regularizer applies the information-bottleneck principle to balance predictive evidence against noisy information, yielding expressive representations for prediction. For the second issue, an alignment-regularizer is proposed, in which a mutual-information-based term works in a contrastive manner to encourage consistent text-image representations. To the best of our knowledge, we are the first to explore variational IB estimation for MNER and MRE. Experiments show that MMIB achieves state-of-the-art performance on three public benchmarks.
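The abstract names two regularizers but gives no equations, so the following is only a minimal sketch of how such terms are commonly instantiated: a variational IB term with a Gaussian posterior q(z|x) regularized toward a standard-normal prior (refinement against modality-noise), and an InfoNCE-style contrastive lower bound on text-image mutual information (alignment against modality-gap). All module names, dimensions, and trade-off weights are illustrative assumptions, not the authors' implementation.

```python
# Sketch (not the authors' code) of an IB-style refinement term and an
# InfoNCE-style alignment term, as commonly used for these two objectives.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalEncoder(nn.Module):
    """Maps a modality feature to a Gaussian posterior q(z|x) and samples z."""
    def __init__(self, in_dim: int, z_dim: int):
        super().__init__()
        self.mu = nn.Linear(in_dim, z_dim)
        self.logvar = nn.Linear(in_dim, z_dim)

    def forward(self, x: torch.Tensor):
        mu, logvar = self.mu(x), self.logvar(x)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        # KL(q(z|x) || N(0, I)): compresses the representation, penalizing
        # task-irrelevant information kept in z (the modality-noise issue).
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
        return z, kl

def info_nce(text_z: torch.Tensor, image_z: torch.Tensor, tau: float = 0.07):
    """Contrastive mutual-information lower bound between paired text/image
    representations: matched pairs are positives, in-batch mismatches are
    negatives (one way to regularize the modality-gap issue)."""
    text_z = F.normalize(text_z, dim=-1)
    image_z = F.normalize(image_z, dim=-1)
    logits = text_z @ image_z.t() / tau               # (B, B) similarity matrix
    labels = torch.arange(text_z.size(0), device=text_z.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# Usage sketch: the task loss (e.g. NER tagging or relation classification)
# would be combined with the two regularizers via trade-off weights.
if __name__ == "__main__":
    text_feat, image_feat = torch.randn(8, 768), torch.randn(8, 2048)
    text_enc, image_enc = VariationalEncoder(768, 256), VariationalEncoder(2048, 256)
    t_z, t_kl = text_enc(text_feat)
    v_z, v_kl = image_enc(image_feat)
    beta, gamma = 1e-3, 0.1                           # illustrative trade-off weights
    reg_loss = beta * (t_kl + v_kl) + gamma * info_nce(t_z, v_z)
    print(reg_loss.item())
```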
Related papers
- What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian-Noise-free Text-Image Corruption and Evaluation [16.033361754660316]
Notice is the first Noise-free Text-Image Corruption and Evaluation pipeline for interpretability in Vision-Language Models (VLMs)
Our experiments on the SVO-Probes, MIT-States, and Facial Expression Recognition datasets reveal crucial insights into VLM decision-making.
This work paves the way for more transparent and interpretable multimodal systems.
arXiv Detail & Related papers (2024-06-24T05:13:19Z)
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking (MEL) task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
- Integrating Large Pre-trained Models into Multimodal Named Entity Recognition with Evidential Fusion [31.234455370113075]
We propose incorporating uncertainty estimation into the MNER task, producing trustworthy predictions.
Our proposed algorithm models the distribution of each modality as a Normal-inverse Gamma distribution, and fuses them into a unified distribution.
Experiments on two datasets demonstrate that our proposed method outperforms the baselines and achieves new state-of-the-art performance.
arXiv Detail & Related papers (2023-06-29T14:50:23Z)
- Multimodal Relation Extraction with Cross-Modal Retrieval and Synthesis [89.04041100520881]
This research proposes to retrieve textual and visual evidence based on the object, sentence, and whole image.
We develop a novel approach to synthesize the object-level, image-level, and sentence-level information for better reasoning between the same and different modalities.
arXiv Detail & Related papers (2023-05-25T15:26:13Z)
- Information Screening whilst Exploiting! Multimodal Relation Extraction with Feature Denoising and Multimodal Topic Modeling [96.75821232222201]
Existing research on multimodal relation extraction (MRE) faces two co-existing challenges, internal-information over-utilization and external-information under-exploitation.
We propose a novel framework that simultaneously implements the idea of internal-information screening and external-information exploiting.
arXiv Detail & Related papers (2023-05-19T14:56:57Z)
- Hierarchical Aligned Multimodal Learning for NER on Tweet Posts [12.632808712127291]
Multimodal named entity recognition (MNER) has attracted increasing attention.
We propose a novel approach, which can dynamically align the image and text sequence.
We conduct experiments on two open datasets, and the results and detailed analysis demonstrate the advantage of our model.
arXiv Detail & Related papers (2023-05-15T06:14:36Z)
- VERITE: A Robust Benchmark for Multimodal Misinformation Detection Accounting for Unimodal Bias [17.107961913114778]
Multimodal misinformation is a growing problem on social media platforms.
In this study, we investigate and identify the presence of unimodal bias in widely-used MMD benchmarks.
We introduce a new method -- termed Crossmodal HArd Synthetic MisAlignment (CHASMA) -- for generating realistic synthetic training data.
arXiv Detail & Related papers (2023-04-27T12:28:29Z)
- Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose a novel Multi-modal Retrieval based framework (MoRe).
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve knowledge related to the input text and image from the knowledge corpus, respectively.
arXiv Detail & Related papers (2022-12-03T13:11:32Z)
- Variational Distillation for Multi-View Learning [104.17551354374821]
We design several variational information bottlenecks to exploit two key characteristics for multi-view representation learning.
Under rigorous theoretical guarantees, our approach enables the IB to capture the intrinsic correlation between observations and semantic labels.
arXiv Detail & Related papers (2022-06-20T03:09:46Z)
- Multi-Modal Mutual Information Maximization: A Novel Approach for Unsupervised Deep Cross-Modal Hashing [73.29587731448345]
We propose a novel method, dubbed Cross-Modal Info-Max Hashing (CMIMH).
We learn informative representations that can preserve both intra- and inter-modal similarities.
The proposed method consistently outperforms other state-of-the-art cross-modal retrieval methods.
arXiv Detail & Related papers (2021-12-13T08:58:03Z)