I2SRM: Intra- and Inter-Sample Relationship Modeling for Multimodal Information Extraction
- URL: http://arxiv.org/abs/2310.06326v1
- Date: Tue, 10 Oct 2023 05:50:25 GMT
- Title: I2SRM: Intra- and Inter-Sample Relationship Modeling for Multimodal Information Extraction
- Authors: Yusheng Huang, Zhouhan Lin
- Abstract summary: We present the Intra- and Inter-Sample Relationship Modeling (I2SRM) method for this task.
Our proposed method achieves competitive results, 77.12% F1-score on Twitter-2015, 88.40% F1-score on Twitter-2017, and 84.12% F1-score on MNRE.
- Score: 10.684005956288347
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal information extraction, which requires aggregating
representations from different modalities, is attracting increasing research
attention. In this
paper, we present the Intra- and Inter-Sample Relationship Modeling (I2SRM)
method for this task, which contains two modules. Firstly, the intra-sample
relationship modeling module operates on a single sample and aims to learn
effective representations. Embeddings from textual and visual modalities are
shifted to bridge the modality gap caused by distinct pre-trained language and
image models. Secondly, the inter-sample relationship modeling module considers
relationships among multiple samples and focuses on capturing the interactions.
An AttnMixup strategy is proposed, which not only enables collaboration among
samples but also augments data to improve generalization. We conduct extensive
experiments on the multimodal named entity recognition datasets Twitter-2015
and Twitter-2017, and the multimodal relation extraction dataset MNRE. Our
proposed method I2SRM achieves competitive results, 77.12% F1-score on
Twitter-2015, 88.40% F1-score on Twitter-2017, and 84.12% F1-score on MNRE.
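To make the abstract's two modules concrete, below is a minimal, illustrative sketch (not the authors' released code): a learned per-modality shift for the intra-sample module, and an attention-guided mixup over the samples in a batch for the inter-sample AttnMixup strategy. The class names, shapes, fusion step, and the exact mixing rule are assumptions made only for illustration.

```python
# Illustrative sketch of the two ideas described in the I2SRM abstract.
# All module names, shapes, and the mixing rule are assumptions, not the
# authors' implementation.
import torch
import torch.nn as nn


class IntraSampleShift(nn.Module):
    """Learn per-modality affine shifts to narrow the gap between
    pre-trained text and image embedding spaces (assumed form)."""

    def __init__(self, dim: int):
        super().__init__()
        self.text_shift = nn.Linear(dim, dim)
        self.image_shift = nn.Linear(dim, dim)

    def forward(self, text_emb: torch.Tensor, image_emb: torch.Tensor):
        return self.text_shift(text_emb), self.image_shift(image_emb)


class AttnMixup(nn.Module):
    """Mix each sample with an attention-weighted combination of the other
    samples in the batch, so samples interact and the data is augmented."""

    def __init__(self, dim: int, alpha: float = 0.2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.alpha = alpha  # Beta-distribution parameter, as in standard mixup

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, dim); treat the batch as a length-B sequence so the
        # attention captures inter-sample interactions.
        x = feats.unsqueeze(0)                 # (1, B, dim)
        mixed, _ = self.attn(x, x, x)          # attention over samples
        lam = torch.distributions.Beta(self.alpha, self.alpha).sample()
        return lam * feats + (1 - lam) * mixed.squeeze(0)


if __name__ == "__main__":
    shift, mixup = IntraSampleShift(768), AttnMixup(768)
    t, v = torch.randn(8, 768), torch.randn(8, 768)
    t, v = shift(t, v)
    fused = mixup(t + v)                       # toy fusion for the demo only
    print(fused.shape)                         # torch.Size([8, 768])
```

In standard mixup the labels are interpolated with the same coefficient; that step is omitted here for brevity.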
Related papers
- DHRNet: A Dual-Path Hierarchical Relation Network for Multi-Person Pose Estimation [14.267849773487834]
Multi-person pose estimation (MPPE) presents a formidable yet crucial challenge in computer vision.
This paper introduces a novel CNN-based single-stage method, named Dual-path Hierarchical Relation Network (DHRNet), to extract instance-to-joint and joint-to-instance interactions concurrently.
arXiv Detail & Related papers (2024-04-22T09:41:03Z)
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking (MEL) task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
- Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications [90.6849884683226]
We study the challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data.
Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds.
We show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
arXiv Detail & Related papers (2023-06-07T15:44:53Z)
- Align and Attend: Multimodal Summarization with Dual Contrastive Losses [57.83012574678091]
The goal of multimodal summarization is to extract the most important information from different modalities to form output summaries.
Existing methods fail to leverage the temporal correspondence between different modalities and ignore the intrinsic correlation between different samples.
We introduce Align and Attend Multimodal Summarization (A2Summ), a unified multimodal transformer-based model which can effectively align and attend the multimodal input.
arXiv Detail & Related papers (2023-03-13T17:01:42Z)
- Hierarchical Cross-Modality Semantic Correlation Learning Model for Multimodal Summarization [4.714335699701277]
Multimodal summarization with multimodal output (MSMO) generates a summary with both textual and visual content.
Traditional MSMO methods handle the different modalities indiscriminately, learning a single representation for the data as a whole.
We propose a hierarchical cross-modality semantic correlation learning model (HCSCL) to learn the intra- and inter-modal correlation existing in the multimodal data.
arXiv Detail & Related papers (2021-12-16T01:46:30Z)
- Abstractive Sentence Summarization with Guidance of Selective Multimodal Reference [3.505062507621494]
We propose a Multimodal Hierarchical Selective Transformer (mhsf) model that considers reciprocal relationships among modalities.
We evaluate the generalizability of the proposed mhsf model under both pre-trained+fine-tuning and fresh training strategies.
arXiv Detail & Related papers (2021-08-11T09:59:34Z)
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
- Learning Multimodal VAEs through Mutual Supervision [72.77685889312889]
MEME combines information between modalities implicitly through mutual supervision.
We demonstrate that MEME outperforms baselines on standard metrics across both partial and complete observation schemes.
arXiv Detail & Related papers (2021-06-23T17:54:35Z)
- Self-Supervised Multimodal Domino: in Search of Biomarkers for Alzheimer's Disease [19.86082635340699]
We propose a taxonomy of all reasonable ways to organize self-supervised representation-learning algorithms.
We first evaluate models on toy multimodal MNIST datasets and then apply them to a multimodal neuroimaging dataset with Alzheimer's disease patients.
Results show that the proposed approach outperforms previous self-supervised encoder-decoder methods.
arXiv Detail & Related papers (2020-12-25T20:28:13Z)
- Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models [86.9292779620645]
We develop a contrastive framework for generative model learning, allowing us to train the model not just by the commonality between modalities, but by the distinction between "related" and "unrelated" multimodal data.
Under our proposed framework, the generative model can accurately distinguish related samples from unrelated ones, making it possible to exploit the plentiful unlabeled, unpaired multimodal data.
arXiv Detail & Related papers (2020-07-02T15:08:11Z)