Hierarchical Cross-Modality Semantic Correlation Learning Model for
Multimodal Summarization
- URL: http://arxiv.org/abs/2112.12072v1
- Date: Thu, 16 Dec 2021 01:46:30 GMT
- Title: Hierarchical Cross-Modality Semantic Correlation Learning Model for
Multimodal Summarization
- Authors: Litian Zhang, Xiaoming Zhang, Junshu Pan, Feiran Huang
- Abstract summary: Multimodal summarization with multimodal output (MSMO) generates a summary with both textual and visual content.
Traditional MSMO methods handle the different modalities indistinguishably, learning a single representation for the data as a whole.
We propose a hierarchical cross-modality semantic correlation learning model (HCSCL) to learn the intra- and inter-modal correlations present in multimodal data.
- Score: 4.714335699701277
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal summarization with multimodal output (MSMO) generates a summary
with both textual and visual content. Multimodal news reports contain
heterogeneous content, which makes MSMO nontrivial. Moreover, the different
modalities of data in a news report are observed to correlate hierarchically.
Traditional MSMO methods handle the different modalities indistinguishably,
learning a single representation for the data as a whole, which does not adapt
directly to the heterogeneous content and its hierarchical correlation. In this
paper, we propose a hierarchical cross-modality semantic correlation learning
model (HCSCL) to learn the intra- and inter-modal correlations present in
multimodal data. HCSCL adopts a graph network to encode the intra-modal
correlation. Then, a hierarchical fusion framework is proposed to learn the
hierarchical correlation between text and images. Furthermore, we construct a
new dataset with relevant image annotations and image object labels to provide
supervision for the learning procedure. Extensive experiments on the dataset
show that HCSCL significantly outperforms the baseline methods in automatic
summarization metrics and fine-grained diversity tests.
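A minimal PyTorch sketch of the two components the abstract names, a graph network that encodes intra-modal correlation and a hierarchical text-image fusion, is given below. The module names, dimensions, and attention-based fusion are illustrative assumptions, not the authors' released implementation.

# Minimal sketch of the two ideas in the abstract: a graph encoder for
# intra-modal correlation and a two-level (hierarchical) text-image fusion.
# Module names, dimensions, and the attention-based fusion are assumptions.
import torch
import torch.nn as nn


class GraphEncoder(nn.Module):
    """Encodes intra-modal correlation: nodes are word or object features,
    edges come in as a row-normalized adjacency matrix; one round of
    message passing followed by a gated update."""

    def __init__(self, dim: int):
        super().__init__()
        self.message = nn.Linear(dim, dim)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # nodes: (N, dim), adj: (N, N)
        msgs = adj @ self.message(nodes)   # aggregate messages from neighbors
        return self.update(msgs, nodes)    # update node states


class HierarchicalFusion(nn.Module):
    """Fuses text and images at a low (word/object) level first, then at a
    high (sentence/image) level, via cross-attention at each level."""

    def __init__(self, dim: int):
        super().__init__()
        self.low = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.high = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, words, objects, sentences, images):
        fused_words, _ = self.low(words, objects, objects)     # words attend to objects
        fused_sents, _ = self.high(sentences, images, images)  # sentences attend to images
        return fused_words, fused_sents


if __name__ == "__main__":
    dim = 256
    objects = GraphEncoder(dim)(torch.randn(10, dim),
                                torch.softmax(torch.randn(10, 10), dim=-1))
    fusion = HierarchicalFusion(dim)
    fused_w, fused_s = fusion(torch.randn(1, 20, dim), objects.unsqueeze(0),
                              torch.randn(1, 5, dim), torch.randn(1, 3, dim))
    print(fused_w.shape, fused_s.shape)  # (1, 20, 256) and (1, 5, 256)

The low-level step mirrors word-object correlation and the high-level step mirrors sentence-image correlation, which is one plausible reading of the hierarchical fusion described in the abstract.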
Related papers
- Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization [49.08348604716746]
Multimodal Summarization with Multimodal Output (MSMO) aims to produce a multimodal summary that integrates both text and relevant images.
In this paper, we propose an Entity-Guided Multimodal Summarization model (EGMS).
Our model, building on BART, utilizes dual multimodal encoders with shared weights to process text-image and entity-image information concurrently.
arXiv Detail & Related papers (2024-08-06T12:45:56Z)
- I2SRM: Intra- and Inter-Sample Relationship Modeling for Multimodal Information Extraction [10.684005956288347]
We present the Intra- and Inter-Sample Relationship Modeling (I2SRM) method for this task.
Our proposed method achieves competitive results: a 77.12% F1-score on Twitter-2015, 88.40% on Twitter-2017, and 84.12% on MNRE.
arXiv Detail & Related papers (2023-10-10T05:50:25Z)
- Preserving Modality Structure Improves Multi-Modal Learning [64.10085674834252]
Self-supervised learning on large-scale multi-modal datasets allows learning semantically meaningful embeddings without relying on human annotations.
These methods often struggle to generalize well on out-of-domain data as they ignore the semantic structure present in modality-specific embeddings.
We propose a novel Semantic-Structure-Preserving Consistency approach to improve generalizability by preserving the modality-specific relationships in the joint embedding space.
arXiv Detail & Related papers (2023-08-24T20:46:48Z)
- Knowledge-Enhanced Hierarchical Information Correlation Learning for Multi-Modal Rumor Detection [82.94413676131545]
We propose a novel knowledge-enhanced hierarchical information correlation learning approach (KhiCL) for multi-modal rumor detection.
KhiCL exploits a cross-modal joint dictionary to transfer heterogeneous unimodal features into a common feature space.
It extracts visual and textual entities from images and text, and designs a knowledge relevance reasoning strategy.
arXiv Detail & Related papers (2023-06-28T06:08:20Z)
- CISum: Learning Cross-modality Interaction to Enhance Multimodal Semantic Coverage for Multimodal Summarization [2.461695698601437]
This paper proposes a multi-task cross-modality learning framework (CISum) to improve multimodal semantic coverage.
To obtain the visual semantics, we translate images into visual descriptions based on the correlation with text content.
Then, the visual description and text content are fused to generate the textual summary to capture the semantics of the multimodal content.
arXiv Detail & Related papers (2023-02-20T11:57:23Z)
- Multi-modal Contrastive Representation Learning for Entity Alignment [57.92705405276161]
Multi-modal entity alignment aims to identify equivalent entities between two different multi-modal knowledge graphs.
We propose MCLEA, a Multi-modal Contrastive Learning based Entity Alignment model.
In particular, MCLEA first learns individual representations from each modality and then performs contrastive learning to jointly model intra-modal and inter-modal interactions (a generic sketch of this kind of objective appears after this list).
arXiv Detail & Related papers (2022-09-02T08:59:57Z)
- VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification [3.7798600249187295]
Multimodal learning from document data has achieved great success lately, as it allows semantically meaningful features to be pre-trained as a prior for learnable downstream tasks.
In this paper, we approach the document classification problem by learning cross-modal representations through language and vision cues.
The proposed method exploits high-level interactions and learns relevant semantic information from effective attention flows within and across modalities.
arXiv Detail & Related papers (2022-05-24T12:28:12Z)
- Support-set based Multi-modal Representation Enhancement for Video Captioning [121.70886789958799]
We propose a Support-set based Multi-modal Representation Enhancement (SMRE) model to mine rich information in a semantic subspace shared between samples.
Specifically, we propose a Support-set Construction (SC) module to construct a support-set to learn underlying connections between samples and obtain semantic-related visual elements.
During this process, we design a Semantic Space Transformation (SST) module to constrain relative distance and administrate multi-modal interactions in a self-supervised way.
arXiv Detail & Related papers (2022-05-19T03:40:29Z)
- Learning Multimodal VAEs through Mutual Supervision [72.77685889312889]
MEME combines information between modalities implicitly through mutual supervision.
We demonstrate that MEME outperforms baselines on standard metrics across both partial and complete observation schemes.
arXiv Detail & Related papers (2021-06-23T17:54:35Z)
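The MCLEA entry above describes learning per-modality representations and then contrasting them both within and across modalities; a generic sketch of such an InfoNCE-style objective follows. The temperature, loss weighting, and function names are illustrative assumptions and are not taken from the MCLEA paper.

# Generic intra-/inter-modal contrastive objective in the spirit of the MCLEA
# entry above. Temperature and weighting are assumptions, not the paper's values.
import torch
import torch.nn.functional as F


def info_nce(anchor: torch.Tensor, positive: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Rows of `anchor` and `positive` are aligned pairs; every other row in
    the batch acts as a negative."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature               # (B, B) similarities
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)


def multimodal_contrastive_loss(text_a, img_a, text_b, img_b, alpha: float = 0.5):
    """For aligned entity pairs (a, b): intra-modal terms pull text_a/text_b and
    img_a/img_b together; the inter-modal term pulls text_a/img_a together."""
    intra = info_nce(text_a, text_b) + info_nce(img_a, img_b)
    inter = info_nce(text_a, img_a)
    return alpha * intra + (1 - alpha) * inter


if __name__ == "__main__":
    loss = multimodal_contrastive_loss(*(torch.randn(8, 128) for _ in range(4)))
    print(float(loss))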