Multi-source Semantic Graph-based Multimodal Sarcasm Explanation
Generation
- URL: http://arxiv.org/abs/2306.16650v1
- Date: Thu, 29 Jun 2023 03:26:10 GMT
- Title: Multi-source Semantic Graph-based Multimodal Sarcasm Explanation
Generation
- Authors: Liqiang Jing, Xuemeng Song, Kun Ouyang, Mengzhao Jia, Liqiang Nie
- Abstract summary: We propose a novel mulTi-source sEmantic grAph-based Multimodal sarcasm explanation scheme, named TEAM.
TEAM extracts the object-level semantic meta-data instead of the traditional global visual features from the input image.
TEAM introduces a multi-source semantic graph that comprehensively characterizes the multi-source semantic relations.
- Score: 53.97962603641629
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Sarcasm Explanation (MuSE) is a new yet challenging task, which
aims to generate a natural language sentence for a multimodal social post (an
image as well as its caption) to explain why it contains sarcasm. Although the
pioneering study on this task has achieved great success with the BART backbone, it
overlooks the gap between the visual feature space and the decoder semantic
space, the object-level metadata of the image, as well as the potential
external knowledge. To address these limitations, in this work, we propose a
novel mulTi-source sEmantic grAph-based Multimodal sarcasm explanation scheme,
named TEAM. In particular, TEAM extracts the object-level semantic meta-data
instead of the traditional global visual features from the input image.
Meanwhile, TEAM resorts to ConceptNet to obtain the external related knowledge
concepts for the input text and the extracted object meta-data. Thereafter,
TEAM introduces a multi-source semantic graph that comprehensively
characterizes the multi-source (i.e., caption, object meta-data, external
knowledge) semantic relations to facilitate sarcasm reasoning. Extensive
experiments on the publicly released MORE dataset verify the superiority of
our model over cutting-edge methods.
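To make the method description concrete, below is a minimal, hypothetical sketch (not the authors' released code) of how the three sources named in the abstract, i.e., caption tokens, object-level meta-data, and ConceptNet knowledge concepts, could be assembled into a single multi-source semantic graph. The object meta-data format, the node/edge typing, and all helper names are assumptions made for illustration; only the public ConceptNet query endpoint is an existing API.

```python
# Hypothetical sketch of multi-source semantic graph construction.
# Assumptions: caption tokens and object (label, attribute) pairs are
# already extracted by upstream tools; node/edge types are illustrative.
import networkx as nx
import requests


def conceptnet_neighbors(term, limit=5):
    """Fetch related concepts for a lowercase English term from the public
    ConceptNet API (http://api.conceptnet.io)."""
    url = f"http://api.conceptnet.io/query?node=/c/en/{term}&limit={limit}"
    edges = requests.get(url, timeout=10).json().get("edges", [])
    related = set()
    for edge in edges:
        for side in ("start", "end"):
            label = edge[side]["label"].lower()
            if label != term:
                related.add(label)
    return related


def build_multi_source_graph(caption_tokens, object_metadata):
    """caption_tokens: list of caption words.
    object_metadata: list of (object_label, attribute) pairs, assumed to
    come from an object detector / attribute predictor."""
    g = nx.Graph()

    # 1. Caption nodes, linked sequentially to retain word-order information.
    for tok in caption_tokens:
        g.add_node(tok, source="caption")
    for a, b in zip(caption_tokens, caption_tokens[1:]):
        g.add_edge(a, b, relation="adjacent")

    # 2. Object-level meta-data nodes: object labels and their attributes.
    for label, attr in object_metadata:
        g.add_node(label, source="object")
        g.add_node(attr, source="object")
        g.add_edge(label, attr, relation="has_attribute")

    # 3. External knowledge nodes: ConceptNet concepts retrieved for every
    #    caption token and object label, linked back to the query term.
    for term in set(caption_tokens) | {label for label, _ in object_metadata}:
        for concept in conceptnet_neighbors(term):
            g.add_node(concept, source="external")
            g.add_edge(term, concept, relation="conceptnet")

    # Note: identical surface forms from different sources collapse into the
    # same node, which implicitly ties the caption, object, and knowledge
    # sources together.
    return g


if __name__ == "__main__":
    graph = build_multi_source_graph(
        ["what", "a", "lovely", "traffic", "jam"],
        [("car", "crowded"), ("road", "blocked")],
    )
    print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")
```

The abstract does not specify how the resulting graph is encoded for explanation generation, so the sketch stops at graph construction.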
Related papers
- Semantic Alignment for Multimodal Large Language Models [72.10272479476161]
We introduce Semantic Alignment for Multi-modal large language models (SAM).
By introducing bidirectional semantic guidance between different images during visual-token extraction, SAM aims to better preserve the linking information needed for coherent analysis.
arXiv Detail & Related papers (2024-08-23T06:48:46Z) - ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a novel multimodal data augmentation method based on knowledge-guided manipulation of visual attributes.
As a multimodal data generation framework, it extracts knowledge-grounded attributes from symbolic KBs to generate semantically consistent yet distinctive image-text pairs.
This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z) - Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization [49.08348604716746]
Multimodal Summarization with Multimodal Output (MSMO) aims to produce a multimodal summary that integrates both text and relevant images.
In this paper, we propose an Entity-Guided Multimodal Summarization model (EGMS).
Our model, building on BART, utilizes dual multimodal encoders with shared weights to process text-image and entity-image information concurrently.
arXiv Detail & Related papers (2024-08-06T12:45:56Z) - Information Screening whilst Exploiting! Multimodal Relation Extraction
with Feature Denoising and Multimodal Topic Modeling [96.75821232222201]
Existing research on multimodal relation extraction (MRE) faces two co-existing challenges: internal-information over-utilization and external-information under-exploitation.
We propose a novel framework that simultaneously implements the ideas of internal-information screening and external-information exploitation.
arXiv Detail & Related papers (2023-05-19T14:56:57Z) - CISum: Learning Cross-modality Interaction to Enhance Multimodal
Semantic Coverage for Multimodal Summarization [2.461695698601437]
This paper proposes a multi-task cross-modality learning framework (CISum) to improve multimodal semantic coverage.
To obtain the visual semantics, we translate images into visual descriptions based on the correlation with text content.
Then, the visual description and text content are fused to generate the textual summary to capture the semantics of the multimodal content.
arXiv Detail & Related papers (2023-02-20T11:57:23Z) - UniMS: A Unified Framework for Multimodal Summarization with Knowledge
Distillation [43.15662489492694]
We propose UniMS, a Unified framework for Multimodal Summarization grounded on BART.
We adopt knowledge distillation from a vision-language pretrained model to improve image selection.
Our best model achieves a new state-of-the-art result on a large-scale benchmark dataset.
arXiv Detail & Related papers (2021-09-13T09:36:04Z) - FiLMing Multimodal Sarcasm Detection with Attention [0.7340017786387767]
Sarcasm detection identifies natural language expressions whose intended meaning differs from their surface meaning.
We propose a novel architecture that uses the RoBERTa model with a co-attention layer on top to incorporate context incongruity between input text and image attributes.
Our results demonstrate that our proposed model outperforms the existing state-of-the-art method by 6.14% F1 score on the public Twitter multimodal detection dataset.
arXiv Detail & Related papers (2021-08-09T06:33:29Z) - A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine
Translation [131.33610549540043]
We propose a novel graph-based multi-modal fusion encoder for NMT.
We first represent the input sentence and image using a unified multi-modal graph.
We then stack multiple graph-based multi-modal fusion layers that iteratively perform semantic interactions to learn node representations (a rough message-passing sketch follows this entry).
arXiv Detail & Related papers (2020-07-17T04:06:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.