Related papers: Abstractive Sentence Summarization with Guidance of Selective Multimodal Reference

Abstractive Sentence Summarization with Guidance of Selective Multimodal Reference

URL: http://arxiv.org/abs/2108.05123v1
Date: Wed, 11 Aug 2021 09:59:34 GMT
Title: Abstractive Sentence Summarization with Guidance of Selective Multimodal Reference
Authors: Zijian Zhang, Chenxi Zhang, Qinpei Zhao, Jiangfeng Li
Abstract summary: We propose a Multimodal Hierarchical Selective Transformer (mhsf) model that considers reciprocal relationships among modalities. We evaluate the generalism of proposed mhsf model with the pre-trained+fine-tuning and fresh training strategies.
Score: 3.505062507621494
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal abstractive summarization with sentence output is to generate a textual summary given a multimodal triad -- sentence, image and audio, which has been proven to improve users satisfaction and convenient our life. Existing approaches mainly focus on the enhancement of multimodal fusion, while ignoring the unalignment among multiple inputs and the emphasis of different segments in feature, which has resulted in the superfluity of multimodal interaction. To alleviate these problems, we propose a Multimodal Hierarchical Selective Transformer (mhsf) model that considers reciprocal relationships among modalities (by low-level cross-modal interaction module) and respective characteristics within single fusion feature (by high-level selective routing module). In details, it firstly aligns the inputs from different sources and then adopts a divide and conquer strategy to highlight or de-emphasize multimodal fusion representation, which can be seen as a sparsely feed-forward model - different groups of parameters will be activated facing different segments in feature. We evaluate the generalism of proposed mhsf model with the pre-trained+fine-tuning and fresh training strategies. And Further experimental results on MSMO demonstrate that our model outperforms SOTA baselines in terms of ROUGE, relevance scores and human evaluation.

Related papers

IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification [60.38841251693781]
We propose a novel framework to generate robust multi-modal object ReIDs. Our framework uses Modal Prefixes and InverseNet to integrate multi-modal information with semantic guidance from inverted text. Experiments on three multi-modal object ReID benchmarks demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2025-03-13T13:00:31Z)
Towards Bridging the Cross-modal Semantic Gap for Multi-modal Recommendation [12.306686291299146]
Multi-modal recommendation greatly enhances the performance of recommender systems. Most existing multi-modal recommendation models exploit multimedia information propagation processes to enrich item representations. We propose a novel framework to bridge the semantic gap between modalities and extract fine-grained multi-view semantic information.
arXiv Detail & Related papers (2024-07-07T15:56:03Z)
Asynchronous Multimodal Video Sequence Fusion via Learning Modality-Exclusive and -Agnostic Representations [19.731611716111566]
We propose a Multimodal fusion approach for learning modality-Exclusive and modality-Agnostic representations. We introduce a predictive self-attention module to capture reliable context dynamics within modalities. A hierarchical cross-modal attention module is designed to explore valuable element correlations among modalities. A double-discriminator strategy is presented to ensure the production of distinct representations in an adversarial manner.
arXiv Detail & Related papers (2024-07-06T04:36:48Z)
U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantics. We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features. Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
Model Composition for Multimodal Large Language Models [71.5729418523411]
We propose a new paradigm through the model composition of existing MLLMs to create a new model that retains the modal understanding capabilities of each original model. Our basic implementation, NaiveMC, demonstrates the effectiveness of this paradigm by reusing modality encoders and merging LLM parameters.
arXiv Detail & Related papers (2024-02-20T06:38:10Z)
Improving Multimodal Sentiment Analysis: Supervised Angular Margin-based Contrastive Learning for Enhanced Fusion Representation [10.44888349041063]
We introduce a framework called Supervised Angular-based Contrastive Learning for Multimodal Sentiment Analysis. This framework aims to enhance discrimination and generalizability of the multimodal representation and overcome biases in the fusion vector's modality.
arXiv Detail & Related papers (2023-12-04T02:58:19Z)
Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding. We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL. UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks. Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment. We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
A Self-Adjusting Fusion Representation Learning Model for Unaligned Text-Audio Sequences [16.38826799727453]
How to integrate relevant information of each modality to learn fusion representations has been one of the central challenges in multimodal learning. In this paper, a Self-Adjusting Fusion Representation Learning Model is proposed to learn robust crossmodal fusion representations directly from the unaligned text and audio sequences. Experiment results show that our model has significantly improved the performance of all the metrics on the unaligned text-audio sequences.
arXiv Detail & Related papers (2022-11-12T13:05:28Z)
Multimodal E-Commerce Product Classification Using Hierarchical Fusion [0.0]
The proposed method significantly outperformed the unimodal models' performance and the reported performance of similar models on our specific task. We did experiments with multiple fusing techniques and found, that the best performing technique to combine the individual embedding of the unimodal network is based on combining concatenation and averaging the feature vectors.
arXiv Detail & Related papers (2022-07-07T14:04:42Z)
Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations. Model takes two bimodal pairs as input due to known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks. We introduce a novel transformer based architecture that uses fusion' for modality fusion at multiple layers. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
arXiv Detail & Related papers (2021-06-30T22:44:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.