A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine
Translation
- URL: http://arxiv.org/abs/2007.08742v1
- Date: Fri, 17 Jul 2020 04:06:09 GMT
- Title: A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine
Translation
- Authors: Yongjing Yin, Fandong Meng, Jinsong Su, Chulun Zhou, Zhengyuan Yang,
Jie Zhou, Jiebo Luo
- Abstract summary: We propose a novel graph-based multi-modal fusion encoder for NMT.
We first represent the input sentence and image using a unified multi-modal graph.
We then stack multiple graph-based multi-modal fusion layers that iteratively perform semantic interactions to learn node representations.
- Score: 131.33610549540043
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modal neural machine translation (NMT) aims to translate source
sentences into a target language paired with images. However, dominant
multi-modal NMT models do not fully exploit fine-grained semantic
correspondences between semantic units of different modalities, which have
potential to refine multi-modal representation learning. To deal with this
issue, in this paper, we propose a novel graph-based multi-modal fusion encoder
for NMT. Specifically, we first represent the input sentence and image using a
unified multi-modal graph, which captures various semantic relationships
between multi-modal semantic units (words and visual objects). We then stack
multiple graph-based multi-modal fusion layers that iteratively perform
semantic interactions to learn node representations. Finally, these
representations provide an attention-based context vector for the decoder. We
evaluate our proposed encoder on the Multi30K datasets. Experimental results
and in-depth analysis show the superiority of our multi-modal NMT model.
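To make the encoder described in the abstract concrete, here is a minimal PyTorch sketch of one way such a model could be organized: word nodes and visual-object nodes form a unified multi-modal graph, stacked fusion layers perform intra-modal and inter-modal semantic interactions, and all node representations are exposed for the decoder's attention-based context vector. The class names, dimensions, and the use of multi-head attention to realize the graph interactions are illustrative assumptions, not the paper's released implementation.

import torch
import torch.nn as nn


class FusionLayer(nn.Module):
    """One graph-based fusion layer: intra-modal and inter-modal interactions."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.intra_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.intra_vis = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_t2v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_v2t = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, d_model))
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text, vis):
        # Intra-modal edges: words attend to words, visual objects to objects.
        t, _ = self.intra_text(text, text, text)
        v, _ = self.intra_vis(vis, vis, vis)
        # Inter-modal edges: words gather from objects and vice versa.
        t2, _ = self.cross_t2v(t, v, v)
        v2, _ = self.cross_v2t(v, t, t)
        text = self.norm(text + self.ffn(t + t2))
        vis = self.norm(vis + self.ffn(v + v2))
        return text, vis


class GraphMultimodalEncoder(nn.Module):
    """Stack of fusion layers over a unified graph of word and object nodes."""

    def __init__(self, d_model=256, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList([FusionLayer(d_model) for _ in range(n_layers)])

    def forward(self, word_nodes, object_nodes):
        for layer in self.layers:
            word_nodes, object_nodes = layer(word_nodes, object_nodes)
        # All node representations are returned so the decoder can compute
        # its attention-based context vector over them.
        return torch.cat([word_nodes, object_nodes], dim=1)


words = torch.randn(2, 10, 256)   # 10 word nodes per sentence
objects = torch.randn(2, 5, 256)  # 5 detected visual-object nodes per image
memory = GraphMultimodalEncoder()(words, objects)
print(memory.shape)               # torch.Size([2, 15, 256])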
Related papers
- MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model [49.931663904599205]
MaVEn is an innovative framework designed to enhance the capabilities of Multimodal Large Language Models (MLLMs) in multi-image reasoning.
We show that MaVEn significantly enhances MLLMs' understanding in complex multi-image scenarios, while also improving performance in single-image contexts.
arXiv Detail & Related papers (2024-08-22T11:57:16Z)
- Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation [12.455034591553506]
Multimodal Emotion Recognition in Conversation (MERC) can be applied to public opinion monitoring, intelligent dialogue robots, and other fields.
Previous work ignored the inter-modal alignment process and the intra-modal noise information before multimodal fusion.
We have developed a novel approach called Masked Graph Learning with Recurrent Alignment (MGLRA) to tackle this problem.
arXiv Detail & Related papers (2024-07-23T02:23:51Z)
- TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages [96.8603701943286]
The Tri-Modal Translation (TMT) model translates between arbitrary modalities spanning speech, image, and text.
We tokenize speech and image data into discrete tokens, which provide a unified interface across modalities.
TMT outperforms single model counterparts consistently.
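A hedged sketch of the unified-interface idea follows: each modality's discrete tokens are shifted into one shared vocabulary, so a single sequence-to-sequence model can translate between any pair of modalities. The vocabulary sizes and offsets are illustrative assumptions, not TMT's actual tokenizers or quantizers.

TEXT_VOCAB, SPEECH_CODES, IMAGE_CODES = 32000, 1024, 8192

# Offsets place each modality's local token ids into one shared id space
# (illustrative assumption; the real quantizers and sizes differ).
OFFSETS = {
    "text": 0,
    "speech": TEXT_VOCAB,
    "image": TEXT_VOCAB + SPEECH_CODES,
}

def to_unified_ids(modality, local_ids):
    """Map modality-local discrete tokens into the shared vocabulary."""
    return [OFFSETS[modality] + i for i in local_ids]

# Any source/target pair then becomes ordinary sequence-to-sequence "translation"
# over a vocabulary of TEXT_VOCAB + SPEECH_CODES + IMAGE_CODES ids.
print(to_unified_ids("speech", [3, 17]))  # [32003, 32017]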
arXiv Detail & Related papers (2024-02-25T07:46:57Z)
- Unsupervised Multimodal Language Representations using Convolutional Autoencoders [5.464072883537924]
We propose extracting unsupervised Multimodal Language representations that are universal and can be applied to different tasks.
We map the word-level aligned multimodal sequences to 2-D matrices and then use Convolutional Autoencoders to learn embeddings by combining multiple datasets.
It is also shown that our method is extremely lightweight and generalizes easily to other tasks and unseen data, with only a small performance drop and almost the same number of parameters.
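A minimal sketch of this idea, under stated assumptions: a word-level aligned multimodal sequence is treated as a 2-D matrix (time steps by concatenated text/audio/visual features) and compressed with a convolutional autoencoder whose bottleneck serves as the task-agnostic embedding. The layer sizes and the 74-dimensional feature example are assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn


class ConvAutoencoder(nn.Module):
    def __init__(self, bottleneck=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),
            nn.Linear(16 * 4 * 4, bottleneck),
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 16 * 4 * 4), nn.Unflatten(1, (16, 4, 4)),
            nn.ConvTranspose2d(16, 8, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(8, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x):
        z = self.encoder(x)                       # universal multimodal embedding
        recon = self.decoder(z)
        # Resize back to the input grid so a reconstruction loss can be applied.
        recon = nn.functional.interpolate(recon, size=x.shape[-2:])
        return recon, z


# 20 aligned word steps, 74-dim concatenated text/audio/visual features (assumed sizes).
x = torch.randn(32, 1, 20, 74)
recon, emb = ConvAutoencoder()(x)
print(recon.shape, emb.shape)  # torch.Size([32, 1, 20, 74]) torch.Size([32, 64])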
arXiv Detail & Related papers (2021-10-06T18:28:07Z)
- Multiplex Graph Networks for Multimodal Brain Network Analysis [30.195666008281915]
We propose MGNet, a simple and effective multiplex graph convolutional network (GCN) model for multimodal brain network analysis.
We conduct classification tasks on two challenging real-world datasets (HIV and Bipolar disorder).
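A hedged sketch of a multiplex GCN for graph classification: the same set of nodes (e.g., brain regions) carries one adjacency matrix per modality, a simple GCN aggregates over each graph layer, and the pooled per-modality views are concatenated for classification. The normalization, fusion, and sizes are assumptions, not MGNet's exact design.

import torch
import torch.nn as nn


def normalize_adj(a):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2."""
    a = a + torch.eye(a.size(-1))
    d = a.sum(-1).clamp(min=1e-6).pow(-0.5)
    return d.unsqueeze(-1) * a * d.unsqueeze(-2)


class MultiplexGCN(nn.Module):
    def __init__(self, in_dim, hid=32, n_modalities=2, n_classes=2):
        super().__init__()
        self.gcns = nn.ModuleList([nn.Linear(in_dim, hid) for _ in range(n_modalities)])
        self.classifier = nn.Linear(hid * n_modalities, n_classes)

    def forward(self, x, adjs):
        # x: (batch, nodes, in_dim); adjs: one (batch, nodes, nodes) graph per modality.
        views = [torch.relu(normalize_adj(a) @ gcn(x)) for a, gcn in zip(adjs, self.gcns)]
        pooled = torch.cat([v.mean(dim=1) for v in views], dim=-1)  # mean readout per view
        return self.classifier(pooled)


x = torch.randn(4, 90, 90)                              # 90 regions, connectivity profiles
adjs = [torch.rand(4, 90, 90), torch.rand(4, 90, 90)]   # two modality-specific graphs
print(MultiplexGCN(in_dim=90)(x, adjs).shape)           # torch.Size([4, 2])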
arXiv Detail & Related papers (2021-07-31T06:01:29Z)
- Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features.
The experiment results on four benchmark datasets demonstrate that the proposed approach achieves the state-of-the-art performance without any post-processing.
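A minimal sketch of a co-attention step that updates visual and linguistic features in parallel from a shared affinity matrix; the bilinear form, residual updates, and dimensions are illustrative assumptions rather than the EFN paper's exact layers.

import torch
import torch.nn as nn


class CoAttention(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.bilinear = nn.Parameter(torch.randn(d, d) * 0.02)

    def forward(self, vis, lang):
        # vis: (B, Nv, d) visual positions; lang: (B, Nl, d) word features.
        affinity = vis @ self.bilinear @ lang.transpose(1, 2)              # (B, Nv, Nl)
        vis_new = vis + affinity.softmax(dim=-1) @ lang                    # words -> regions
        lang_new = lang + affinity.softmax(dim=1).transpose(1, 2) @ vis    # regions -> words
        return vis_new, lang_new                                           # parallel update


vis, lang = torch.randn(2, 196, 128), torch.randn(2, 12, 128)
v2, l2 = CoAttention()(vis, lang)
print(v2.shape, l2.shape)  # torch.Size([2, 196, 128]) torch.Size([2, 12, 128])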
arXiv Detail & Related papers (2021-05-05T02:27:25Z)
- Dynamic Context-guided Capsule Network for Multimodal Machine Translation [131.37130887834667]
Multimodal machine translation (MMT) mainly focuses on enhancing text-only translation with visual features.
We propose a novel Dynamic Context-guided Capsule Network (DCCN) for MMT.
Experimental results on the Multi30K dataset of English-to-German and English-to-French translation demonstrate the superiority of DCCN.
arXiv Detail & Related papers (2020-09-04T06:18:24Z)
- Unsupervised Multimodal Neural Machine Translation with Pseudo Visual Pivoting [105.5303416210736]
Unsupervised machine translation (MT) has recently achieved impressive results with monolingual corpora only.
However, it is still challenging to associate source and target sentences in the latent space.
As people who speak different languages biologically share similar visual systems, achieving better alignment through visual content is promising.
arXiv Detail & Related papers (2020-05-06T20:11:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.