Dynamic Context-guided Capsule Network for Multimodal Machine
Translation
- URL: http://arxiv.org/abs/2009.02016v1
- Date: Fri, 4 Sep 2020 06:18:24 GMT
- Title: Dynamic Context-guided Capsule Network for Multimodal Machine
Translation
- Authors: Huan Lin and Fandong Meng and Jinsong Su and Yongjing Yin and
Zhengyuan Yang and Yubin Ge and Jie Zhou and Jiebo Luo
- Abstract summary: Multimodal machine translation (MMT) mainly focuses on enhancing text-only translation with visual features.
We propose a novel Dynamic Context-guided Capsule Network (DCCN) for MMT.
Experimental results on the Multi30K dataset of English-to-German and English-to-French translation demonstrate the superiority of DCCN.
- Score: 131.37130887834667
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal machine translation (MMT), which mainly focuses on enhancing
text-only translation with visual features, has attracted considerable
attention from both computer vision and natural language processing
communities. Most current MMT models resort to attention mechanism, global
context modeling or multimodal joint representation learning to utilize visual
features. However, the attention mechanism lacks sufficient semantic
interactions between modalities while the other two provide fixed visual
context, which is unsuitable for modeling the observed variability when
generating translation. To address the above issues, in this paper, we propose
a novel Dynamic Context-guided Capsule Network (DCCN) for MMT. Specifically, at
each timestep of decoding, we first employ the conventional source-target
attention to produce a timestep-specific source-side context vector. Next, DCCN
takes this vector as input and uses it to guide the iterative extraction of
related visual features via a context-guided dynamic routing mechanism.
Particularly, we represent the input image with global and regional visual
features, we introduce two parallel DCCNs to model multimodal context vectors
with visual features at different granularities. Finally, we obtain two
multimodal context vectors, which are fused and incorporated into the decoder
for the prediction of the target word. Experimental results on the Multi30K
dataset of English-to-German and English-to-French translation demonstrate the
superiority of DCCN. Our code is available at
https://github.com/DeepLearnXMU/MM-DCCN.
Related papers
- A Case Study on Context-Aware Neural Machine Translation with Multi-Task Learning [49.62044186504516]
In document-level neural machine translation (DocNMT), multi-encoder approaches are common in encoding context and source sentences.
Recent studies have shown that the context encoder generates noise, which makes the model robust to (i.e., largely insensitive to) the choice of context.
This paper further investigates this observation by explicitly modelling context encoding through multi-task learning (MTL) to make the model sensitive to the choice of context.
arXiv Detail & Related papers (2024-07-03T12:50:49Z) - Fuse & Calibrate: A bi-directional Vision-Language Guided Framework for Referring Image Segmentation [8.383431263616105]
We introduce FCNet, a framework that employs a bi-directional guided fusion approach where both vision and language play guiding roles.
Specifically, we use a vision-guided approach to conduct initial multi-modal fusion, obtaining multi-modal features that focus on key vision information.
We then propose a language-guided calibration module to further calibrate these multi-modal features, ensuring they understand the context of the input sentence.
arXiv Detail & Related papers (2024-05-18T07:21:12Z) - CLIPTrans: Transferring Visual Knowledge with Pre-trained Models for
Multimodal Machine Translation [31.911593690549633]
Multimodal machine translation (MMT) systems enhance neural machine translation (NMT) with visual knowledge.
Previous works face a challenge in training powerful MMT models from scratch due to the scarcity of annotated multilingual vision-language data.
We propose CLIPTrans, which simply adapts the independently pre-trained multimodal M-CLIP and the multilingual mBART.
arXiv Detail & Related papers (2023-08-29T11:29:43Z) - Improving Cross-modal Alignment for Text-Guided Image Inpainting [36.1319565907582]
Text-guided image inpainting (TGII) aims to restore the missing regions of a damaged image based on a given text.
We propose a novel model for TGII by improving cross-modal alignment.
Our model achieves state-of-the-art performance compared with other strong competitors.
arXiv Detail & Related papers (2023-01-26T19:18:27Z) - Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation [87.49579477873196]
We first design a two-stream encoder to extract CNN-based visual features and transformer-based linguistic features hierarchically.
A vision-language mutual guidance (VLMG) module is inserted into the encoder multiple times to promote the hierarchical and progressive fusion of multi-modal features.
In order to promote the temporal alignment between frames, we propose a language-guided multi-scale dynamic filtering (LMDF) module.
arXiv Detail & Related papers (2022-03-30T01:06:13Z) - Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular
Vision-Language Pre-training [120.91411454661741]
We present a pre-trainable Universal-DEcoder Network (Uni-EDEN) to facilitate both vision-language perception and generation.
Uni-EDEN is a two-stream Transformer-based structure consisting of three modules: object and sentence encoders that separately learn the representations of each modality, and a sentence decoder that enables multi-modal reasoning and sentence generation via inter-modal interaction.
arXiv Detail & Related papers (2022-01-11T16:15:07Z) - Encoder Fusion Network with Co-Attention Embedding for Referring Image
Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features.
Experimental results on four benchmark datasets demonstrate that the proposed approach achieves state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-05-05T02:27:25Z) - A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine
Translation [131.33610549540043]
We propose a novel graph-based multi-modal fusion encoder for NMT.
We first represent the input sentence and image using a unified multi-modal graph.
We then stack multiple graph-based multi-modal fusion layers that iteratively perform semantic interactions to learn node representations.
arXiv Detail & Related papers (2020-07-17T04:06:09Z)