Multimodal Neural Machine Translation with Search Engine Based Image
Retrieval
- URL: http://arxiv.org/abs/2208.00767v1
- Date: Tue, 26 Jul 2022 08:42:06 GMT
- Title: Multimodal Neural Machine Translation with Search Engine Based Image
Retrieval
- Authors: ZhenHao Tang, XiaoBing Zhang, Zi Long, XiangHua Fu
- Abstract summary: We propose an open-vocabulary image retrieval method to collect descriptive images for a bilingual parallel corpus.
Our proposed method achieves significant improvements over strong baselines.
- Score: 4.662583832063716
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, a number of works have shown that the performance of neural
machine translation (NMT) can be improved to a certain extent by using visual
information. However, most of these conclusions are drawn from the analysis of
experimental results based on a limited set of bilingual sentence-image pairs,
such as Multi30K. In these datasets, the content of each bilingual parallel
sentence pair must be well represented by a manually annotated image, which
differs from the actual translation scenario. Some previous works have addressed
this problem by retrieving images from existing sentence-image pairs with a
topic model. However, because of the limited collection of sentence-image pairs
they use, their image retrieval methods struggle with out-of-vocabulary words
and can hardly prove that the visual information enhances NMT rather than the
mere co-occurrence of images and sentences. In this paper, we propose an
open-vocabulary image retrieval method that collects descriptive images for a
bilingual parallel corpus using an image search engine. We then propose a
text-aware attentive visual encoder to filter out incorrectly collected noise
images. Experimental results on Multi30K and two other translation datasets show
that our proposed method achieves significant improvements over strong baselines.
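The two components described in the abstract lend themselves to a compact illustration. The sketch below is a hypothetical reading of the pipeline, not the authors' released code: `search_image_features` is a placeholder for whatever search-engine API and CNN feature extractor are actually used, and the layer sizes and mean-pooled text query in the attentive encoder are illustrative assumptions.

```python
# Hedged sketch of the two stages: (1) collect candidate images for a source
# sentence via an image search engine, (2) let a text-aware attentive visual
# encoder down-weight retrieved images that do not match the sentence.
import torch
import torch.nn as nn


def search_image_features(sentence: str, k: int = 5) -> torch.Tensor:
    """Placeholder: query an image search engine with the sentence, download
    the top-k hits, and run them through a CNN; returns (k, img_dim) features."""
    raise NotImplementedError("wire this to a real search engine + image encoder")


class TextAwareVisualEncoder(nn.Module):
    """Re-weights retrieved image features by their relevance to the text,
    so noisy search-engine results contribute little to the visual context."""

    def __init__(self, text_dim: int = 512, img_dim: int = 2048, hidden: int = 512):
        super().__init__()
        self.txt_proj = nn.Linear(text_dim, hidden)
        self.img_proj = nn.Linear(img_dim, hidden)
        self.scale = hidden ** -0.5

    def forward(self, text_states: torch.Tensor, img_feats: torch.Tensor):
        # text_states: (batch, src_len, text_dim) NMT encoder outputs
        # img_feats:   (batch, n_imgs, img_dim) features of retrieved images
        query = self.txt_proj(text_states.mean(dim=1, keepdim=True))   # (batch, 1, hidden)
        keys = self.img_proj(img_feats)                                # (batch, n_imgs, hidden)
        scores = torch.bmm(query, keys.transpose(1, 2)) * self.scale   # (batch, 1, n_imgs)
        weights = torch.softmax(scores, dim=-1)                        # low weight = filtered image
        visual_context = torch.bmm(weights, keys)                      # (batch, 1, hidden)
        return visual_context, weights
```

The resulting `visual_context` vector would then be fused with the NMT decoder (for example as an extra cross-attention input), while `weights` makes it possible to inspect which retrieved images were effectively filtered out.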
Related papers
- Towards Better Multi-modal Keyphrase Generation via Visual Entity
Enhancement and Multi-granularity Image Noise Filtering [79.44443231700201]
Multi-modal keyphrase generation aims to produce a set of keyphrases that represent the core points of the input text-image pair.
The input text and image are often not perfectly matched, and thus the image may introduce noise into the model.
We propose a novel multi-modal keyphrase generation model, which not only enriches the model input with external knowledge, but also effectively filters image noise.
arXiv Detail & Related papers (2023-09-09T09:41:36Z)
- Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z)
- Generalization algorithm of multimodal pre-training model based on graph-text self-supervised training [0.0]
A multimodal pre-training generalization algorithm for self-supervised training is proposed.
We show that when the filtered information is used for fine-tuning multimodal machine translation, translation quality on the Global Voices dataset is 0.5 BLEU higher than the baseline.
arXiv Detail & Related papers (2023-02-16T03:34:08Z)
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
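A minimal sketch of how such a topic-image lookup could be built and queried, assuming a precomputed corpus of sentence-image pairs (e.g. Multi30K); the whitespace tokenisation, stopword filtering, and cap on retrieved images are illustrative choices rather than details taken from that paper.

```python
# Hypothetical topic-image lookup: map content words of a sentence to images
# already associated with those topics in an existing sentence-image corpus.
from collections import defaultdict

def build_topic_image_table(sentence_image_pairs, stopwords):
    """sentence_image_pairs: iterable of (sentence, image_id), e.g. from Multi30K."""
    table = defaultdict(set)
    for sentence, image_id in sentence_image_pairs:
        for word in sentence.lower().split():
            if word not in stopwords:
                table[word].add(image_id)
    return table

def retrieve_images(sentence, table, stopwords, max_images=5):
    """Return a flexible number of candidate image ids for one source sentence."""
    candidates = []
    for word in sentence.lower().split():
        if word in stopwords:
            continue
        candidates.extend(table.get(word, ()))
    # rank images that were matched by more topic words higher
    ranked = sorted(set(candidates), key=candidates.count, reverse=True)
    return ranked[:max_images]
```

Because the table is keyed on topic words rather than whole sentences, a sentence with no exact match in the corpus can still receive loosely related images.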
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
- NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
arXiv Detail & Related papers (2022-07-26T17:34:11Z)
- Neural Machine Translation with Phrase-Level Universal Visual Representations [11.13240570688547]
We propose a phrase-level retrieval-based method for MMT to get visual information for the source input from existing sentence-image data sets.
Our method performs retrieval at the phrase level and hence learns visual information from pairs of source phrase and grounded region.
Experiments show that the proposed method significantly outperforms strong baselines on multiple MMT datasets.
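As a rough picture of phrase-level retrieval (not that paper's implementation), each source phrase is looked up in an index mapping phrases to grounded region features harvested from an existing sentence-image dataset; the index itself and the 2048-dimensional region vectors are assumptions made for illustration.

```python
import numpy as np

def phrase_level_visual_features(source_phrases, phrase_to_regions, dim=2048):
    """source_phrases: list of phrases extracted from the source sentence.
    phrase_to_regions: dict mapping a phrase to a list of region feature
    vectors (each of length `dim`) grounded in an existing sentence-image
    dataset, e.g. produced offline by an object detector."""
    feats = []
    for phrase in source_phrases:
        regions = phrase_to_regions.get(phrase)
        if regions:
            feats.append(np.mean(regions, axis=0))   # average the grounded regions
        else:
            feats.append(np.zeros(dim))              # no visual grounding found
    return np.stack(feats)  # one visual vector per source phrase
```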
arXiv Detail & Related papers (2022-03-19T11:21:13Z)
- Multi-domain Unsupervised Image-to-Image Translation with Appearance Adaptive Convolution [62.4972011636884]
We propose a novel multi-domain unsupervised image-to-image translation (MDUIT) framework.
We exploit the decomposed content feature and appearance adaptive convolution to translate an image into a target appearance.
We show that the proposed method produces visually diverse and plausible results in multiple domains compared to the state-of-the-art methods.
arXiv Detail & Related papers (2022-02-06T14:12:34Z)
- Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features [10.163477961551592]
Cross-modal retrieval is an important functionality in modern search engines.
In this paper, we focus on the image-sentence retrieval task.
We use the recently introduced TERN architecture as an image-sentence features extractor.
arXiv Detail & Related papers (2021-06-01T10:11:46Z)
- MultiSubs: A Large-scale Multimodal and Multilingual Dataset [32.48454703822847]
This paper introduces a large-scale multimodal and multilingual dataset that aims to facilitate research on grounding words to images in their contextual usage in language.
The dataset consists of images selected to unambiguously illustrate concepts expressed in sentences from movie subtitles.
We show the utility of the dataset on two automatic tasks: (i) fill-in-the-blank; (ii) lexical translation.
arXiv Detail & Related papers (2021-03-02T18:09:07Z)
- Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal, including (1) a reference image and (2) an instruction in natural language that describes desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z)