On Vision Features in Multimodal Machine Translation
- URL: http://arxiv.org/abs/2203.09173v1
- Date: Thu, 17 Mar 2022 08:51:09 GMT
- Title: On Vision Features in Multimodal Machine Translation
- Authors: Bei Li, Chuanhao Lv, Zefan Zhou, Tao Zhou, Tong Xiao, Anxiang Ma and Jingbo Zhu
- Abstract summary: We develop a selective attention model to study the patch-level contribution of an image in multimodal machine translation.
Our results also suggest the need to carefully examine MMT models, especially when current benchmarks are small-scale and biased.
- Score: 34.41229863267296
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Previous work on multimodal machine translation (MMT) has focused on how
vision features are incorporated into translation, but little attention has been paid
to the quality of the vision models themselves. In this work, we investigate the impact
of vision models on MMT. Given that Transformers are becoming popular in computer
vision, we experiment with various strong models (such as the Vision Transformer)
and enhanced features (such as object detection and image captioning). We
develop a selective attention model to study the patch-level contribution of an
image in MMT. On detailed probing tasks, we find that stronger vision models
are helpful for learning translation from the visual modality. Our results also
suggest the need to carefully examine MMT models, especially when current
benchmarks are small-scale and biased. Our code can be found at
https://github.com/libeineu/fairseq_mmt.
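The selective attention model described above can be pictured as a small fusion layer between the text encoder and the vision features: a text-to-image attention step that weights ViT patch features per source token, followed by a learned gate that controls how much visual context is mixed in. The sketch below is illustrative only; the layer names, projection choices, and tensor shapes are assumptions, and the authors' actual implementation is in the linked fairseq_mmt repository.

```python
# Minimal sketch of a selective-attention fusion layer for MMT
# (illustrative; shapes and projections are assumptions, not the paper's exact code).
import torch
import torch.nn as nn


class SelectiveAttentionFusion(nn.Module):
    """Fuse encoder text states with ViT patch features: text-to-image attention
    selects relevant patches, then a gate controls how much visual context
    enters each source-token position."""

    def __init__(self, d_text: int, d_img: int):
        super().__init__()
        self.q_proj = nn.Linear(d_text, d_text)
        self.k_proj = nn.Linear(d_img, d_text)
        self.v_proj = nn.Linear(d_img, d_text)
        self.gate = nn.Linear(2 * d_text, 1)

    def forward(self, text_states, patch_feats):
        # text_states: (batch, src_len, d_text) from the Transformer encoder
        # patch_feats: (batch, n_patches, d_img) from a vision model, e.g. a ViT
        q = self.q_proj(text_states)
        k = self.k_proj(patch_feats)
        v = self.v_proj(patch_feats)
        scores = torch.matmul(q, k.transpose(1, 2)) / q.size(-1) ** 0.5
        attn = torch.softmax(scores, dim=-1)      # patch-level attention weights
        img_context = torch.matmul(attn, v)       # (batch, src_len, d_text)
        # Gated fusion: lam in (0, 1) decides how much image context to mix in.
        lam = torch.sigmoid(self.gate(torch.cat([text_states, img_context], dim=-1)))
        return (1 - lam) * text_states + lam * img_context


# Toy usage: 4 sentences of length 20, 49 ViT patch features per image.
fusion = SelectiveAttentionFusion(d_text=512, d_img=768)
out = fusion(torch.randn(4, 20, 512), torch.randn(4, 49, 768))
print(out.shape)  # torch.Size([4, 20, 512])
```

In a probe of this kind, the patch-level attention weights (`attn`) are the quantity one would inspect to study how much each image patch contributes to each source token.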
Related papers
- VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks [60.22144823791902]
We unveil a LLaMA-like vision transformer in plain and pyramid forms, termed VisionLLaMA, which is tailored for vision tasks.
VisionLLaMA is a unified and generic modelling framework for solving most vision tasks.
arXiv Detail & Related papers (2024-03-01T13:30:51Z)
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks [92.03764152132315]
We design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters.
This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks.
It has powerful visual capabilities and can be a good alternative to the ViT-22B.
arXiv Detail & Related papers (2023-12-21T18:59:31Z)
- Vision Language Transformers: A Survey [0.9137554315375919]
Vision language tasks, such as answering questions about or generating captions that describe an image, are difficult tasks for computers to perform.
Recent research has adapted the pretrained transformer architecture introduced by Vaswani et al. (2017) to vision language modeling.
Transformer models have greatly improved performance and versatility over previous vision language models.
arXiv Detail & Related papers (2023-07-06T19:08:56Z)
- Scene Graph as Pivoting: Inference-time Image-free Unsupervised Multimodal Machine Translation with Visual Scene Hallucination [88.74459704391214]
In this work, we investigate a more realistic unsupervised multimodal machine translation (UMMT) setup.
We represent the input images and texts with the visual and language scene graphs (SG), where such fine-grained vision-language features ensure a holistic understanding of the semantics.
Several SG-pivoting based learning objectives are introduced for unsupervised translation training.
Our method outperforms the best-performing baseline by a significant BLEU margin on this task and setup.
arXiv Detail & Related papers (2023-05-20T18:17:20Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks [87.6494641931349]
We introduce a general-purpose multimodal foundation model BEiT-3.
It achieves state-of-the-art transfer performance on both vision and vision-language tasks.
arXiv Detail & Related papers (2022-08-22T16:55:04Z)
- Vision Matters When It Should: Sanity Checking Multimodal Machine Translation Models [25.920891392933058]
Multimodal machine translation (MMT) systems have been shown to outperform their text-only neural machine translation (NMT) counterparts when visual context is available.
Recent studies have also shown that the performance of MMT models is only marginally impacted when the associated image is replaced with an unrelated image or noise (a minimal version of this check is sketched after this list).
arXiv Detail & Related papers (2021-09-08T03:32:48Z)
- Unsupervised Multimodal Neural Machine Translation with Pseudo Visual Pivoting [105.5303416210736]
Unsupervised machine translation (MT) has recently achieved impressive results with monolingual corpora only.
However, it remains challenging to associate source and target sentences in the latent space.
Since people who speak different languages biologically share similar visual systems, visual content offers a promising path to better alignment.
arXiv Detail & Related papers (2020-05-06T20:11:46Z)
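The sanity check described in the "Vision Matters When It Should" entry above can be sketched as a simple probe: score the same sentence pairs with the true image features, with features from an unrelated image, and with pure noise, then compare. The `model.score` interface below is a hypothetical stand-in, not the paper's API.

```python
# Illustrative visual-sensitivity probe for an MMT model.
# `model.score` (returning a per-batch translation score, e.g. log-probability)
# is an assumed interface for the sake of the sketch.
import torch


@torch.no_grad()
def visual_sensitivity(model, src_tokens, tgt_tokens, image_feats):
    """Return scores with true, unrelated (shuffled), and noise image features."""
    true_score = model.score(src_tokens, tgt_tokens, image_feats)

    # Unrelated images: shuffle the batch so each sentence pairs with another image.
    perm = torch.randperm(image_feats.size(0))
    unrelated_score = model.score(src_tokens, tgt_tokens, image_feats[perm])

    # Pure noise with the same shape as the real features.
    noise_score = model.score(src_tokens, tgt_tokens, torch.randn_like(image_feats))

    return {"true": true_score, "unrelated": unrelated_score, "noise": noise_score}
```

If the three scores are nearly identical, the model is most likely ignoring the visual modality.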