VALHALLA: Visual Hallucination for Machine Translation
- URL: http://arxiv.org/abs/2206.00100v1
- Date: Tue, 31 May 2022 20:25:15 GMT
- Title: VALHALLA: Visual Hallucination for Machine Translation
- Authors: Yi Li, Rameswar Panda, Yoon Kim, Chun-Fu (Richard) Chen, Rogerio
Feris, David Cox, Nuno Vasconcelos
- Abstract summary: We introduce a visual hallucination framework, called VALHALLA.
It requires only source sentences at inference time and instead uses hallucinated visual representations for multimodal machine translation.
In particular, given a source sentence, an autoregressive hallucination transformer is used to predict a discrete visual representation from the input text.
- Score: 64.86515924691899
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Designing better machine translation systems by considering auxiliary inputs
such as images has attracted much attention in recent years. While existing
methods show promising performance over the conventional text-only translation
systems, they typically require paired text and image as input during
inference, which limits their applicability to real-world scenarios. In this
paper, we introduce a visual hallucination framework, called VALHALLA, which
requires only source sentences at inference time and instead uses hallucinated
visual representations for multimodal machine translation. In particular, given
a source sentence, an autoregressive hallucination transformer is used to
predict a discrete visual representation from the input text, and the combined
text and hallucinated representations are utilized to obtain the target
translation. We train the hallucination transformer jointly with the
translation transformer using standard backpropagation with cross-entropy
losses while being guided by an additional loss that encourages consistency
between predictions using either ground-truth or hallucinated visual
representations. Extensive experiments on three standard translation datasets
with a diverse set of language pairs demonstrate the effectiveness of our
approach over both text-only baselines and state-of-the-art methods. Project
page: http://www.svcl.ucsd.edu/projects/valhalla.
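The abstract describes the training recipe at a high level: a hallucination transformer learns to predict discrete visual tokens from the source text, the translation transformer is trained with cross-entropy on the target, and a consistency term couples predictions made from ground-truth versus hallucinated visual inputs. Below is a minimal PyTorch-style sketch of one way that recipe could be wired together; the module names and sizes, the token-fusion scheme, the exact mix of loss terms, and the KL form of the consistency loss are illustrative assumptions rather than the paper's actual design.
```python
# Minimal sketch of VALHALLA-style joint training (all sizes/names are assumptions):
# a hallucination transformer predicts discrete visual tokens from the source text,
# a translation transformer consumes text plus (ground-truth or hallucinated) visual
# tokens, and a consistency term ties the two translation predictions together.
import torch
import torch.nn as nn
import torch.nn.functional as F

SRC_VOCAB, TGT_VOCAB, VIS_VOCAB, D = 1000, 1000, 512, 256

class TinyTransformer(nn.Module):
    """Encoder-decoder stand-in used for both hallucination and translation."""
    def __init__(self, src_vocab, tgt_vocab, d=D):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d)
        self.tgt_emb = nn.Embedding(tgt_vocab, d)
        self.core = nn.Transformer(d_model=d, nhead=4,
                                    num_encoder_layers=2, num_decoder_layers=2,
                                    batch_first=True)
        self.out = nn.Linear(d, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Causal/padding masks omitted for brevity.
        h = self.core(self.src_emb(src_ids), self.tgt_emb(tgt_ids))
        return self.out(h)  # (B, T, tgt_vocab) logits

# Hallucination transformer: source text -> discrete visual tokens (autoregressive).
hallucinator = TinyTransformer(SRC_VOCAB, VIS_VOCAB)
# Translation transformer: [source text ; visual tokens] -> target text.
translator = TinyTransformer(SRC_VOCAB + VIS_VOCAB, TGT_VOCAB)

def translate_logits(src_ids, vis_ids, tgt_in):
    # Concatenate text and (offset) visual tokens into one encoder stream --
    # a simplification of however the paper actually fuses the two modalities.
    fused = torch.cat([src_ids, vis_ids + SRC_VOCAB], dim=1)
    return translator(fused, tgt_in)

def training_step(src_ids, tgt_ids, gt_vis_ids, alpha=1.0):
    tgt_in, tgt_out = tgt_ids[:, :-1], tgt_ids[:, 1:]

    # 1) Cross-entropy for hallucinating the ground-truth visual tokens.
    vis_logits = hallucinator(src_ids, gt_vis_ids[:, :-1])
    loss_hall = F.cross_entropy(vis_logits.reshape(-1, VIS_VOCAB),
                                gt_vis_ids[:, 1:].reshape(-1))

    # 2) Translation with ground-truth visual tokens (training-only path).
    logits_gt = translate_logits(src_ids, gt_vis_ids, tgt_in)
    # 3) Translation with hallucinated visual tokens (inference-time path).
    hall_vis = vis_logits.argmax(-1)
    logits_hall = translate_logits(src_ids, hall_vis, tgt_in)

    loss_mt = (F.cross_entropy(logits_gt.reshape(-1, TGT_VOCAB), tgt_out.reshape(-1)) +
               F.cross_entropy(logits_hall.reshape(-1, TGT_VOCAB), tgt_out.reshape(-1)))

    # 4) Consistency between the two translation distributions (assumed KL form).
    loss_cons = F.kl_div(F.log_softmax(logits_hall, -1),
                         F.softmax(logits_gt, -1).detach(), reduction="batchmean")
    return loss_hall + loss_mt + alpha * loss_cons

# Toy batch: random ids standing in for tokenized sentences and visual codes.
B, S, T, V = 2, 7, 6, 5
loss = training_step(torch.randint(0, SRC_VOCAB, (B, S)),
                     torch.randint(0, TGT_VOCAB, (B, T)),
                     torch.randint(0, VIS_VOCAB, (B, V)))
loss.backward()
```
Consistent with the abstract, only the source sentence is needed at inference time: the hallucinator generates visual tokens from the text and the translator decodes from the fused stream.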
Related papers
- VaLiD: Mitigating the Hallucination of Large Vision Language Models by Visual Layer Fusion Contrastive Decoding [38.23310445372371]
Large Vision-Language Models (LVLMs) have demonstrated outstanding performance in multimodal task reasoning.
We propose a novel hallucination-mitigation method from the visual encoding perspective: Visual Layer Fusion Contrastive Decoding (VaLiD). A generic sketch of the underlying contrastive-decoding idea appears after this list.
arXiv Detail & Related papers (2024-11-24T13:42:02Z) - Exploring the Necessity of Visual Modality in Multimodal Machine Translation using Authentic Datasets [3.54128607634285]
We study the impact of the visual modality on translation efficacy by leveraging real-world translation datasets.
We find that the visual modality proves advantageous for the majority of authentic translation datasets.
Our results suggest that visual information serves a supplementary role in multimodal translation and can be substituted.
arXiv Detail & Related papers (2024-04-09T08:19:10Z) - Hallucination Augmented Contrastive Learning for Multimodal Large
Language Model [53.65682783591723]
Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks.
However, MLLMs still face a fundamental limitation of hallucinations, where they tend to generate erroneous or fabricated information.
In this paper, we address hallucinations in MLLMs from a novel perspective of representation learning.
arXiv Detail & Related papers (2023-12-12T04:05:15Z) - Scene Graph as Pivoting: Inference-time Image-free Unsupervised
Multimodal Machine Translation with Visual Scene Hallucination [88.74459704391214]
In this work, we investigate a more realistic unsupervised multimodal machine translation (UMMT) setup.
We represent the input images and texts with the visual and language scene graphs (SG), where such fine-grained vision-language features ensure a holistic understanding of the semantics.
Several SG-pivoting based learning objectives are introduced for unsupervised translation training.
Our method outperforms the best-performing baseline by a significant BLEU margin on this task and setup.
arXiv Detail & Related papers (2023-05-20T18:17:20Z) - Augmented Transformers with Adaptive n-grams Embedding for Multilingual
Scene Text Recognition [10.130342722193204]
This paper proposes an augmented transformer architecture with n-gram embeddings and cross-language rectification (TANGER).
TANGER consists of a primary transformer with single patch embeddings of visual images, and a supplementary transformer with adaptive n-grams embeddings.
Cross-language rectification is achieved with a loss function that takes into account both language identification and contextual coherence scoring.
arXiv Detail & Related papers (2023-02-28T02:37:30Z) - Multimodal Neural Machine Translation with Search Engine Based Image
Retrieval [4.662583832063716]
We propose an open-vocabulary image retrieval method to collect descriptive images for bilingual parallel corpora.
Our proposed method achieves significant improvements over strong baselines.
arXiv Detail & Related papers (2022-07-26T08:42:06Z) - Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments have been conducted on the MS-COCO dataset demonstrating the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z) - Generative Imagination Elevates Machine Translation [37.78397666835735]
We propose ImagiT, a novel machine translation method via visual imagination.
ImagiT first learns to generate visual representation from the source sentence, and then utilizes both source sentence and the "imagined representation" to produce a target translation.
Experiments demonstrate that ImagiT benefits from visual imagination and significantly outperforms the text-only neural machine translation baselines.
arXiv Detail & Related papers (2020-09-21T07:44:04Z) - Unsupervised Multimodal Neural Machine Translation with Pseudo Visual
Pivoting [105.5303416210736]
Unsupervised machine translation (MT) has recently achieved impressive results with monolingual corpora only.
It is still challenging to associate source-target sentences in the latent space.
As people who speak different languages biologically share similar visual systems, the potential of achieving better alignment through visual content is promising.
arXiv Detail & Related papers (2020-05-06T20:11:46Z) - Translation Artifacts in Cross-lingual Transfer Learning [51.66536640084888]
We show that machine translation can introduce subtle artifacts that have a notable impact on existing cross-lingual models.
In natural language inference, translating the premise and the hypothesis independently can reduce the lexical overlap between them.
We also improve the state-of-the-art in XNLI for the translate-test and zero-shot approaches by 4.3 and 2.8 points, respectively.
arXiv Detail & Related papers (2020-04-09T17:54:30Z)
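The lexical-overlap effect noted in the last entry is easy to make concrete. The toy snippet below uses a simple Jaccard word-overlap measure and invented sentence pairs (both are illustrative assumptions, not the paper's metric or data) to show how independently translating a premise and hypothesis can shrink their shared vocabulary:
```python
# Toy illustration of the lexical-overlap effect described in "Translation
# Artifacts in Cross-lingual Transfer Learning": translating an NLI premise and
# hypothesis independently tends to lower their word overlap. The Jaccard measure
# and the example sentences are illustrative, not the paper's exact setup.
def lexical_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two sentences."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

# Hypothetical pair translated as a unit (shared word choices survive).
premise_joint = "the man is playing a guitar on the stage"
hypothesis_joint = "the man is playing an instrument"

# The same pair after premise and hypothesis are translated independently:
# synonyms drift apart ("performing"/"musician"), shrinking the overlap.
premise_indep = "the musician is performing with a guitar on stage"
hypothesis_indep = "the man is playing an instrument"

print(lexical_overlap(premise_joint, hypothesis_joint))    # higher overlap (0.40)
print(lexical_overlap(premise_indep, hypothesis_indep))    # lower overlap (~0.15)
```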
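For the VaLiD entry earlier in the list, the summary only names the technique (Visual Layer Fusion Contrastive Decoding), so the sketch below shows the generic contrastive-decoding pattern such methods build on: next-token logits from a reliable visual pathway are contrasted against logits from a degraded or intermediate-layer visual encoding. The (1 + alpha)/alpha weighting, the plausibility cutoff, and the choice of contrast signal are assumptions drawn from the broader contrastive-decoding literature, not VaLiD's published formulation.
```python
# Generic visual contrastive decoding, sketched as a standalone function. This is
# the broad idea the VaLiD summary alludes to; the exact layer fusion and scaling
# used by VaLiD are not described above, so everything here is an assumption.
import torch
import torch.nn.functional as F

def contrastive_next_token(logits_reliable: torch.Tensor,
                           logits_degraded: torch.Tensor,
                           alpha: float = 1.0,
                           beta: float = 0.1) -> torch.Tensor:
    """Return a next-token distribution that amplifies what the reliable visual
    pathway supports and the degraded pathway does not."""
    contrasted = (1 + alpha) * logits_reliable - alpha * logits_degraded

    # Adaptive plausibility constraint: keep only tokens to which the reliable
    # pathway assigns at least beta * max-probability, to avoid promoting junk.
    probs_reliable = F.softmax(logits_reliable, dim=-1)
    cutoff = beta * probs_reliable.max(dim=-1, keepdim=True).values
    contrasted = contrasted.masked_fill(probs_reliable < cutoff, float("-inf"))
    return F.softmax(contrasted, dim=-1)

# Toy usage with random logits standing in for two LVLM forward passes.
vocab = 32
logits_final_layer = torch.randn(1, vocab)   # visual features from the last layer
logits_mid_layer = torch.randn(1, vocab)     # features from an intermediate layer
next_token = contrastive_next_token(logits_final_layer, logits_mid_layer).argmax(-1)
print(next_token)
```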
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information listed and is not responsible for any consequences arising from its use.