Increasing Visual Awareness in Multimodal Neural Machine Translation from an Information Theoretic Perspective
- URL: http://arxiv.org/abs/2210.08478v1
- Date: Sun, 16 Oct 2022 08:11:44 GMT
- Title: Increasing Visual Awareness in Multimodal Neural Machine Translation from an Information Theoretic Perspective
- Authors: Baijun Ji, Tong Zhang, Yicheng Zou, Bojie Hu and Si Shen
- Abstract summary: Multimodal machine translation (MMT) aims to improve translation quality by equipping the source sentence with its corresponding image.
In this paper, we endeavor to improve MMT performance by increasing visual awareness from an information theoretic perspective.
- Score: 14.100033405711685
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal machine translation (MMT) aims to improve translation quality by
equipping the source sentence with its corresponding image. Despite the
promising performance, MMT models still suffer from the problem of input
degradation: models focus more on textual information, while visual information
is generally overlooked. In this paper, we endeavor to improve MMT performance
by increasing visual awareness from an information theoretic perspective.
Specifically, we decompose the informative visual signals into two parts:
source-specific information and target-specific information. We use mutual
information to quantify them and propose two methods for objective optimization
to better leverage visual signals. Experiments on two datasets demonstrate that
our approach can effectively enhance the visual awareness of MMT models and
achieve superior results over strong baselines.
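To make the information-theoretic framing more concrete, the sketch below shows one common way to estimate and maximize mutual information between image features and text features using an InfoNCE-style lower bound, added to the usual translation loss as an auxiliary term. This is a minimal illustration, not the authors' implementation: the module names, feature dimensions, temperature, and loss weights are all assumptions made for the example.

```python
# Minimal sketch (not the authors' code): an InfoNCE-style lower bound on the
# mutual information between paired image and text features. Minimizing this
# loss maximizes the bound, encouraging the model to retain visual information
# that is predictive of the text. All names and hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InfoNCEMutualInfo(nn.Module):
    """Estimates a lower bound on I(image; text) over a batch of paired features."""
    def __init__(self, img_dim: int, txt_dim: int, proj_dim: int = 256, temperature: float = 0.07):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, proj_dim)
        self.txt_proj = nn.Linear(txt_dim, proj_dim)
        self.temperature = temperature

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, img_dim), txt_feats: (B, txt_dim); row i of each is a matched pair.
        z_img = F.normalize(self.img_proj(img_feats), dim=-1)
        z_txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        logits = z_img @ z_txt.t() / self.temperature            # (B, B) similarity matrix
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric InfoNCE: the diagonal (matched pairs) are the positives.
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Illustrative use: one instance tied to source-side representations and one tied to
# target-side representations could play the roles of source-specific and
# target-specific information, weighted into the overall objective, e.g.
# total_loss = translation_loss + lambda_src * mi_src(img, src_repr) + lambda_tgt * mi_tgt(img, tgt_repr)
```

Under these assumptions, the auxiliary terms penalize the model for discarding visual content that carries information about the source or target sentence, which is one plausible way to operationalize "increasing visual awareness"; the paper's actual objectives may differ in form.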
Related papers
- MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct [148.39859547619156]
We propose MMEvol, a novel multimodal instruction data evolution framework.
MMEvol iteratively improves data quality through a refined combination of fine-grained perception, cognitive reasoning, and interaction evolution.
Our approach reaches state-of-the-art (SOTA) performance in nine tasks using significantly less data compared to state-of-the-art models.
arXiv Detail & Related papers (2024-09-09T17:44:00Z)
- Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization [49.08348604716746]
Multimodal Summarization with Multimodal Output (MSMO) aims to produce a multimodal summary that integrates both text and relevant images.
In this paper, we propose an Entity-Guided Multimodal Summarization model (EGMS).
Our model, building on BART, utilizes dual multimodal encoders with shared weights to process text-image and entity-image information concurrently.
arXiv Detail & Related papers (2024-08-06T12:45:56Z)
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image Comprehension (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
- 3AM: An Ambiguity-Aware Multi-Modal Machine Translation Dataset [90.95948101052073]
We introduce 3AM, an ambiguity-aware MMT dataset comprising 26,000 parallel sentence pairs in English and Chinese.
Our dataset is specifically designed to include more ambiguity and a greater variety of both captions and images than other MMT datasets.
Experimental results show that MMT models trained on our dataset exhibit a greater ability to exploit visual information than those trained on other MMT datasets.
arXiv Detail & Related papers (2024-04-29T04:01:30Z)
- Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models [121.83413400686139]
This paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism.
We introduce a novel method that incorporates multi-task encoders and visual tools into the existing MLLMs training and inference pipeline.
arXiv Detail & Related papers (2024-01-06T02:02:34Z)
- Information Screening whilst Exploiting! Multimodal Relation Extraction with Feature Denoising and Multimodal Topic Modeling [96.75821232222201]
Existing research on multimodal relation extraction (MRE) faces two co-existing challenges: internal-information over-utilization and external-information under-exploitation.
We propose a novel framework that simultaneously implements the idea of internal-information screening and external-information exploiting.
arXiv Detail & Related papers (2023-05-19T14:56:57Z)
- Generalization algorithm of multimodal pre-training model based on graph-text self-supervised training [0.0]
A multimodal pre-training generalization algorithm for self-supervised training is proposed.
We show that when the filtered information is used to fine-tune multimodal machine translation, translation on the global voice dataset is 0.5 BLEU higher than the baseline.
arXiv Detail & Related papers (2023-02-16T03:34:08Z)
- Vision Matters When It Should: Sanity Checking Multimodal Machine Translation Models [25.920891392933058]
Multimodal machine translation (MMT) systems have been shown to outperform their text-only neural machine translation (NMT) counterparts when visual context is available.
Recent studies have also shown that the performance of MMT models is only marginally impacted when the associated image is replaced with an unrelated image or noise.
arXiv Detail & Related papers (2021-09-08T03:32:48Z)
- Exploiting Multimodal Reinforcement Learning for Simultaneous Machine Translation [33.698254673743904]
We explore two main concepts: (a) adaptive policies to learn a good trade-off between high translation quality and low latency; and (b) visual information to support this process.
We propose a multimodal approach to simultaneous machine translation using reinforcement learning, with strategies to integrate visual and textual information in both the agent and the environment.
arXiv Detail & Related papers (2021-02-22T22:26:22Z)
- Efficient Object-Level Visual Context Modeling for Multimodal Machine Translation: Masking Irrelevant Objects Helps Grounding [25.590409802797538]
We propose an object-level visual context modeling framework (OVC) to efficiently capture and explore visual information for multimodal machine translation.
OVC encourages MMT to ground translation on desirable visual objects by masking irrelevant objects in the visual modality.
Experiments on MMT datasets demonstrate that the proposed OVC model outperforms state-of-the-art MMT models.
arXiv Detail & Related papers (2020-12-18T11:10:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.