Efficient Object-Level Visual Context Modeling for Multimodal Machine
Translation: Masking Irrelevant Objects Helps Grounding
- URL: http://arxiv.org/abs/2101.05208v1
- Date: Fri, 18 Dec 2020 11:10:00 GMT
- Title: Efficient Object-Level Visual Context Modeling for Multimodal Machine
Translation: Masking Irrelevant Objects Helps Grounding
- Authors: Dexin Wang and Deyi Xiong
- Abstract summary: We propose an object-level visual context modeling framework (OVC) to efficiently capture and explore visual information for multimodal machine translation.
OVC encourages MMT to ground translation on desirable visual objects by masking irrelevant objects in the visual modality.
Experiments on MMT datasets demonstrate that the proposed OVC model outperforms state-of-the-art MMT models.
- Score: 25.590409802797538
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual context provides grounding information for multimodal machine
translation (MMT). However, previous MMT models and probing studies on visual
features suggest that visual information is less explored in MMT as it is often
redundant to textual information. In this paper, we propose an object-level
visual context modeling framework (OVC) to efficiently capture and explore
visual information for multimodal machine translation. With detected objects,
the proposed OVC encourages MMT to ground translation on desirable visual
objects by masking irrelevant objects in the visual modality. We equip the
proposed OVC with an additional object-masking loss to achieve this goal. The
object-masking loss is estimated according to the similarity between masked
objects and the source texts so as to encourage masking source-irrelevant
objects. Additionally, in order to generate vision-consistent target words, we
further propose a vision-weighted translation loss for OVC. Experiments on MMT
datasets demonstrate that the proposed OVC model outperforms state-of-the-art
MMT models and analyses show that masking irrelevant objects helps grounding in
MMT.
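The abstract describes two auxiliary objectives: an object-masking loss estimated from the similarity between masked objects and the source text, and a vision-weighted translation loss that favors vision-consistent target words. The sketch below is a minimal, hypothetical PyTorch illustration of how such losses could be wired up; the function names, the cosine-similarity relevance proxy, and the per-token weighting scheme are assumptions for illustration, not the paper's actual formulation.

```python
# Hypothetical sketch only: illustrates an object-masking loss and a
# vision-weighted translation loss in the spirit described by the abstract.
# All names and the exact similarity/weighting choices are assumptions.
import torch
import torch.nn.functional as F


def object_masking_loss(obj_feats, text_feats, mask_logits):
    """Encourage the model to mask source-irrelevant objects.

    obj_feats:   (batch, n_obj, d)  features of detected objects
    text_feats:  (batch, d)         pooled source-sentence representation
    mask_logits: (batch, n_obj)     model scores for masking each object
    """
    # Cosine similarity between each object and the source text, used here
    # as an assumed proxy for object relevance.
    sim = F.cosine_similarity(obj_feats, text_feats.unsqueeze(1), dim=-1)
    # Less similar (less relevant) objects get a larger target masking probability.
    target = torch.softmax(-sim, dim=-1)
    log_pred = torch.log_softmax(mask_logits, dim=-1)
    return F.kl_div(log_pred, target, reduction="batchmean")


def vision_weighted_nll(log_probs, targets, vision_weights):
    """Weight the per-token negative log-likelihood by an assumed
    vision-relevance score for each gold target token.

    log_probs:      (batch, seq, vocab) decoder log-probabilities
    targets:        (batch, seq)        gold target token ids
    vision_weights: (batch, seq)        per-token vision relevance weights
    """
    nll = F.nll_loss(log_probs.transpose(1, 2), targets, reduction="none")
    return (vision_weights * nll).mean()


# Shape-only usage example with random tensors.
B, N, D, T, V = 2, 5, 16, 7, 100
obj = torch.randn(B, N, D)
txt = torch.randn(B, D)
mask_logits = torch.randn(B, N)
log_probs = torch.log_softmax(torch.randn(B, T, V), dim=-1)
targets = torch.randint(0, V, (B, T))
weights = torch.rand(B, T)
total_aux = object_masking_loss(obj, txt, mask_logits) + vision_weighted_nll(
    log_probs, targets, weights
)
print(total_aux.item())
```

In this sketch, objects whose features are least similar to the pooled source representation receive the largest target masking probability, while target tokens with higher assumed vision relevance contribute more to the translation loss.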
Related papers
- Towards Interpreting Visual Information Processing in Vision-Language Models [24.51408101801313]
Vision-Language Models (VLMs) are powerful tools for processing and understanding text and images.
We study the processing of visual tokens in the language model component of LLaVA, a prominent VLM.
arXiv Detail & Related papers (2024-10-09T17:55:02Z)
- ClawMachine: Fetching Visual Tokens as An Entity for Referring and Grounding [67.63933036920012]
Existing methods, including proxy encoding and geometry encoding, incorporate additional syntax to encode the object's location.
This study presents ClawMachine, offering a new methodology that notates an entity directly using the visual tokens.
ClawMachine unifies visual referring and grounding into an auto-regressive format and learns with a decoder-only architecture.
arXiv Detail & Related papers (2024-06-17T08:39:16Z)
- 3AM: An Ambiguity-Aware Multi-Modal Machine Translation Dataset [90.95948101052073]
We introduce 3AM, an ambiguity-aware MMT dataset comprising 26,000 parallel sentence pairs in English and Chinese.
Our dataset is specifically designed to include more ambiguity and a greater variety of both captions and images than other MMT datasets.
Experimental results show that MMT models trained on our dataset exhibit a greater ability to exploit visual information than those trained on other MMT datasets.
arXiv Detail & Related papers (2024-04-29T04:01:30Z)
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM model that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
- Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training [87.69394953339238]
Masked image modeling (MIM) has recently been introduced for fine-grained cross-modal alignment.
We propose a semantics-enhanced cross-modal MIM framework (SemMIM) for vision-language representation learning.
arXiv Detail & Related papers (2024-03-01T03:25:58Z)
- Self-Supervised Learning for Visual Relationship Detection through Masked Bounding Box Reconstruction [6.798515070856465]
We present a novel self-supervised approach for representation learning, particularly for the task of Visual Relationship Detection (VRD).
Motivated by the effectiveness of Masked Image Modeling (MIM), we propose Masked Bounding Box Reconstruction (MBBR).
arXiv Detail & Related papers (2023-11-08T16:59:26Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
- Increasing Visual Awareness in Multimodal Neural Machine Translation from an Information Theoretic Perspective [14.100033405711685]
Multimodal machine translation (MMT) aims to improve translation quality by equipping the source sentence with its corresponding image.
In this paper, we endeavor to improve MMT performance by increasing visual awareness from an information theoretic perspective.
arXiv Detail & Related papers (2022-10-16T08:11:44Z)
- Multi-modal Transformers Excel at Class-agnostic Object Detection [105.10403103027306]
We argue that existing methods lack a top-down supervision signal governed by human-understandable semantics.
We develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention.
We show the significance of MViT proposals in a diverse range of applications.
arXiv Detail & Related papers (2021-11-22T18:59:29Z)
- Vision Matters When It Should: Sanity Checking Multimodal Machine Translation Models [25.920891392933058]
Multimodal machine translation (MMT) systems have been shown to outperform their text-only neural machine translation (NMT) counterparts when visual context is available.
Recent studies have also shown that the performance of MMT models is only marginally impacted when the associated image is replaced with an unrelated image or noise.
arXiv Detail & Related papers (2021-09-08T03:32:48Z)