Related papers: Increasing Visual Awareness in Multimodal Neural Machine Translation from an Information Theoretic Perspective

Related papers

Multimodal Machine Translation with Visual Scene Graph Pruning [31.85382347738067]
Multimodal machine translation (MMT) seeks to address the challenges posed by linguistic polysemy and ambiguity in translation tasks by incorporating visual information.<n>We introduce a novel approach--multimodal machine translation with visual Scene Graph Pruning (PSG)<n>PSG leverages language scene graph information to guide the pruning of redundant nodes in visual scene graphs, thereby reducing noise in downstream translation tasks.
arXiv Detail & Related papers (2025-05-26T04:35:03Z)
Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought [72.93910800095757]
multimodal chain-of-thought (MCoT) improves performance and interpretability of large vision-language models (LVLMs)<n>We show that MCoT boosts LVLMs by incorporating visual thoughts, which convey image information to the reasoning process regardless of the MCoT format.<n>We also explore the internal nature of visual thoughts, finding that visual thoughts serve as intermediaries between the input image and reasoning to deeper transformer layers.
arXiv Detail & Related papers (2025-05-21T13:29:58Z)
Exploring Implicit Visual Misunderstandings in Multimodal Large Language Models through Attention Analysis [21.869968563545736]
We define implicit visual misunderstanding (IVM), where MLLMs provide correct answers without fully comprehending the visual input.<n>We introduce a scale-agnostic metric, textitattention accuracy, and a novel benchmark for quantifying IVMs.<n>We extend our approach to finer granularities and demonstrate its effectiveness in unimodal scenarios.
arXiv Detail & Related papers (2025-05-15T17:52:40Z)
Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference [28.24397677839652]
Multimodal large language models (MLLMs) improve performance on vision-language tasks by integrating visual features from pre-trained vision encoders into large language models. How MLLMs process and utilize visual information remains unclear. We propose Hierarchical Modality-Aware Pruning (HiMAP), a plug-and-play inference acceleration method that dynamically prunes image tokens at specific layers, reducing computational costs by approximately 65% without sacrificing performance.
arXiv Detail & Related papers (2025-03-17T12:31:23Z)
Make Imagination Clearer! Stable Diffusion-based Visual Imagination for Multimodal Machine Translation [40.42326040668964]
We introduce a stable diffusion-based imagination network into a multimodal large language model (MLLM) to explicitly generate an image for each source sentence. We build human feedback with reinforcement learning to ensure the consistency of the generated image with the source sentence. Experimental results show that our model significantly outperforms existing multimodal MT and text-only MT.
arXiv Detail & Related papers (2024-12-17T07:41:23Z)
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct [148.39859547619156]
We propose MMEvol, a novel multimodal instruction data evolution framework. MMEvol iteratively improves data quality through a refined combination of fine-grained perception, cognitive reasoning, and interaction evolution. Our approach reaches state-of-the-art (SOTA) performance in nine tasks using significantly less data compared to state-of-the-art models.
arXiv Detail & Related papers (2024-09-09T17:44:00Z)
Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization [49.08348604716746]
Multimodal Summarization with Multimodal Output (MSMO) aims to produce a multimodal summary that integrates both text and relevant images. In this paper, we propose an Entity-Guided Multimodal Summarization model (EGMS) Our model, building on BART, utilizes dual multimodal encoders with shared weights to process text-image and entity-image information concurrently.
arXiv Detail & Related papers (2024-08-06T12:45:56Z)
Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image (STIC), which emphasizes a self-training approach specifically for image comprehension. First, the model self-constructs a preference for image descriptions using unlabeled images. To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
3AM: An Ambiguity-Aware Multi-Modal Machine Translation Dataset [90.95948101052073]
We introduce 3AM, an ambiguity-aware MMT dataset comprising 26,000 parallel sentence pairs in English and Chinese. Our dataset is specifically designed to include more ambiguity and a greater variety of both captions and images than other MMT datasets. Experimental results show that MMT models trained on our dataset exhibit a greater ability to exploit visual information than those trained on other MMT datasets.
arXiv Detail & Related papers (2024-04-29T04:01:30Z)
Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models [121.83413400686139]
This paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism. We introduce a novel method that incorporates multi-task encoders and visual tools into the existing MLLMs training and inference pipeline.
arXiv Detail & Related papers (2024-01-06T02:02:34Z)
Information Screening whilst Exploiting! Multimodal Relation Extraction with Feature Denoising and Multimodal Topic Modeling [96.75821232222201]
Existing research on multimodal relation extraction (MRE) faces two co-existing challenges, internal-information over-utilization and external-information under-exploitation. We propose a novel framework that simultaneously implements the idea of internal-information screening and external-information exploiting.
arXiv Detail & Related papers (2023-05-19T14:56:57Z)
Generalization algorithm of multimodal pre-training model based on graph-text self-supervised training [0.0]
multimodal pre-training generalization algorithm for self-supervised training is proposed. We show that when the filtered information is used as multimodal machine translation for fine-tuning, the effect of translation in the global voice dataset is 0.5 BLEU higher than the baseline.
arXiv Detail & Related papers (2023-02-16T03:34:08Z)
Vision Matters When It Should: Sanity Checking Multimodal Machine Translation Models [25.920891392933058]
Multimodal machine translation (MMT) systems have been shown to outperform their text-only neural machine translation (NMT) counterparts when visual context is available. Recent studies have also shown that the performance of MMT models is only marginally impacted when the associated image is replaced with an unrelated image or noise.
arXiv Detail & Related papers (2021-09-08T03:32:48Z)
Exploiting Multimodal Reinforcement Learning for Simultaneous Machine Translation [33.698254673743904]
We explore two main concepts: (a) adaptive policies to learn a good trade-off between high translation quality and low latency; and (b) visual information to support this process. We propose a multimodal approach to simultaneous machine translation using reinforcement learning, with strategies to integrate visual and textual information in both the agent and the environment.
arXiv Detail & Related papers (2021-02-22T22:26:22Z)
Efficient Object-Level Visual Context Modeling for Multimodal Machine Translation: Masking Irrelevant Objects Helps Grounding [25.590409802797538]
We propose an object-level visual context modeling framework (OVC) to efficiently capture and explore visual information for multimodal machine translation. OVC encourages MMT to ground translation on desirable visual objects by masking irrelevant objects in the visual modality. Experiments on MMT datasets demonstrate that the proposed OVC model outperforms state-of-the-art MMT models.
arXiv Detail & Related papers (2020-12-18T11:10:00Z)

This list is automatically generated from the titles and abstracts of the papers in this site.