Related papers: Directed-Tokens: A Robust Multi-Modality Alignment Approach to Large Language-Vision Models

Directed-Tokens: A Robust Multi-Modality Alignment Approach to Large Language-Vision Models

URL: http://arxiv.org/abs/2508.14264v1
Date: Tue, 19 Aug 2025 20:53:24 GMT
Title: Directed-Tokens: A Robust Multi-Modality Alignment Approach to Large Language-Vision Models
Authors: Thanh-Dat Truong, Huu-Thien Tran, Tran Thai Son, Bhiksha Raj, Khoa Luu,
Abstract summary: We introduce a simple but efficient learning mechanism for improving the robust alignment between visual and textual modalities.<n>The proposed approach consistently achieves state-of-the-art (SoTA) performance compared with prior LMMs.
Score: 28.82265769298008
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large multimodal models (LMMs) have gained impressive performance due to their outstanding capability in various understanding tasks. However, these models still suffer from some fundamental limitations related to robustness and generalization due to the alignment and correlation between visual and textual features. In this paper, we introduce a simple but efficient learning mechanism for improving the robust alignment between visual and textual modalities by solving shuffling problems. In particular, the proposed approach can improve reasoning capability, visual understanding, and cross-modality alignment by introducing two new tasks: reconstructing the image order and the text order into the LMM's pre-training and fine-tuning phases. In addition, we propose a new directed-token approach to capture visual and textual knowledge, enabling the capability to reconstruct the correct order of visual inputs. Then, we introduce a new Image-to-Response Guided loss to further improve the visual understanding of the LMM in its responses. The proposed approach consistently achieves state-of-the-art (SoTA) performance compared with prior LMMs on academic task-oriented and instruction-following LMM benchmarks.

Related papers

Reinforcing Multimodal Understanding and Generation with Dual Self-rewards [56.08202047680044]
Large language models (LLMs) unify cross-model understanding and generation into a single framework.<n>Current solutions require external supervision (e.g., human feedback or reward models) and only address unidirectional tasks.<n>We introduce a self-supervised dual reward mechanism to reinforce the understanding and generation capabilities of LMMs.
arXiv Detail & Related papers (2025-06-09T17:38:45Z)
MIND-Edit: MLLM Insight-Driven Editing via Language-Vision Projection [13.467269066605452]
We propose MIND-Edit, an end-to-end image-editing framework integrating pretrained diffusion model with MLLM.<n> MIND-Edit introduces two complementary strategies: (1) a text instruction optimization strategy that clarifies ambiguous user instructions based on semantic reasoning from the MLLM, and (2) an MLLM insight-driven editing strategy that explicitly leverages the intrinsic visual understanding capability of the MLLM to infer editing intent.<n>Extensive experiments demonstrate that MIND-Edit outperforms state-of-the-art image editing methods in both quantitative metrics and visual quality, particularly under complex and challenging scenarios.
arXiv Detail & Related papers (2025-05-25T13:54:31Z)
Integrating Visual Interpretation and Linguistic Reasoning for Math Problem Solving [61.992824291296444]
Current large vision-language models (LVLMs) typically employ a connector module to link visual features with text embeddings of large language models (LLMs)<n>This paper proposes a paradigm shift: instead of training end-to-end vision-language reasoning models, we advocate for developing a decoupled reasoning framework.
arXiv Detail & Related papers (2025-05-23T08:18:00Z)
Mitigating Visual Knowledge Forgetting in MLLM Instruction-tuning via Modality-decoupled Gradient Descent [72.1517476116743]
Recent MLLMs have shown emerging visual understanding and reasoning abilities after being pre-trained on large-scale multimodal datasets.<n>Existing approaches, such as direct fine-tuning and continual learning methods, fail to explicitly address this issue.<n>We introduce a novel perspective leveraging effective rank to quantify the degradation of visual representation forgetting.<n>We propose a modality-decoupled gradient descent (MDGD) method that regulates gradient updates to maintain the effective rank of visual representations.
arXiv Detail & Related papers (2025-02-17T12:26:34Z)
SILMM: Self-Improving Large Multimodal Models for Compositional Text-to-Image Generation [92.73405185996315]
Large Multimodal Models (LMMs) have demonstrated impressive capabilities in multimodal understanding and generation.<n>Existing approaches, such as layout planning for multi-step generation and learning from human feedback or AI feedback, depend heavily on prompt engineering.<n>We introduce a model-agnostic iterative self-feedback framework (SILMM) that can enable LMMs to provide helpful and scalable self-improvement and optimize text-image alignment.
arXiv Detail & Related papers (2024-12-08T05:28:08Z)
CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models [51.70129969269271]
We introduce a novel contrastive-based decoding method, COuntering DEscription Contrastive Decoding (CODE) Our method significantly reduces hallucinations and improves cross-modal consistency across various benchmarks and cutting-edge LMMs.
arXiv Detail & Related papers (2024-06-04T03:04:21Z)
Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest. This technique allows LVLMs to access more detailed visual information without altering the original image resolution. Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation. It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.