Fugu-MT Paper Translation (Abstract): Directed-Tokens: A Robust Multi-Modality Alignment Approach to Large Language-Vision Models

Paper overview: Directed-Tokens: A Robust Multi-Modality Alignment Approach to Large Language-Vision Models

arxiv url: http://arxiv.org/abs/2508.14264v1
Date: Tue, 19 Aug 2025 20:53:24 GMT
Status: Translation complete
Last updated in system: 2025-08-21 16:52:41.268165
Title: Directed-Tokens: A Robust Multi-Modality Alignment Approach to Large Language-Vision Models
Title (reference translation): Directed-Tokens: A Robust Multimodal Alignment Approach for Large Language-Vision Models
Authors: Thanh-Dat Truong, Huu-Thien Tran, Tran Thai Son, Bhiksha Raj, Khoa Luu
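Abstract summary: We propose a simple but efficient learning mechanism for improving the robust alignment between visual and textual modalities. The proposed approach consistently achieves state-of-the-art (SoTA) performance compared with prior LMMs.
Reference score (independently computed attention measure): 28.82265769298008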
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large multimodal models (LMMs) have gained impressive performance due to their outstanding capability in various understanding tasks. However, these models still suffer from some fundamental limitations related to robustness and generalization due to the alignment and correlation between visual and textual features. In this paper, we introduce a simple but efficient learning mechanism for improving the robust alignment between visual and textual modalities by solving shuffling problems. In particular, the proposed approach can improve reasoning capability, visual understanding, and cross-modality alignment by introducing two new tasks: reconstructing the image order and the text order into the LMM's pre-training and fine-tuning phases. In addition, we propose a new directed-token approach to capture visual and textual knowledge, enabling the capability to reconstruct the correct order of visual inputs. Then, we introduce a new Image-to-Response Guided loss to further improve the visual understanding of the LMM in its responses. The proposed approach consistently achieves state-of-the-art (SoTA) performance compared with prior LMMs on academic task-oriented and instruction-following LMM benchmarks.
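Abstract (reference translation): Large multimodal models (LMMs) deliver strong performance across a variety of understanding tasks. However, these models still suffer from some fundamental limitations in robustness and generalization that stem from the alignment and correlation between visual and textual features. In this paper, we propose a simple but efficient learning mechanism that improves the robust alignment between visual and textual modalities by solving shuffling problems. In particular, the proposed approach improves reasoning capability, visual understanding, and cross-modality alignment by introducing two new tasks, reconstructing the image order and reconstructing the text order, into the LMM's pre-training and fine-tuning phases. In addition, we propose a new directed-token approach that captures visual and textual knowledge and makes it possible to reconstruct the correct order of visual inputs. We then introduce a new Image-to-Response Guided loss to further improve the visual understanding reflected in the LMM's responses. The proposed approach consistently achieves state-of-the-art (SoTA) performance compared with prior LMMs on academic task-oriented and instruction-following LMM benchmarks.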
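
Illustrative sketch: the shuffling tasks mentioned in the abstract can be pictured with a small, generic example. This is not the authors' implementation. The function names `make_shuffle_task` and `restore_order`, the string placeholders standing in for image patches or text segments, and the choice of original indices as the prediction target are all assumptions made only for illustration.

```python
import random


def make_shuffle_task(items, seed=None):
    """Shuffle `items` and return (shuffled, target).

    target[i] is the original index of the item now at position i.
    In this hypothetical setup, that index list is what an
    order-reconstruction objective would ask a model to predict.
    """
    rng = random.Random(seed)
    perm = list(range(len(items)))
    rng.shuffle(perm)
    shuffled = [items[p] for p in perm]
    return shuffled, perm


def restore_order(shuffled, predicted):
    """Undo the shuffle given the predicted original indices."""
    restored = [None] * len(shuffled)
    for pos, orig_idx in enumerate(predicted):
        restored[orig_idx] = shuffled[pos]
    return restored


# Placeholder "patches"; a real pipeline would use image or text tokens.
patches = ["patch_0", "patch_1", "patch_2", "patch_3"]
shuffled, target = make_shuffle_task(patches, seed=0)

# With a perfect prediction, the original order is recovered exactly.
assert restore_order(shuffled, target) == patches
```

Under these assumptions, a model that predicts `target` correctly has recovered the original ordering, which is the general idea behind an order-reconstruction task; how the paper encodes and trains this target is not specified in the abstract.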


Related papers