Fugu-MT Paper Translation (Summary): Towards Self-Refinement of Vision-Language Models with Triangular Consistency

Paper summary: Towards Self-Refinement of Vision-Language Models with Triangular Consistency

arxiv url: http://arxiv.org/abs/2510.10487v1
Date: Sun, 12 Oct 2025 07:37:47 GMT
Status: Translation complete
Last updated in system: 2025-10-14 18:06:29.970449
Title: Towards Self-Refinement of Vision-Language Models with Triangular Consistency
Title (reference translation): Towards Self-Refinement of Vision-Language Models with Triangular Consistency
Authors: Yunlong Deng, Guangyi Chen, Tianpei Gu, Lingjing Kong, Yan Li, Zeyu Tang, Kun Zhang
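Abstract summary: Vision-Language Models (VLMs) integrate visual knowledge with the analytical capabilities of Large Language Models (LLMs). This study validates that VLMs possess inherent self-refinement capabilities.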
Reference score (independently computed attention metric): 16.24217978112331
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language Models (VLMs) integrate visual knowledge with the analytical capabilities of Large Language Models (LLMs) through supervised visual instruction tuning, using image-question-answer triplets. However, the potential of VLMs trained without supervised instruction remains largely unexplored. This study validates that VLMs possess inherent self-refinement capabilities, enabling them to generate high-quality supervised data without external inputs and thereby learn autonomously. Specifically, to stimulate the self-refinement ability of VLMs, we propose a self-refinement framework based on a Triangular Consistency principle: within the image-query-answer triangle, any masked elements should be consistently and accurately reconstructed. The framework involves three steps: (1) We enable the instruction generation ability of VLMs by adding multi-task instruction tuning like image$\rightarrow$question-answer or image-answer$\rightarrow$question. (2) We generate image-query-answer triplets from unlabeled images and use the Triangular Consistency principle for filtering. (3) The model is further updated using the filtered synthetic data. To investigate the underlying mechanisms behind this self-refinement capability, we conduct a theoretical analysis from a causal perspective. Using the widely recognized LLaVA-1.5 as our baseline, our experiments reveal that the model can autonomously achieve consistent, though deliberately modest, improvements across multiple benchmarks without any external supervision, such as human annotations or environmental feedback. We expect that the insights of this study on the self-refinement ability of VLMs can inspire future research on the learning mechanism of VLMs. Code is available at https://github.com/dengyl20/SRF-LLaVA-1.5.
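Abstract (reference translation): Vision-Language Models (VLMs) integrate visual knowledge with the analytical capabilities of Large Language Models (LLMs). However, the potential of VLMs trained without supervised instruction remains largely unexplored. This study validates that VLMs possess inherent self-refinement capabilities. Specifically, to stimulate the self-refinement ability of VLMs, we propose a self-refinement framework based on a Triangular Consistency principle. (1) We enable the instruction generation ability of VLMs by adding multi-task instruction tuning such as image$\rightarrow$question-answer or image-answer$\rightarrow$question. (2) We generate image-query-answer triplets from unlabeled images and filter them using the Triangular Consistency principle. (3) The model is further updated using the filtered synthetic data. To investigate the mechanisms underlying this self-refinement capability, we conduct a theoretical analysis from a causal perspective. Using the widely recognized LLaVA-1.5 as the baseline, our experiments show that the model can autonomously achieve consistent improvements across multiple benchmarks without any external supervision, such as human annotations or environmental feedback. We expect that these insights into the self-refinement ability of VLMs can inspire future research on the learning mechanisms of VLMs. Code is available at https://github.com/dengyl20/SRF-LLaVA-1.5.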
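
The filtering step (2) in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how reconstruction or scoring is done, so the names below (Triplet, token_overlap, consistency_score, filter_triplets, the threshold of 0.5) and the word-overlap similarity are all assumptions chosen for illustration. The two model calls are passed in as plain functions, and only the question and answer corners of the triangle are masked here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Triplet:
    image_id: str   # hypothetical stand-in for an image
    question: str
    answer: str


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (an assumed similarity metric)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def consistency_score(
    t: Triplet,
    answer_fn: Callable[[str, str], str],
    question_fn: Callable[[str, str], str],
) -> float:
    """Mask one element at a time, reconstruct it, and compare with the original."""
    # Mask the answer: reconstruct it from (image, question).
    rebuilt_answer = answer_fn(t.image_id, t.question)
    # Mask the question: reconstruct it from (image, answer).
    rebuilt_question = question_fn(t.image_id, t.answer)
    # Assumption: a triplet is only as consistent as its weakest reconstruction.
    return min(
        token_overlap(rebuilt_answer, t.answer),
        token_overlap(rebuilt_question, t.question),
    )


def filter_triplets(
    triplets: List[Triplet],
    answer_fn: Callable[[str, str], str],
    question_fn: Callable[[str, str], str],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[Triplet]:
    """Keep only triplets whose consistency score reaches the threshold."""
    return [
        t for t in triplets
        if consistency_score(t, answer_fn, question_fn) >= threshold
    ]


# Toy stand-ins for the two VLM directions (image+question -> answer,
# image+answer -> question); a real setup would call the tuned model instead.
_FACTS = {"img1": ("what color is the car", "red")}


def toy_answer(image_id: str, question: str) -> str:
    q, a = _FACTS[image_id]
    return a if question == q else "unknown"


def toy_question(image_id: str, answer: str) -> str:
    q, a = _FACTS[image_id]
    return q if answer == a else "what is shown"


candidates = [
    Triplet("img1", "what color is the car", "red"),
    Triplet("img1", "what color is the car", "blue"),
]
kept = filter_triplets(candidates, toy_answer, toy_question)
```

In this toy run the triplet with the answer "blue" is dropped, because the answer reconstructed from the image and question does not match it. Step (3) would then fine-tune the model on the kept triplets.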
