Fugu-MT paper translation (abstract): NLKI: A lightweight Natural Language Knowledge Integration Framework for Improving Small VLMs in Commonsense VQA Tasks

Paper overview: NLKI: A lightweight Natural Language Knowledge Integration Framework for Improving Small VLMs in Commonsense VQA Tasks

arxiv url: http://arxiv.org/abs/2508.19724v2
Date: Thu, 28 Aug 2025 12:05:33 GMT
Status: translation complete
Last updated in system: 2025-08-29 13:55:31.756755
Title: NLKI: A lightweight Natural Language Knowledge Integration Framework for Improving Small VLMs in Commonsense VQA Tasks
Title (reference translation): NLKI: A lightweight natural language knowledge integration framework for improving small VLMs on commonsense VQA tasks
Authors: Aritra Dutta, Swapnanil Mukherjee, Deepanway Ghosal, Somak Aditya
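Abstract summary: Small vision-language models (sVLMs) such as ViLT, VisualBERT, and FLAVA lag behind larger generative models. To study how careful commonsense knowledge integration affects sVLMs, the paper proposes NLKI, an end-to-end framework. Facts retrieved with a fine-tuned ColBERTv2, together with an object-information-enriched prompt, yield explanations that largely cut down hallucinations.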
Reference score (independently computed attention score): 11.150587073510252
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Commonsense visual-question answering often hinges on knowledge that is missing from the image or the question. Small vision-language models (sVLMs) such as ViLT, VisualBERT and FLAVA therefore lag behind their larger generative counterparts. To study the effect of careful commonsense knowledge integration on sVLMs, we present an end-to-end framework (NLKI) that (i) retrieves natural language facts, (ii) prompts an LLM to craft natural language explanations, and (iii) feeds both signals to sVLMs respectively across two commonsense VQA datasets (CRIC, AOKVQA) and a visual-entailment dataset (e-SNLI-VE). Facts retrieved using a fine-tuned ColBERTv2 and an object information-enriched prompt yield explanations that largely cut down hallucinations, while lifting the end-to-end answer accuracy by up to 7% (across 3 datasets), making FLAVA and other models in NLKI match or exceed medium-sized VLMs such as Qwen-2 VL-2B and SmolVLM-2.5B. As these benchmarks contain 10-25% label noise, additional finetuning using noise-robust losses (such as symmetric cross entropy and generalised cross entropy) adds another 2.5% in CRIC, and 5.5% in AOKVQA. Our findings expose when LLM-based commonsense knowledge beats retrieval from commonsense knowledge bases, how noise-aware training stabilises small models in the context of external knowledge augmentation, and why parameter-efficient commonsense reasoning is now within reach for 250M models.
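Abstract (reference translation): Commonsense visual question answering often depends on knowledge that is missing from the image or the question. Small vision-language models (sVLMs) such as ViLT, VisualBERT, and FLAVA therefore lag behind larger generative models. To study how careful commonsense knowledge integration affects sVLMs, the paper proposes NLKI, an end-to-end framework that (i) retrieves natural language facts, (ii) prompts an LLM to write natural language explanations, and (iii) feeds both signals to sVLMs, evaluated on two commonsense VQA datasets (CRIC, AOKVQA) and a visual-entailment dataset (e-SNLI-VE). Facts retrieved with a fine-tuned ColBERTv2, combined with an object-information-enriched prompt, yield explanations that substantially reduce hallucinations and raise end-to-end answer accuracy by up to 7% (across the three datasets), so that FLAVA and other models in NLKI match or exceed medium-sized VLMs such as Qwen-2 VL-2B and SmolVLM-2.5B. Because these benchmarks contain 10-25% label noise, additional fine-tuning with noise-robust losses (such as symmetric cross entropy and generalised cross entropy) adds another 2.5% on CRIC and 5.5% on AOKVQA. The findings show when LLM-based commonsense knowledge beats retrieval from commonsense knowledge bases, how noise-aware training stabilises small models under external knowledge augmentation, and why parameter-efficient commonsense reasoning is now within reach for 250M models.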
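
The abstract names symmetric cross entropy and generalised cross entropy as the noise-robust losses used for the extra fine-tuning step. The sketch below shows the standard per-example forms of the two losses in plain Python, as a rough illustration only. The hyperparameter defaults (`q`, `alpha`, `beta`, `log_zero`) and the helper names are assumptions made for this example; the abstract does not state the settings or the implementation used in NLKI.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label, eps=1e-12):
    """Standard cross entropy for one example with an integer label."""
    return -math.log(max(probs[label], eps))


def generalized_cross_entropy(probs, label, q=0.7):
    """Generalised cross entropy: (1 - p_y^q) / q, with 0 < q <= 1.

    q -> 0 recovers standard cross entropy; q = 1 gives 1 - p_y,
    which is bounded and therefore less sensitive to mislabeled examples.
    The default q is an assumed value, not taken from the paper.
    """
    return (1.0 - probs[label] ** q) / q


def symmetric_cross_entropy(probs, label, alpha=1.0, beta=1.0, log_zero=-4.0):
    """Symmetric cross entropy: alpha * CE + beta * reverse CE.

    Reverse CE swaps prediction and one-hot label inside the log.
    log(0) on the one-hot label is clipped to `log_zero`, so the reverse
    term reduces to -log_zero * (1 - p_y).
    alpha, beta and log_zero are assumed values, not taken from the paper.
    """
    ce = cross_entropy(probs, label)
    rce = -log_zero * (1.0 - probs[label])
    return alpha * ce + beta * rce


if __name__ == "__main__":
    probs = softmax([2.0, 0.5, -1.0])
    for y in range(3):
        print(
            y,
            round(cross_entropy(probs, y), 4),
            round(generalized_cross_entropy(probs, y), 4),
            round(symmetric_cross_entropy(probs, y), 4),
        )
```

Both losses limit how much a single confidently wrong label can contribute to the total: the generalised form is bounded by 1/q, and the reverse term of the symmetric form grows only linearly in 1 - p_y, whereas plain cross entropy grows without bound as p_y approaches zero.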

Related papers