Fugu-MT Paper Translation (Summary): Analyzing Modular Approaches for Visual Question Decomposition

Paper summary: Analyzing Modular Approaches for Visual Question Decomposition

arxiv url: http://arxiv.org/abs/2311.06411v1
Date: Fri, 10 Nov 2023 22:14:26 GMT
Status: Translation complete
Last updated in system: 2023-11-14 18:47:18.498983
Title: Analyzing Modular Approaches for Visual Question Decomposition
Title (reference translation): Analysis of Modular Approaches for Visual Question Decomposition
Authors: Apoorv Khandelwal, Ellie Pavlick, Chen Sun
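Abstract summary: Modular neural networks without additional training have recently been shown to outperform end-to-end neural networks on vision-language tasks. This paper asks where ViperGPT's additional performance comes from, and how much of it is due to the (state-of-the-art, end-to-end) BLIP-2 model it subsumes versus its additional symbolic components. We find that ViperGPT's reported gains over BLIP-2 can be attributed to its selection of task-specific modules; when we run ViperGPT using a task-agnostic selection of modules, these gains go away.
Reference score (independently computed attention score): 38.73070270272822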
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Modular neural networks without additional training have recently been shown to surpass end-to-end neural networks on challenging vision-language tasks. The latest such methods simultaneously introduce LLM-based code generation to build programs and a number of skill-specific, task-oriented modules to execute them. In this paper, we focus on ViperGPT and ask where its additional performance comes from and how much is due to the (state-of-the-art, end-to-end) BLIP-2 model it subsumes vs. additional symbolic components. To do so, we conduct a controlled study (comparing end-to-end, modular, and prompting-based methods across several VQA benchmarks). We find that ViperGPT's reported gains over BLIP-2 can be attributed to its selection of task-specific modules, and when we run ViperGPT using a more task-agnostic selection of modules, these gains go away. Additionally, ViperGPT retains much of its performance if we make prominent alterations to its selection of modules, e.g., removing BLIP-2 or retaining only BLIP-2. Finally, we compare ViperGPT against a prompting-based decomposition strategy and find that, on some benchmarks, modular approaches significantly benefit by representing subtasks with natural language, instead of code.
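Abstract (reference translation): Modular neural networks that require no additional training have recently been shown to outperform end-to-end neural networks on vision-language tasks. The latest methods simultaneously introduce LLM-based code generation to build programs and several skill-specific, task-oriented modules to execute them. This paper focuses on ViperGPT and asks where its additional performance comes from, and how much of it is due to the (state-of-the-art, end-to-end) BLIP-2 model it subsumes as opposed to its additional symbolic components. To that end, we conduct a controlled study comparing end-to-end, modular, and prompting-based methods on several VQA benchmarks. ViperGPT's reported gains over BLIP-2 are attributable to its selection of task-specific modules, and these gains disappear when ViperGPT is run with a more task-agnostic selection of modules. Furthermore, ViperGPT retains much of its performance under prominent changes to its module selection, such as removing BLIP-2 or retaining only BLIP-2. Finally, we compare ViperGPT with a prompting-based decomposition strategy and find that, on some benchmarks, modular approaches benefit significantly from representing subtasks in natural language rather than code.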

Related papers