Fugu-MT Paper Translation (Abstract): CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

Paper abstract: CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

arxiv url: http://arxiv.org/abs/2311.03354v1
Date: Mon, 6 Nov 2023 18:59:44 GMT
Status: Translation complete
System update date: 2023-11-07 13:12:04.197180
Title: CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding
Title (reference translation): CoVLM: Composing Visual Entities and Relationships in Large Language Models via Communicative Decoding
Authors: Junyan Li, Delin Chen, Yining Hong, Zhenfang Chen, Peihao Chen, Yikang Shen, Chuang Gan
Abstract summary: We propose CoVLM, which guides the LLM to explicitly compose visual entities and relationships among the text.
Reference score (independently computed attention score): 66.52659447360104
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: A remarkable ability of human beings resides in compositional reasoning, i.e., the capacity to make "infinite use of finite means". However, current large vision-language foundation models (VLMs) fall short of such compositional abilities due to their "bag-of-words" behaviors and inability to construct words that correctly represent visual entities and the relations among the entities. To this end, we propose CoVLM, which can guide the LLM to explicitly compose visual entities and relationships among the text and dynamically communicate with the vision encoder and detection network to achieve vision-language communicative decoding. Specifically, we first devise a set of novel communication tokens for the LLM, for dynamic communication between the visual detection system and the language system. A communication token is generated by the LLM following a visual entity or a relation, to inform the detection network to propose regions that are relevant to the sentence generated so far. The proposed regions of interest (ROIs) are then fed back into the LLM for better language generation contingent on the relevant regions. The LLM is thus able to compose the visual entities and relationships through the communication tokens. The vision-to-language and language-to-vision communication are iteratively performed until the entire sentence is generated. Our framework seamlessly bridges the gap between visual perception and LLMs and outperforms previous VLMs by a large margin on compositional reasoning benchmarks (e.g., ~20% in HICO-DET mAP, ~14% in Cola top-1 accuracy, and ~3% in ARO top-1 accuracy). We also achieve state-of-the-art performance on traditional vision-language tasks such as referring expression comprehension and visual question answering.
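Abstract (reference translation): A remarkable human ability lies in compositional reasoning, that is, the capacity to make "infinite use of finite means". However, current large vision-language foundation models (VLMs) lack such compositional abilities because of their "bag-of-words" behavior and their inability to construct words that correctly represent visual entities and the relations among them. We therefore propose CoVLM, which guides the LLM to explicitly compose visual entities and relationships among the text and to communicate dynamically with the vision encoder and detection network, achieving vision-language communicative decoding. Specifically, we first devise a set of novel communication tokens for the LLM that enable dynamic communication between the visual detection system and the language system. The LLM generates a communication token after a visual entity or a relation to tell the detection network to propose regions relevant to the sentence generated so far. The proposed regions of interest (ROIs) are then fed back into the LLM to improve language generation conditioned on the relevant regions. The LLM can thus compose visual entities and relationships through the communication tokens. Vision-to-language and language-to-vision communication is performed iteratively until the entire sentence is generated. Our framework seamlessly bridges the gap between visual perception and LLMs and outperforms previous VLMs on compositional reasoning benchmarks (about 20% in HICO-DET mAP, about 14% in Cola top-1 accuracy, and about 3% in ARO top-1 accuracy). It also achieves state-of-the-art performance on traditional vision-language tasks such as referring expression comprehension and visual question answering.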
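
The decoding loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the token strings `<comm>` and `<eos>`, the box format, and the functions `communicative_decode`, `toy_next_token`, and `toy_propose_regions` are all hypothetical names chosen for illustration, and the language model and detection network are replaced by toy stand-ins. The sketch shows only the control flow: whenever a communication token is generated, a detector is asked for regions relevant to the text so far, and those regions are fed back as context for later tokens.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical box format (x1, y1, x2, y2); the abstract does not specify one.
Box = Tuple[float, float, float, float]

# Hypothetical token strings; the real communication tokens are not named here.
COMM_TOKEN = "<comm>"
EOS_TOKEN = "<eos>"


@dataclass
class DecodeState:
    tokens: List[str] = field(default_factory=list)
    # One list of proposed boxes per communication token generated so far.
    rois: List[List[Box]] = field(default_factory=list)


def communicative_decode(
    next_token: Callable[[List[str], List[List[Box]]], str],
    propose_regions: Callable[[List[str]], List[Box]],
    max_steps: int = 50,
) -> DecodeState:
    """Alternate language generation and region proposal until end of sentence."""
    state = DecodeState()
    for _ in range(max_steps):
        # The language side sees the text so far and all regions fed back so far.
        token = next_token(state.tokens, state.rois)
        if token == EOS_TOKEN:
            break
        state.tokens.append(token)
        if token == COMM_TOKEN:
            # Language-to-vision: request regions relevant to the text so far.
            boxes = propose_regions(state.tokens)
            # Vision-to-language: the regions become context for later tokens.
            state.rois.append(boxes)
    return state


def toy_next_token(tokens: List[str], rois: List[List[Box]]) -> str:
    # Stand-in for the LLM: replays a fixed script and ignores the regions.
    script = ["a", "person", COMM_TOKEN, "riding", COMM_TOKEN,
              "a", "horse", COMM_TOKEN, EOS_TOKEN]
    return script[len(tokens)]


def toy_propose_regions(tokens: List[str]) -> List[Box]:
    # Stand-in for the detection network: returns one dummy box per request.
    n = tokens.count(COMM_TOKEN)
    return [(0.1 * n, 0.1 * n, 0.5, 0.5)]


if __name__ == "__main__":
    result = communicative_decode(toy_next_token, toy_propose_regions)
    print(" ".join(result.tokens))
    print(len(result.rois), "region proposals fed back")
```

In this toy run a communication token follows each entity ("person", "horse") and the relation ("riding"), so the detector stand-in is queried three times. A real system would condition `next_token` on the returned regions rather than ignore them.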

Related papers