Fugu-MT Paper Translation (Abstract): Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World

Paper summary: Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World

arxiv url: http://arxiv.org/abs/2310.10207v4
Date: Tue, 12 Mar 2024 10:57:49 GMT
Status: Translation complete
System update date: 2024-03-13 15:53:02.699066
Title: Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World
Title (reference translation): Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World
Authors: Rujie Wu, Xiaojian Ma, Zhenliang Zhang, Wei Wang, Qing Li, Song-Chun Zhu, Yizhou Wang
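Abstract summary: Bongard-OpenWorld is a new benchmark for evaluating real-world few-shot reasoning for machine vision. It already poses a significant challenge to current few-shot reasoning algorithms.
Reference score (independently computed attention score): 60.73230167638598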
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce Bongard-OpenWorld, a new benchmark for evaluating real-world few-shot reasoning for machine vision. It originates from the classical Bongard Problems (BPs): given two sets of images (positive and negative), the model needs to identify the set that each query image belongs to by inducing the visual concept that is depicted exclusively by images from the positive set. Our benchmark inherits the few-shot concept induction of the original BPs while adding two novel layers of challenge: 1) open-world free-form concepts, as the visual concepts in Bongard-OpenWorld are unique compositions of terms from an open vocabulary, ranging from object categories to abstract visual attributes and commonsense factual knowledge; 2) real-world images, as opposed to the synthetic diagrams used by many counterparts. In our exploration, Bongard-OpenWorld already poses a significant challenge to current few-shot reasoning algorithms. We further investigate to what extent the recently introduced Large Language Models (LLMs) and Vision-Language Models (VLMs) can solve our task, by directly probing VLMs and by combining VLMs and LLMs in an interactive reasoning scheme. We even conceived a neuro-symbolic reasoning approach that reconciles LLMs and VLMs with logical reasoning to emulate the human problem-solving process for Bongard Problems. However, none of these approaches manages to close the human-machine gap: the best learner achieves 64% accuracy, while human participants easily reach 91%. We hope Bongard-OpenWorld can help us better understand the limitations of current visual intelligence and facilitate future research on visual agents with stronger few-shot visual reasoning capabilities.
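Abstract (reference translation): Bongard-OpenWorld is a new benchmark for evaluating real-world few-shot reasoning for machine vision. It derives from the classical Bongard Problems (BPs): given two sets of images (positive and negative), the model must identify which set each query image belongs to by inducing the visual concept that is depicted only by images from the positive set. Our benchmark inherits the few-shot concept induction of the original BPs and adds two new challenges: 1) open-world free-form concepts, since the visual concepts in Bongard-OpenWorld are unique compositions of terms from an open vocabulary, ranging from object categories to abstract visual attributes and commonsense factual knowledge; 2) real-world images, in contrast to the synthetic diagrams used by many counterparts. In our investigation, Bongard-OpenWorld already poses a significant challenge to current few-shot reasoning algorithms. We also examine to what extent the recently introduced Large Language Models (LLMs) and Vision-Language Models (VLMs) can solve the task, by probing VLMs directly and by combining VLMs and LLMs in an interactive reasoning scheme. We further devised a neuro-symbolic reasoning approach that reconciles LLMs and VLMs with logical reasoning to emulate the human problem-solving process for Bongard Problems. However, none of these approaches closes the human-machine gap: the best learner achieves 64% accuracy, while human participants easily reach 91%. We hope Bongard-OpenWorld will help deepen understanding of the limitations of current visual intelligence and promote future research on visual agents with stronger few-shot visual reasoning capabilities.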

Related papers