Fugu-MT Paper Translation (Abstract): Bongard-RWR+: Real-World Representations of Fine-Grained Concepts in Bongard Problems

Paper abstract: Bongard-RWR+: Real-World Representations of Fine-Grained Concepts in Bongard Problems

arxiv url: http://arxiv.org/abs/2508.12026v1
Date: Sat, 16 Aug 2025 12:26:44 GMT
Status: Translation complete
Last updated in system: 2025-08-19 14:49:10.518737
Title: Bongard-RWR+: Real-World Representations of Fine-Grained Concepts in Bongard Problems
Title (reference translation): Bongard-RWR+: Real-World Representations of Fine-Grained Concepts in Bongard Problems
Authors: Szymon Pawlonka, Mikołaj Małkiński, Jacek Mańdziuk
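Abstract summary: Bongard Problems (BPs) provide a challenging testbed for abstract visual reasoning (AVR). Bongard-RWR+ is a dataset of 5,400 instances that represents the abstract concepts of BPs using real-world-like images.
Reference score (independently computed attention score): 0.0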
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Bongard Problems (BPs) provide a challenging testbed for abstract visual reasoning (AVR), requiring models to identify visual concepts from just a few examples and describe them in natural language. Early BP benchmarks featured synthetic black-and-white drawings, which might not fully capture the complexity of real-world scenes. Subsequent BP datasets employed real-world images, albeit the represented concepts are identifiable from high-level image features, reducing the task complexity. Differently, the recently released Bongard-RWR dataset aimed at representing abstract concepts formulated in the original BPs using fine-grained real-world images. Its manual construction, however, limited the dataset size to just $60$ instances, constraining evaluation robustness. In this work, we introduce Bongard-RWR+, a BP dataset composed of $5\,400$ instances that represent original BP abstract concepts using real-world-like images generated via a vision language model (VLM) pipeline. Building on Bongard-RWR, we employ Pixtral-12B to describe manually curated images and generate new descriptions aligned with the underlying concepts, use Flux.1-dev to synthesize images from these descriptions, and manually verify that the generated images faithfully reflect the intended concepts. We evaluate state-of-the-art VLMs across diverse BP formulations, including binary and multiclass classification, as well as textual answer generation. Our findings reveal that while VLMs can recognize coarse-grained visual concepts, they consistently struggle with discerning fine-grained concepts, highlighting limitations in their reasoning capabilities.
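Abstract (reference translation): Bongard Problems (BPs) provide a challenging testbed for abstract visual reasoning (AVR). Early BP benchmarks featured synthetic black-and-white drawings, which may not fully capture the complexity of real-world scenes. Subsequent BP datasets used real-world images, but the represented concepts are identifiable from high-level image features, which reduces the task complexity. In contrast, the recently released Bongard-RWR dataset aims to represent the abstract concepts formulated in the original BPs using fine-grained real-world images. However, its manual construction limited the dataset size to 60 instances, constraining evaluation robustness. This paper introduces Bongard-RWR+, a BP dataset of 5,400 instances that represent BP abstract concepts using real-world-like images generated with a vision language model (VLM) pipeline. Building on Bongard-RWR, Pixtral-12B is used to describe manually curated images and generate new descriptions aligned with the underlying concepts, Flux.1-dev is used to synthesize images from these descriptions, and the generated images are checked to confirm that they faithfully reflect the intended concepts. State-of-the-art VLMs are evaluated across various BP formulations, including binary classification, multiclass classification, and textual answer generation. The results show that while VLMs can recognize coarse-grained concepts, they consistently struggle to discern fine-grained concepts, highlighting the limits of their reasoning capabilities.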
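
The describe, re-describe, synthesize, verify loop summarized in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the authors' code: the four callables (`describe`, `vary`, `synthesize`, `verify`) are hypothetical stand-ins for Pixtral-12B captioning, Pixtral-12B description generation, Flux.1-dev image synthesis, and the manual check, and no real model API is called. The `GeneratedImage` record and the function name are likewise assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class GeneratedImage:
    """Hypothetical record for one accepted synthetic image."""
    source_image: str  # identifier of the manually curated seed image
    description: str   # text description the image was synthesized from
    image: str         # placeholder for the synthesized image (e.g. a path)


def generate_candidates(
    seed_images: Iterable[str],
    concept: str,
    describe: Callable[[str], str],           # stand-in for VLM captioning
    vary: Callable[[str, str], List[str]],    # stand-in for new-description generation
    synthesize: Callable[[str], str],         # stand-in for text-to-image synthesis
    verify: Callable[[str, str], bool],       # stand-in for the manual check
) -> List[GeneratedImage]:
    """Run the four stages in order and keep only verified images."""
    accepted: List[GeneratedImage] = []
    for seed in seed_images:
        base_description = describe(seed)
        # Each variation is meant to stay aligned with the same concept.
        for description in vary(base_description, concept):
            image = synthesize(description)
            if verify(image, concept):
                accepted.append(GeneratedImage(seed, description, image))
    return accepted
```

Passing the stages in as callables keeps the sketch independent of any particular model library; in practice the verification step described in the abstract is performed by people rather than by a function.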

Related papers