Bongard-RWR+: Real-World Representations of Fine-Grained Concepts in Bongard Problems
- URL: http://arxiv.org/abs/2508.12026v1
- Date: Sat, 16 Aug 2025 12:26:44 GMT
- Title: Bongard-RWR+: Real-World Representations of Fine-Grained Concepts in Bongard Problems
- Authors: Szymon Pawlonka, Mikołaj Małkiński, Jacek Mańdziuk
- Abstract summary: Bongard Problems (BPs) provide a challenging testbed for abstract visual reasoning (AVR). We introduce Bongard-RWR+, a dataset composed of $5,400$ instances that represent original BP abstract concepts using real-world-like images.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Bongard Problems (BPs) provide a challenging testbed for abstract visual reasoning (AVR), requiring models to identify visual concepts from just a few examples and describe them in natural language. Early BP benchmarks featured synthetic black-and-white drawings, which might not fully capture the complexity of real-world scenes. Subsequent BP datasets employed real-world images, though the represented concepts are identifiable from high-level image features, which reduces task complexity. In contrast, the recently released Bongard-RWR dataset aimed to represent the abstract concepts formulated in the original BPs using fine-grained real-world images. Its manual construction, however, limited the dataset size to just $60$ instances, constraining evaluation robustness. In this work, we introduce Bongard-RWR+, a BP dataset composed of $5\,400$ instances that represent original BP abstract concepts using real-world-like images generated via a vision language model (VLM) pipeline. Building on Bongard-RWR, we employ Pixtral-12B to describe manually curated images and generate new descriptions aligned with the underlying concepts, use Flux.1-dev to synthesize images from these descriptions, and manually verify that the generated images faithfully reflect the intended concepts. We evaluate state-of-the-art VLMs across diverse BP formulations, including binary and multiclass classification, as well as textual answer generation. Our findings reveal that while VLMs can recognize coarse-grained visual concepts, they consistently struggle to discern fine-grained concepts, highlighting limitations in their reasoning capabilities.
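The describe-generate-verify pipeline in the abstract can be sketched as follows. This is a minimal, hypothetical illustration of the data flow only: `describe_image` and `synthesize_image` are stand-ins for the actual Pixtral-12B and Flux.1-dev inference calls, and `verify` stands in for the paper's manual verification step; none of these names come from the authors' code.

```python
def describe_image(image_path: str, concept: str) -> str:
    """Stand-in for Pixtral-12B: produce a concept-aligned description
    of a manually curated source image."""
    return f"a real-world scene showing {concept} (source: {image_path})"


def synthesize_image(description: str) -> str:
    """Stand-in for Flux.1-dev: text-to-image synthesis from a description.
    Returns a placeholder token instead of actual pixels."""
    return f"image<{description}>"


def verify(generated_image: str, concept: str) -> bool:
    """Stand-in for manual verification that the generated image
    faithfully reflects the intended concept."""
    return concept in generated_image


def build_instances(curated_images: list[str], concept: str) -> list[str]:
    """Describe -> synthesize -> verify; keep only faithful generations."""
    kept = []
    for img in curated_images:
        description = describe_image(img, concept)
        generated = synthesize_image(description)
        if verify(generated, concept):
            kept.append(generated)
    return kept
```

In the real pipeline each stage is a model (or human) in the loop rather than a pure function, but the filtering structure is the same: only generations that pass verification enter the final $5\,400$-instance dataset.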
Related papers
- Referring Layer Decomposition [25.128453386102887]
We introduce the Referring Layer Decomposition (RLD) task, which predicts complete RGBA layers from a single RGB image. At its core is RefLade, a large-scale dataset comprising 1.11M image-layer-prompt triplets produced by our scalable data engine. We present RefLayer, a simple baseline designed for prompt-conditioned layer decomposition, achieving high visual fidelity and semantic alignment.
arXiv Detail & Related papers (2026-02-22T22:05:17Z) - Perceive, Understand and Restore: Real-World Image Super-Resolution with Autoregressive Multimodal Generative Models [33.76031793753807]
We adapt the autoregressive multimodal model Lumina-mGPT into a robust Real-ISR model, namely PURE. PURE Perceives and Understands the input low-quality image, then REstores its high-quality counterpart. Experimental results demonstrate that PURE preserves image content while generating realistic details.
arXiv Detail & Related papers (2025-03-14T04:33:59Z) - Reasoning Limitations of Multimodal Large Language Models. A Case Study of Bongard Problems [0.0]
Bongard Problems (BPs) remain a key challenge in visual reasoning. We investigate whether multimodal large language models (MLLMs) can solve BPs. We introduce Bongard-RWR, a dataset representing synthetic BP concepts using real-world images.
arXiv Detail & Related papers (2024-11-02T08:06:30Z) - Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World [57.832261258993526]
Bongard-OpenWorld is a new benchmark for evaluating real-world few-shot reasoning for machine vision. It already imposes a significant challenge to current few-shot reasoning algorithms.
arXiv Detail & Related papers (2023-10-16T09:19:18Z) - Towards Real-World Burst Image Super-Resolution: Benchmark and Method [93.73429028287038]
In this paper, we establish a large-scale real-world burst super-resolution dataset, i.e., RealBSR, to explore the faithful reconstruction of image details from multiple frames.
We also introduce a Federated Burst Affinity network (FBAnet) to investigate non-trivial pixel-wise displacement among images under real-world image degradation.
arXiv Detail & Related papers (2023-09-09T14:11:37Z) - Does Visual Pretraining Help End-to-End Reasoning? [81.4707017038019]
We investigate whether end-to-end learning of visual reasoning can be achieved with general-purpose neural networks.
We propose a simple and general self-supervised framework which "compresses" each video frame into a small set of tokens.
We observe that pretraining is essential to achieve compositional generalization for end-to-end visual reasoning.
arXiv Detail & Related papers (2023-07-17T14:08:38Z) - Fully Context-Aware Image Inpainting with a Learned Semantic Pyramid [102.24539566851809]
Restoring reasonable and realistic content for arbitrary missing regions in images is an important yet challenging task.
Recent image inpainting models have made significant progress in generating vivid visual details, but they can still lead to texture blurring or structural distortions.
We propose the Semantic Pyramid Network (SPN) motivated by the idea that learning multi-scale semantic priors can greatly benefit the recovery of locally missing content in images.
arXiv Detail & Related papers (2021-12-08T04:33:33Z) - Palette: Image-to-Image Diffusion Models [50.268441533631176]
We introduce Palette, a simple and general framework for image-to-image translation using conditional diffusion models.
On four challenging image-to-image translation tasks, Palette outperforms strong GAN and regression baselines.
We report several sample quality scores including FID, Inception Score, Classification Accuracy of a pre-trained ResNet-50, and Perceptual Distance against reference images.
arXiv Detail & Related papers (2021-11-10T17:49:29Z) - Structural-analogy from a Single Image Pair [118.61885732829117]
In this paper, we explore the capabilities of neural networks to understand image structure given only a single pair of images, A and B.
We generate an image that keeps the appearance and style of B, but has a structural arrangement that corresponds to A.
Our method can be used to generate high quality imagery in other conditional generation tasks utilizing images A and B only.
arXiv Detail & Related papers (2020-04-05T14:51:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.