GenVP: Generating Visual Puzzles with Contrastive Hierarchical VAEs
- URL: http://arxiv.org/abs/2503.23598v1
- Date: Sun, 30 Mar 2025 21:35:26 GMT
- Title: GenVP: Generating Visual Puzzles with Contrastive Hierarchical VAEs
- Authors: Kalliopi Basioti, Pritish Sahu, Qingze Tony Liu, Zihao Xu, Hao Wang, Vladimir Pavlovic
- Abstract summary: We propose Generative Visual Puzzles (GenVP) to model the entire RPM generation process. Our model's capability spans from generating multiple solutions for one specific problem prompt to creating complete new puzzles out of the desired set of rules.
- Score: 15.118234858274679
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Raven's Progressive Matrices (RPMs) is an established benchmark to examine the ability to perform high-level abstract visual reasoning (AVR). Despite the current success of algorithms that solve this task, humans can generalize beyond a given puzzle and create new puzzles given a set of rules, whereas machines remain locked in solving a fixed puzzle from a curated choice list. We propose Generative Visual Puzzles (GenVP), a framework to model the entire RPM generation process, a substantially more challenging task. Our model's capability spans from generating multiple solutions for one specific problem prompt to creating complete new puzzles out of the desired set of rules. Experiments on five different datasets indicate that GenVP achieves state-of-the-art (SOTA) performance both in puzzle-solving accuracy and out-of-distribution (OOD) generalization in 22 OOD scenarios. Compared to SOTA generative approaches, which struggle to solve RPMs when the feasible solution space increases, GenVP efficiently generalizes to these challenging setups. Moreover, our model demonstrates the ability to produce a wide range of complete RPMs given a set of abstract rules by effectively capturing the relationships between abstract rules and visual object properties.
Related papers
- PuzzleBench: A Fully Dynamic Evaluation Framework for Large Multimodal Models on Puzzle Solving [50.50405233978406]
We propose a fully dynamic multimodal evaluation framework, named Open-ended Visual Puzzle Generation (OVPG).
OVPG aims to generate fresh, diverse, and verifiable evaluation data automatically in puzzle-solving tasks.
Built upon OVPG, we construct PuzzleBench, a dynamic and scalable benchmark comprising 11,840 VQA samples.
arXiv Detail & Related papers (2025-04-15T05:29:31Z)
- ERL-MPP: Evolutionary Reinforcement Learning with Multi-head Puzzle Perception for Solving Large-scale Jigsaw Puzzles of Eroded Gaps [28.009783235854584]
We propose a framework of Evolutionary Reinforcement Learning with Multi-head Puzzle Perception.
The proposed ERL-MPP is evaluated on the JPLEG-5 dataset with large gaps and the MIT dataset with large-scale puzzles.
It significantly outperforms all state-of-the-art models on both datasets.
arXiv Detail & Related papers (2025-04-13T14:56:41Z)
- Improving Retrieval-Augmented Generation through Multi-Agent Reinforcement Learning [51.54046200512198]
Retrieval-augmented generation (RAG) is extensively utilized to incorporate external, current knowledge into large language models.
A standard RAG pipeline may comprise several components, such as query rewriting, document retrieval, document filtering, and answer generation.
To overcome these challenges, we propose treating the RAG pipeline as a multi-agent cooperative task, with each component regarded as an RL agent.
arXiv Detail & Related papers (2025-01-25T14:24:50Z)
- Generative AI Enabled Matching for 6G Multiple Access [51.00960374545361]
We propose a GenAI-enabled matching generation framework to support 6G multiple access.
We show that our framework can generate more effective matching strategies based on given conditions and predefined rewards.
arXiv Detail & Related papers (2024-10-29T13:01:26Z)
- Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning [24.386388107656334]
This paper introduces the novel task of multimodal puzzle solving, framed within the context of visual question-answering.
We present a new dataset, AlgoVQA, designed to challenge and evaluate the capabilities of multimodal language models in solving algorithmic puzzles.
arXiv Detail & Related papers (2024-03-06T17:15:04Z)
- Towards Generative Abstract Reasoning: Completing Raven's Progressive Matrix via Rule Abstraction and Selection [52.107043437362556]
Raven's Progressive Matrix (RPM) is widely used to probe abstract visual reasoning in machine intelligence.
Participators of RPM tests can show powerful reasoning ability by inferring and combining attribute-changing rules.
We propose a deep latent variable model for answer generation problems through Rule AbstractIon and SElection.
arXiv Detail & Related papers (2024-01-18T13:28:44Z)
- Are Deep Neural Networks SMARTer than Second Graders? [85.60342335636341]
We evaluate the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed for children in the 6--8 age group.
Our dataset consists of 101 unique puzzles; each puzzle comprises a picture question, and their solution needs a mix of several elementary skills, including arithmetic, algebra, and spatial reasoning.
Experiments reveal that while powerful deep models offer reasonable performances on puzzles in a supervised setting, they are not better than random accuracy when analyzed for generalization.
arXiv Detail & Related papers (2022-12-20T04:33:32Z)
- Video Anomaly Detection by Solving Decoupled Spatio-Temporal Jigsaw Puzzles [67.39567701983357]
Video Anomaly Detection (VAD) is an important topic in computer vision.
Motivated by the recent advances in self-supervised learning, this paper addresses VAD by solving an intuitive yet challenging pretext task.
Our method outperforms state-of-the-art counterparts on three public benchmarks.
arXiv Detail & Related papers (2022-07-20T19:49:32Z)
- GANzzle: Reframing jigsaw puzzle solving as a retrieval task using a generative mental image [15.132848477903314]
We infer a mental image from all pieces, which a given piece can then be matched against avoiding the explosion.
We learn how to reconstruct the image given a set of unordered pieces, allowing the model to learn a joint embedding space to match an encoding of each piece to the cropped layer of the generator.
In doing so our model is puzzle size agnostic, in contrast to prior deep learning methods which are single size.
arXiv Detail & Related papers (2022-07-12T16:02:00Z)