Seq2Seq Models Reconstruct Visual Jigsaw Puzzles without Seeing Them
- URL: http://arxiv.org/abs/2511.06315v1
- Date: Sun, 09 Nov 2025 10:43:16 GMT
- Title: Seq2Seq Models Reconstruct Visual Jigsaw Puzzles without Seeing Them
- Authors: Gur Elkin, Ofir Itzhak Shahar, Ohad Ben-Shahar
- Abstract summary: We introduce a specialized tokenizer that converts each puzzle piece into a discrete sequence of tokens. Treated as "blind" solvers, encoder-decoder transformers accurately reconstruct the original layout. Despite being deliberately restricted from accessing visual input, our models achieve state-of-the-art results.
- Score: 2.8834483859625952
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Jigsaw puzzles are primarily visual objects, whose algorithmic solutions have traditionally been framed from a visual perspective. In this work, however, we explore a fundamentally different approach: solving square jigsaw puzzles using language models, without access to raw visual input. By introducing a specialized tokenizer that converts each puzzle piece into a discrete sequence of tokens, we reframe puzzle reassembly as a sequence-to-sequence prediction task. Treated as "blind" solvers, encoder-decoder transformers accurately reconstruct the original layout by reasoning over token sequences alone. Despite being deliberately restricted from accessing visual input, our models achieve state-of-the-art results across multiple benchmarks, often outperforming vision-based methods. These findings highlight the surprising capability of language models to solve problems beyond their native domain, and suggest that unconventional approaches can inspire promising directions for puzzle-solving research.
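The sequence-to-sequence framing described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not specify the tokenizer, so `tokenize_piece` simply quantizes pixel intensities into a few bins, and the names `split_into_pieces` and `make_example` are hypothetical. It only shows how shuffled pieces could become a token-sequence input paired with a permutation target; it is not the authors' method.

```python
import random


def split_into_pieces(image, piece_size):
    """Cut a 2-D grid (list of rows) into square pieces, in row-major order."""
    rows, cols = len(image), len(image[0])
    pieces = []
    for r in range(0, rows, piece_size):
        for c in range(0, cols, piece_size):
            pieces.append([row[c:c + piece_size] for row in image[r:r + piece_size]])
    return pieces


def tokenize_piece(piece, num_bins=8, max_value=255):
    """Hypothetical tokenizer (an assumption, not the paper's): map every
    pixel intensity to one of num_bins integer tokens."""
    return [min(value * num_bins // (max_value + 1), num_bins - 1)
            for row in piece for value in row]


def make_example(image, piece_size, seed=0):
    """Build one (source, target) pair for a seq2seq model.

    source: token sequences of the shuffled pieces.
    target: target[k] is the original row-major position of the k-th
            shuffled piece, i.e. the permutation the model should predict.
    """
    pieces = split_into_pieces(image, piece_size)
    order = list(range(len(pieces)))
    random.Random(seed).shuffle(order)
    source = [tokenize_piece(pieces[i]) for i in order]
    return source, order


# A toy 4x4 grayscale "image" cut into four 2x2 pieces.
image = [[(r * 4 + c) * 16 for c in range(4)] for r in range(4)]
source, target = make_example(image, piece_size=2)
```

The model never receives pixels in this setup, only the integer tokens in `source`; a real tokenizer would need to preserve enough boundary information for neighbouring pieces to be matched.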
Related papers
- Solving Convex Partition Visual Jigsaw Puzzles [3.0427549266235125]
Jigsaw puzzle solving requires rearrangement of unordered pieces into their original pose in order to reconstruct a coherent whole. Most of the literature has focused on developing solvers for square jigsaw puzzles, severely limiting their practical use. In this work, we significantly expand the types of puzzles handled computationally, focusing on what is known as Convex Partitions.
arXiv Detail & Related papers (2025-11-06T15:22:46Z)
- PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts [47.92619068073141]
We introduce PuzzleWorld, a large-scale benchmark of 667 puzzlehunt-style problems designed to assess step-by-step, open-ended, and creative multimodal reasoning. Most state-of-the-art models achieve only 1-2% final answer accuracy, with the best model solving only 14% of puzzles and reaching 40% stepwise accuracy. Our error analysis reveals that current models exhibit myopic reasoning, are bottlenecked by the limitations of language-based inference, and lack sketching capabilities crucial for visual and spatial reasoning.
arXiv Detail & Related papers (2025-06-06T16:17:09Z)
- Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint [57.73346054360675]
Rebus puzzles, visual riddles that encode language through imagery, spatial arrangement, and symbolic substitution, pose a unique challenge to current vision-language models (VLMs). In this paper, we investigate the capacity of contemporary VLMs to interpret and solve rebus puzzles by constructing a hand-generated and annotated benchmark of diverse English-language rebus puzzles.
arXiv Detail & Related papers (2025-05-29T17:59:47Z)
- Solving Masked Jigsaw Puzzles with Diffusion Vision Transformers [5.374411622670979]
Image and video jigsaw puzzles pose the challenging task of rearranging image fragments or video frames from unordered sequences to restore meaningful images and video sequences.
Existing approaches often hinge on discriminative models tasked with predicting either the absolute positions of puzzle elements or the permutation actions applied to the original data.
We propose JPDVT, an innovative approach that harnesses diffusion transformers to address this challenge.
arXiv Detail & Related papers (2024-04-10T18:40:23Z)
- Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning [24.386388107656334]
This paper introduces the novel task of multimodal puzzle solving, framed within the context of visual question-answering.
We present a new dataset, AlgoVQA, designed to challenge and evaluate the capabilities of multimodal language models in solving algorithmic puzzles.
arXiv Detail & Related papers (2024-03-06T17:15:04Z)
- Multi-Phase Relaxation Labeling for Square Jigsaw Puzzle Solving [73.58829980121767]
We present a novel method for solving square jigsaw puzzles based on global optimization.
The method is fully automatic, assumes no prior information, and can handle puzzles with known or unknown piece orientation.
arXiv Detail & Related papers (2023-03-26T18:53:51Z)
- Automated Graph Genetic Algorithm based Puzzle Validation for Faster Game Design [69.02688684221265]
This paper presents an evolutionary algorithm, informed by expert knowledge, for solving logical puzzles in video games efficiently.
We discuss multiple variations of hybrid genetic approaches for constraint satisfaction problems that allow us to find a diverse set of near-optimal solutions for puzzles.
arXiv Detail & Related papers (2023-02-17T18:15:33Z)
- Video Anomaly Detection by Solving Decoupled Spatio-Temporal Jigsaw Puzzles [67.39567701983357]
Video Anomaly Detection (VAD) is an important topic in computer vision.
Motivated by the recent advances in self-supervised learning, this paper addresses VAD by solving an intuitive yet challenging pretext task.
Our method outperforms state-of-the-art counterparts on three public benchmarks.
arXiv Detail & Related papers (2022-07-20T19:49:32Z)
- GANzzle: Reframing jigsaw puzzle solving as a retrieval task using a generative mental image [15.132848477903314]
We infer a mental image from all pieces, which a given piece can then be matched against, avoiding the combinatorial explosion.
We learn how to reconstruct the image given a set of unordered pieces, allowing the model to learn a joint embedding space to match an encoding of each piece to the cropped layer of the generator.
In doing so our model is puzzle size agnostic, in contrast to prior deep learning methods which are single size.
arXiv Detail & Related papers (2022-07-12T16:02:00Z)
- Graph Jigsaw Learning for Cartoon Face Recognition [79.29656077338828]
It is difficult to learn a shape-oriented representation for cartoon face recognition with convolutional neural networks (CNNs).
We propose the GraphJigsaw that constructs jigsaw puzzles at various stages in the classification network and solves the puzzles with the graph convolutional network (GCN) in a progressive manner.
Our proposed GraphJigsaw consistently outperforms other face recognition or jigsaw-based methods on two popular cartoon face datasets.
arXiv Detail & Related papers (2021-07-14T08:01:06Z)
- Pictorial and apictorial polygonal jigsaw puzzles: The lazy caterer model, properties, and solvers [14.08706290287121]
We formalize a new type of jigsaw puzzle where the pieces are general convex polygons generated by cutting through a global polygonal shape/image with an arbitrary number of straight cuts.
We analyze the theoretical properties of such puzzles, including the inherent challenges in solving them once pieces are contaminated with geometrical noise.
arXiv Detail & Related papers (2020-08-17T22:07:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.