Solving Masked Jigsaw Puzzles with Diffusion Vision Transformers
- URL: http://arxiv.org/abs/2404.07292v1
- Date: Wed, 10 Apr 2024 18:40:23 GMT
- Title: Solving Masked Jigsaw Puzzles with Diffusion Vision Transformers
- Authors: Jinyang Liu, Wondmgezahu Teshome, Sandesh Ghimire, Mario Sznaier, Octavia Camps
- Abstract summary: Image and video jigsaw puzzles pose the challenging task of rearranging image fragments or video frames from unordered sequences to restore meaningful images and video sequences.
Existing approaches often hinge on discriminative models tasked with predicting either the absolute positions of puzzle elements or the permutation actions applied to the original data.
We propose JPDVT, an innovative approach that harnesses diffusion transformers to address this challenge.
- Score: 5.374411622670979
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Solving image and video jigsaw puzzles poses the challenging task of rearranging image fragments or video frames from unordered sequences to restore meaningful images and video sequences. Existing approaches often hinge on discriminative models tasked with predicting either the absolute positions of puzzle elements or the permutation actions applied to the original data. Unfortunately, these methods face limitations in effectively solving puzzles with a large number of elements. In this paper, we propose JPDVT, an innovative approach that harnesses diffusion transformers to address this challenge. Specifically, we generate positional information for image patches or video frames, conditioned on their underlying visual content. This information is then employed to accurately assemble the puzzle pieces in their correct positions, even in scenarios involving missing pieces. Our method achieves state-of-the-art performance on several datasets.
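The abstract's final step, placing each piece once positional information has been generated for it, can be pictured with a small sketch. This is not the paper's implementation. It assumes, hypothetically, that the model outputs one noisy (row, col) coordinate per piece, and it swaps in a simple greedy nearest-cell matching. The function name `assemble` and the sample values are illustrative only.

```python
from itertools import product


def assemble(pred_positions, grid_h, grid_w):
    """Map each piece to a unique grid cell from predicted (row, col) floats.

    pred_positions holds one coordinate pair per available piece; there may be
    fewer pieces than cells, which models missing pieces.
    Returns a dict {piece_index: (row, col)}.
    """
    cells = list(product(range(grid_h), range(grid_w)))
    # Every (squared distance, piece, cell) candidate, cheapest first.
    candidates = sorted(
        ((r - pr) ** 2 + (c - pc) ** 2, i, (r, c))
        for i, (pr, pc) in enumerate(pred_positions)
        for (r, c) in cells
    )
    placement, used = {}, set()
    for _, i, cell in candidates:
        # Greedy matching: accept a pair only if both piece and cell are free.
        if i not in placement and cell not in used:
            placement[i] = cell
            used.add(cell)
    return placement


# Three pieces of a 2x2 puzzle; the fourth piece is missing.
noisy = [(0.1, 0.9), (0.9, 0.2), (0.05, 0.1)]
layout = assemble(noisy, 2, 2)
```

With these sample values, cell (1, 1) stays empty, which is how a missing piece shows up in this sketch. A greedy pass is not guaranteed to be optimal; an optimal assignment solver could replace it when predictions are noisier.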
Related papers
- Obtaining Favorable Layouts for Multiple Object Generation [50.616875565173274]
Large-scale text-to-image models can generate high-quality and diverse images based on textual prompts.
However, the existing state-of-the-art diffusion models face difficulty when generating images that involve multiple subjects.
We propose a novel approach based on a guiding principle. We allow the diffusion model to initially propose a layout, and then we rearrange the layout grid.
This is achieved by enforcing cross-attention maps (XAMs) to adhere to proposed masks and by migrating pixels from latent maps to new locations determined by us.
arXiv Detail & Related papers (2024-05-01T18:07:48Z)
- MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning [54.73173491543553]
MoviePuzzle is a novel challenge that targets visual narrative reasoning and holistic movie understanding.
To tackle this quandary, we put forth the MoviePuzzle task, which amplifies the temporal feature learning and structure learning of video models.
Our approach outperforms existing state-of-the-art methods on the MoviePuzzle benchmark.
arXiv Detail & Related papers (2023-06-04T03:51:54Z)
- Siamese Masked Autoencoders [76.35448665609998]
We present Siamese Masked Autoencoders (SiamMAE) for learning visual correspondence from videos.
SiamMAE operates on pairs of randomly sampled video frames and asymmetrically masks them.
It outperforms state-of-the-art self-supervised methods on video object segmentation, pose keypoint propagation, and semantic part propagation tasks.
arXiv Detail & Related papers (2023-05-23T17:59:46Z)
- A Light Touch Approach to Teaching Transformers Multi-view Geometry [80.35521056416242]
We propose a "light touch" approach to guiding visual Transformers to learn multiple-view geometry.
We achieve this by using epipolar lines to guide the Transformer's cross-attention maps.
Unlike previous methods, our proposal does not require any camera pose information at test-time.
arXiv Detail & Related papers (2022-11-28T07:54:06Z)
- Video Anomaly Detection by Solving Decoupled Spatio-Temporal Jigsaw Puzzles [67.39567701983357]
Video Anomaly Detection (VAD) is an important topic in computer vision.
Motivated by the recent advances in self-supervised learning, this paper addresses VAD by solving an intuitive yet challenging pretext task.
Our method outperforms state-of-the-art counterparts on three public benchmarks.
arXiv Detail & Related papers (2022-07-20T19:49:32Z)
- GANzzle: Reframing jigsaw puzzle solving as a retrieval task using a generative mental image [15.132848477903314]
We infer a mental image from all pieces, which a given piece can then be matched against, avoiding the combinatorial explosion.
We learn how to reconstruct the image given a set of unordered pieces, allowing the model to learn a joint embedding space to match an encoding of each piece to the cropped layer of the generator.
In doing so our model is puzzle size agnostic, in contrast to prior deep learning methods which are single size.
arXiv Detail & Related papers (2022-07-12T16:02:00Z)
- JigsawGAN: Self-supervised Learning for Solving Jigsaw Puzzles with Generative Adversarial Networks [31.190344964881625]
The paper proposes a solution based on a Generative Adversarial Network (GAN) for solving jigsaw puzzles.
The proposed method can solve jigsaw puzzles more efficiently by utilizing both semantic information and edge information simultaneously.
arXiv Detail & Related papers (2021-01-19T10:40:38Z)
- Non-Rigid Puzzles [50.213265511586535]
We present a non-rigid multi-part shape matching algorithm.
We assume that we are given a reference shape and its multiple parts undergoing a non-rigid deformation.
Experimental results on synthetic as well as real scans demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2020-11-26T00:32:30Z)
- Pictorial and apictorial polygonal jigsaw puzzles: The lazy caterer model, properties, and solvers [14.08706290287121]
We formalize a new type of jigsaw puzzle where the pieces are general convex polygons generated by cutting through a global polygonal shape/image with an arbitrary number of straight cuts.
We analyze the theoretical properties of such puzzles, including the inherent challenges in solving them once pieces are contaminated with geometrical noise.
arXiv Detail & Related papers (2020-08-17T22:07:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.