Related papers: VLHSA: Vision-Language Hierarchical Semantic Alignment for Jigsaw Puzzle Solving with Eroded Gaps

VLHSA: Vision-Language Hierarchical Semantic Alignment for Jigsaw Puzzle Solving with Eroded Gaps

URL: http://arxiv.org/abs/2509.25202v1
Date: Wed, 17 Sep 2025 20:40:52 GMT
Title: VLHSA: Vision-Language Hierarchical Semantic Alignment for Jigsaw Puzzle Solving with Eroded Gaps
Authors: Zhuoning Xu, Xinyan Liu,
Abstract summary: We propose a vision-language framework that leverages textual context to enhance puzzle assembly performance.<n>Our approach centers on the Vision-Language Hierarchical Semantic Alignment (VLHSA) module.<n>Our work establishes a new paradigm for jigsaw puzzle solving by incorporating multimodal semantic insights.
Score: 3.6380495892295173
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Jigsaw puzzle solving remains challenging in computer vision, requiring an understanding of both local fragment details and global spatial relationships. While most traditional approaches only focus on visual cues like edge matching and visual coherence, few methods explore natural language descriptions for semantic guidance in challenging scenarios, especially for eroded gap puzzles. We propose a vision-language framework that leverages textual context to enhance puzzle assembly performance. Our approach centers on the Vision-Language Hierarchical Semantic Alignment (VLHSA) module, which aligns visual patches with textual descriptions through multi-level semantic matching from local tokens to global context. Also, a multimodal architecture that combines dual visual encoders with language features for cross-modal reasoning is integrated into this module. Experiments demonstrate that our method significantly outperforms state-of-the-art models across various datasets, achieving substantial improvements, including a 14.2 percentage point gain in piece accuracy. Ablation studies confirm the critical role of the VLHSA module in driving improvements over vision-only approaches. Our work establishes a new paradigm for jigsaw puzzle solving by incorporating multimodal semantic insights.

Related papers

Revisiting Multi-Task Visual Representation Learning [52.93947931352643]
We introduce MTV, a principled multi-task visual pretraining framework.<n>We leverage high-capacity "expert" models to synthesize dense, structured pseudo-labels at scale.<n>Our results demonstrate that MTV achieves "best-of-both-worlds" performance.
arXiv Detail & Related papers (2026-01-20T11:59:19Z)
Eye-Q: A Multilingual Benchmark for Visual Word Puzzle Solving and Image-to-Phrase Reasoning [1.6234264741872295]
Vision-Language Models (VLMs) have achieved strong performance on standard vision-language benchmarks.<n>We propose visual word puzzles as a challenging alternative, as they require discovering implicit visual cues, generating and revising hypotheses, and mapping evidence to non-literal concepts.<n>We introduce Eye-Q, a multilingual benchmark designed to assess this form of complex visual understanding.
arXiv Detail & Related papers (2026-01-06T20:27:29Z)
A Multimodal Depth-Aware Method For Embodied Reference Understanding [56.30142869506262]
Embodied Reference Understanding requires identifying a target object in a visual scene based on both language instructions and pointing cues.<n>We propose a novel ERU framework that jointly leverages data augmentation, depth-map modality, and a depth-aware decision module.
arXiv Detail & Related papers (2025-10-09T14:32:21Z)
GLip: A Global-Local Integrated Progressive Framework for Robust Visual Speech Recognition [72.29071664964633]
We propose GLip, a Global-Local Integrated Progressive framework designed for robust visual speech recognition (VSR)<n>GLip learns to align both global and local visual features with corresponding acoustic speech units using easily accessible audio-visual data.<n>In the second stage, we introduce a Contextual Enhancement Module (CEM) to dynamically integrate local features with relevant global context.
arXiv Detail & Related papers (2025-09-19T14:36:01Z)
Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint [57.73346054360675]
Rebus puzzles, visual riddles that encode language through imagery, spatial arrangement, and symbolic substitution, pose a unique challenge to current vision-language models (VLMs)<n>In this paper, we investigate the capacity of contemporary VLMs to interpret and solve rebus puzzles by constructing a hand-generated and annotated benchmark of diverse English-language rebus puzzles.
arXiv Detail & Related papers (2025-05-29T17:59:47Z)
More Pictures Say More: Visual Intersection Network for Open Set Object Detection [4.206612461069489]
We introduce a strong DETR-based model, Visual Intersection Network for Open Set Object Detection (VINO) VINO constructs a multi-image visual bank to preserve the semantic intersections of each category across all time steps. Our approach guarantees a more precise alignment between target category semantics and region semantics, while significantly reducing pre-training time and resource demands.
arXiv Detail & Related papers (2024-08-26T05:52:35Z)
Exploring Interactive Semantic Alignment for Efficient HOI Detection with Vision-language Model [3.3772986620114387]
We introduce ISA-HOI, which extensively leverages knowledge from CLIP, aligning interactive semantics between visual and textual features. Our method achieves competitive results on the HICO-DET and V-COCO benchmarks with much fewer training epochs, and outperforms the state-of-the-art under zero-shot settings.
arXiv Detail & Related papers (2024-04-19T07:24:32Z)
Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects [11.117055725415446]
Large Vision Language Models (LVLMs) have demonstrated impressive zero-shot capabilities in various vision-language dialogue scenarios. The absence of fine-grained visual object detection hinders the model from understanding the details of images, leading to irreparable visual hallucinations and factual errors. We propose Lyrics, a novel multi-modal pre-training and instruction fine-tuning paradigm that bootstraps vision-language alignment from fine-grained cross-modal collaboration.
arXiv Detail & Related papers (2023-12-08T09:02:45Z)
Linguistic Query-Guided Mask Generation for Referring Image Segmentation [10.130530501400079]
Referring image segmentation aims to segment the image region of interest according to the given language expression. We propose an end-to-end framework built on transformer to perform Linguistic query-Guided mask generation.
arXiv Detail & Related papers (2023-01-16T13:38:22Z)
Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features. Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z)
Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks. Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts. We introduce LO, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single stream model that aligns the modalities at multiple levels. We achieve this using two novel tasks: symmetric cross-modality reconstruction and a pseudo-labeled key word prediction. We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.