A Sketch+Text Composed Image Retrieval Dataset for Thangka
- URL: http://arxiv.org/abs/2602.08411v1
- Date: Mon, 09 Feb 2026 09:14:29 GMT
- Title: A Sketch+Text Composed Image Retrieval Dataset for Thangka
- Authors: Jinyu Xu, Yi Sun, Jiangling Zhang, Qing Xie, Daomin Ji, Zhifeng Bao, Jiachen Li, Yanchun Ma, Yongjian Liu
- Abstract summary: Composed Image Retrieval (CIR) enables image retrieval by combining multiple query modalities. CIRThan is a sketch+text Composed Image Retrieval dataset for Thangka imagery.
- Score: 14.600552992453977
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Composed Image Retrieval (CIR) enables image retrieval by combining multiple query modalities, but existing benchmarks predominantly focus on general-domain imagery and rely on reference images with short textual modifications. As a result, they provide limited support for retrieval scenarios that require fine-grained semantic reasoning, structured visual understanding, and domain-specific knowledge. In this work, we introduce CIRThan, a sketch+text Composed Image Retrieval dataset for Thangka imagery, a culturally grounded and knowledge-specific visual domain characterized by complex structures, dense symbolic elements, and domain-dependent semantic conventions. CIRThan contains 2,287 high-quality Thangka images, each paired with a human-drawn sketch and hierarchical textual descriptions at three semantic levels, enabling composed queries that jointly express structural intent and multi-level semantic specification. We provide standardized data splits, comprehensive dataset analysis, and benchmark evaluations of representative supervised and zero-shot CIR methods. Experimental results reveal that existing CIR approaches, largely developed for general-domain imagery, struggle to effectively align sketch-based abstractions and hierarchical textual semantics with fine-grained Thangka images, particularly without in-domain supervision. We believe CIRThan offers a valuable benchmark for advancing sketch+text CIR, hierarchical semantic modeling, and multimodal retrieval in cultural heritage and other knowledge-specific visual domains. The dataset is publicly available at https://github.com/jinyuxu-whut/CIRThan.
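As a rough illustration of the zero-shot composed-retrieval setting benchmarked here (and in ZS-CIR work like the first related paper below), the sketch that follows fuses CLIP embeddings of a query sketch and a textual modification by simple averaging and ranks a gallery by cosine similarity. This is a generic late-fusion baseline, not the paper's method; the file names, the example caption, and the averaging fusion rule are assumptions.

```python
# Minimal zero-shot sketch+text composed retrieval using OpenAI CLIP.
# Illustrative only: the fusion rule (feature averaging) and all file
# paths are assumptions, not the CIRThan paper's actual pipeline.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed_images(paths):
    # Encode a batch of images (gallery photos or a query sketch).
    batch = torch.stack(
        [preprocess(Image.open(p).convert("RGB")) for p in paths]
    ).to(device)
    with torch.no_grad():
        feats = model.encode_image(batch)
    return feats / feats.norm(dim=-1, keepdim=True)

def embed_text(sentences):
    tokens = clip.tokenize(sentences, truncate=True).to(device)
    with torch.no_grad():
        feats = model.encode_text(tokens)
    return feats / feats.norm(dim=-1, keepdim=True)

# Hypothetical query: one hand-drawn sketch plus a textual modification.
sketch_feat = embed_images(["query_sketch.png"])              # (1, d)
text_feat = embed_text(["a wrathful deity holding a vajra"])  # (1, d)
query = sketch_feat + text_feat
query = query / query.norm(dim=-1, keepdim=True)  # sum + renorm = averaged direction

gallery = embed_images(["thangka_001.jpg", "thangka_002.jpg"])  # (N, d)
scores = (query @ gallery.T).squeeze(0)                         # cosine similarities
print(scores.argsort(descending=True))                          # ranked gallery indices
```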
Related papers
- Fine-Grained Zero-Shot Composed Image Retrieval with Complementary Visual-Semantic Integration [64.12127577975696]
Zero-shot composed image retrieval (ZS-CIR) is a rapidly growing area with significant practical applications. Existing ZS-CIR methods often struggle to capture fine-grained changes and integrate visual and semantic information effectively. We propose a novel Fine-Grained Zero-Shot Composed Image Retrieval method with Complementary Visual-Semantic Integration.
arXiv Detail & Related papers (2026-01-20T15:17:14Z)
- Entity-Guided Multi-Task Learning for Infrared and Visible Image Fusion [14.3937321254743]
We propose a novel fusion approach named Entity-Guided Multi-Task learning for infrared and visible image fusion (EGMT). A principled method is proposed to extract entity-level textual information from image captions generated by large vision-language models. A parallel multi-task learning architecture is constructed, which integrates image fusion with a multi-label classification task. An entity-guided cross-modal interactive module is also developed to facilitate the fine-grained interaction between visual and entity-level textual features.
arXiv Detail & Related papers (2026-01-05T08:00:03Z) - Through the PRISm: Importance-Aware Scene Graphs for Image Retrieval [6.804414686833417]
PRISm is a multimodal framework that advances image-to-image retrieval through two novel components. The Importance Prediction Module identifies and retains the most critical objects and relational triplets within an image. The Edge-Aware Graph Neural Network explicitly encodes relational structure and integrates global visual features to produce semantically informed image embeddings.
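A minimal sketch of what an edge-aware message-passing layer does, conditioning each message on the relation (edge) feature, is given below in plain PyTorch. It is a generic illustration, not PRISm's actual architecture; all dimensions and the GRU-style update are assumptions.

```python
# Generic edge-conditioned message passing in plain PyTorch: each message
# depends on the source node and the relation (edge) feature, loosely
# illustrating what an "edge-aware" GNN layer does. Not PRISm's code.
import torch
import torch.nn as nn

class EdgeAwareLayer(nn.Module):
    def __init__(self, node_dim, edge_dim):
        super().__init__()
        self.msg = nn.Linear(node_dim + edge_dim, node_dim)
        self.upd = nn.GRUCell(node_dim, node_dim)

    def forward(self, x, edge_index, edge_attr):
        # x: (N, node_dim) node features; edge_index: (2, E) src/dst pairs;
        # edge_attr: (E, edge_dim) relation features from scene-graph triplets.
        src, dst = edge_index
        messages = torch.relu(self.msg(torch.cat([x[src], edge_attr], dim=-1)))
        agg = torch.zeros_like(x).index_add_(0, dst, messages)  # sum per target
        return self.upd(agg, x)

layer = EdgeAwareLayer(node_dim=64, edge_dim=32)
x = torch.randn(5, 64)                        # 5 detected objects
edge_index = torch.tensor([[0, 1, 2], [1, 2, 0]])
edge_attr = torch.randn(3, 32)                # 3 relational triplets
print(layer(x, edge_index, edge_attr).shape)  # torch.Size([5, 64])
```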
arXiv Detail & Related papers (2025-12-20T15:57:46Z) - Text-based Aerial-Ground Person Retrieval [55.31140361809554]
This work introduces Text-based Aerial-Ground Person Retrieval (TAG-PR). It aims to retrieve person images from heterogeneous aerial and ground views with textual descriptions.
arXiv Detail & Related papers (2025-11-11T15:49:04Z)
- ComAlign: Compositional Alignment in Vision-Language Models [2.3250871476216814]
We introduce Compositional Alignment (ComAlign) to discover more exact correspondence of text and image components.
Our methodology emphasizes that the compositional structure extracted from the text modality must also be retained in the image modality.
We train a lightweight network on top of existing visual and language encoders using a small dataset.
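That recipe, freezing both encoders and training small heads with a symmetric contrastive loss, can be sketched as follows. This is a generic illustration of the training setup, not ComAlign's actual objective; the dimensions, temperature, and dummy data are assumptions.

```python
# Generic sketch: lightweight projection heads trained on top of frozen
# encoders with a symmetric InfoNCE loss. Illustrative of the recipe only;
# dimensions, temperature, and data are assumptions, not ComAlign's setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

img_head = nn.Linear(512, 256)   # trainable; the backbone stays frozen
txt_head = nn.Linear(512, 256)
opt = torch.optim.AdamW(
    list(img_head.parameters()) + list(txt_head.parameters()), lr=1e-4
)

def info_nce(img_feats, txt_feats, temperature=0.07):
    img = F.normalize(img_head(img_feats), dim=-1)
    txt = F.normalize(txt_head(txt_feats), dim=-1)
    logits = img @ txt.T / temperature      # (B, B) similarity matrix
    targets = torch.arange(len(img))        # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2

# One step on a dummy batch of precomputed frozen-encoder features.
loss = info_nce(torch.randn(8, 512), torch.randn(8, 512))
loss.backward(); opt.step(); opt.zero_grad()
print(float(loss))
```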
arXiv Detail & Related papers (2024-09-12T16:46:41Z)
- LuoJiaHOG: A Hierarchy Oriented Geo-aware Image Caption Dataset for Remote Sensing Image-Text Retrieval [8.656768875730904]
We introduce LuojiaHOG, an image caption dataset that is geospatially aware, label-extension-friendly, and comprehensively captioned.
LuojiaHOG combines hierarchical spatial sampling, a classification system aligned with Open Geospatial Consortium (OGC) standards, and detailed caption generation.
We also propose a CLIP-based Image Semantic Enhancement Network (CISEN) to support more sophisticated image-text retrieval (ITR).
arXiv Detail & Related papers (2024-03-16T10:46:14Z)
- SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval [64.03631654052445]
Current benchmarks for evaluating MMIR performance on image-text pairing leave a notable gap in the scientific domain.
We develop a specialised scientific MMIR benchmark by leveraging open-access paper collections.
This benchmark comprises 530K meticulously curated image-text pairs, extracted from figures and tables with detailed captions in scientific documents.
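Benchmarks like this are typically scored with Recall@K over a query-gallery similarity matrix; a minimal sketch of that metric follows. It is a generic implementation, not SciMMIR's official evaluation script, and it assumes query i's ground-truth match is gallery item i.

```python
# Minimal Recall@K for image-text retrieval from a similarity matrix.
# Generic evaluation sketch; assumes ground truth lies on the diagonal.
import torch

def recall_at_k(sim, ks=(1, 5, 10)):
    # sim: (Q, G) query-to-gallery similarity scores.
    ranks = sim.argsort(dim=1, descending=True)     # gallery ids, best first
    gt = torch.arange(sim.size(0)).unsqueeze(1)     # true match per query
    hit_rank = (ranks == gt).float().argmax(dim=1)  # position of true match
    return {k: (hit_rank < k).float().mean().item() for k in ks}

sim = torch.randn(100, 100)
print(recall_at_k(sim))  # random scores give roughly {1: 0.01, 5: 0.05, 10: 0.1}
```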
arXiv Detail & Related papers (2024-01-24T14:23:12Z)
- HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval [13.061063817876336]
We propose a novel Hierarchical Graph Alignment Network (HGAN) for image-text retrieval.
First, to capture comprehensive multimodal features, we construct feature graphs for the image and text modalities respectively.
Then, a multi-granularity shared space is established via the proposed Multi-granularity Feature Aggregation and Rearrangement (MFAR) module.
Finally, the resulting image and text features are further refined through three-level similarity functions to achieve hierarchical alignment.
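As a toy picture of scoring a pair at several granularities and combining the results, the sketch below computes cosine similarities at global, local, and relation levels and takes a weighted sum. It is a generic illustration, not HGAN's actual similarity functions; the weights and dimensions are assumptions.

```python
# Toy multi-granularity matching score: cosine similarity at global,
# region/word, and relation levels, combined by a weighted sum.
# Generic illustration of hierarchical alignment, not HGAN's functions.
import torch
import torch.nn.functional as F

def level_sim(a, b):
    # Mean of best-matching cosine similarities between two feature sets.
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T
    return sim.max(dim=1).values.mean()

def hierarchical_score(img_lv, txt_lv, weights=(0.5, 0.3, 0.2)):
    # img_lv/txt_lv: [global (1,d), local (n,d), relation (m,d)] feature lists.
    return sum(w * level_sim(a, b) for w, a, b in zip(weights, img_lv, txt_lv))

img = [torch.randn(1, 64), torch.randn(36, 64), torch.randn(12, 64)]
txt = [torch.randn(1, 64), torch.randn(10, 64), torch.randn(4, 64)]
print(float(hierarchical_score(img, txt)))
```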
arXiv Detail & Related papers (2022-12-16T05:08:52Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms the state of the art without any post-processing.
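Text-to-pixel alignment can be pictured as scoring every pixel embedding against a sentence embedding to obtain a per-pixel logit map. The minimal sketch below illustrates only that idea; it is not CRIS's actual decoder, and the feature shapes and temperature are assumptions.

```python
# Minimal text-to-pixel scoring: dot product between each pixel embedding
# and a sentence embedding yields a per-pixel logit map for the referred
# region. Illustrative of the alignment idea only, not CRIS's decoder.
import torch
import torch.nn.functional as F

pix = F.normalize(torch.randn(1, 512, 28, 28), dim=1)  # (B, d, H, W) pixel features
txt = F.normalize(torch.randn(1, 512), dim=-1)         # (B, d) sentence feature

logits = torch.einsum("bdhw,bd->bhw", pix, txt) / 0.07  # temperature-scaled scores
mask = logits.sigmoid() > 0.5                           # binary referring mask
print(logits.shape, mask.float().mean())
```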
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
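One common way to integrate learned topics and visual features into a language model is to condition every decoding step on them; the toy step below concatenates a topic vector and a pooled visual feature with the word embedding. This is a generic illustration, not the paper's variational framework; all dimensions are assumptions.

```python
# Toy topic-conditioned captioning step: the decoder consumes the word
# embedding concatenated with a topic vector and visual context at each
# step. Generic illustration, not the paper's variational framework.
import torch
import torch.nn as nn

vocab, d, topic_d, vis_d = 1000, 256, 64, 512
embed = nn.Embedding(vocab, d)
cell = nn.LSTMCell(d + topic_d + vis_d, d)
out = nn.Linear(d, vocab)

word = torch.tensor([3])          # previous token id
topic = torch.randn(1, topic_d)   # learned hierarchical topic vector
vis = torch.randn(1, vis_d)       # pooled visual feature
h, c = torch.zeros(1, d), torch.zeros(1, d)

h, c = cell(torch.cat([embed(word), topic, vis], dim=-1), (h, c))
logits = out(h)                   # distribution over the next word
print(logits.shape)               # torch.Size([1, 1000])
```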
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
- Consensus-Aware Visual-Semantic Embedding for Image-Text Matching [69.34076386926984]
Image-text matching plays a central role in bridging vision and language.
Most existing approaches only rely on the image-text instance pair to learn their representations.
We propose a Consensus-aware Visual-Semantic Embedding model to incorporate the consensus information.
arXiv Detail & Related papers (2020-07-17T10:22:57Z)