Zero-Shot Everything Sketch-Based Image Retrieval, and in Explainable Style
- URL: http://arxiv.org/abs/2303.14348v1
- Date: Sat, 25 Mar 2023 03:52:32 GMT
- Title: Zero-Shot Everything Sketch-Based Image Retrieval, and in Explainable Style
- Authors: Fengyin Lin, Mingkang Li, Da Li, Timothy Hospedales, Yi-Zhe Song, Yonggang Qi
- Abstract summary: This paper studies the problem of zero-shot sketch-based image retrieval (ZS-SBIR).
The key innovation lies in the realization that such a cross-modal matching problem can be reduced to comparisons of groups of key local patches.
Experiments show the method delivers superior performance across all ZS-SBIR settings.
- Score: 40.112168046676125
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper studies the problem of zero-shot sketch-based image retrieval
(ZS-SBIR), but with two significant differentiators from prior art: (i) we
tackle all variants (inter-category, intra-category, and cross-dataset) of
ZS-SBIR with just one network ("everything"), and (ii) we would really like
to understand how this sketch-photo matching operates ("explainable"). Our
key innovation lies in the realization that such a cross-modal matching
problem can be reduced to comparisons of groups of key local patches, akin
to the seasoned "bag-of-words" paradigm. With this change alone, we are able
to achieve both of the aforementioned goals, with the added benefit of no
longer requiring external semantic knowledge. Technically, ours is a
transformer-based cross-modal network with three novel components: (i) a
self-attention module with a learnable tokenizer that produces visual tokens
corresponding to the most informative local regions, (ii) a cross-attention
module that computes local correspondences between the visual tokens across the
two modalities, and finally (iii) a kernel-based relation network that assembles
local putative matches and produces an overall similarity metric for a
sketch-photo pair. Experiments show our method indeed delivers superior
performance across all ZS-SBIR settings. The all-important explainability goal
is elegantly achieved by visualizing cross-modal token correspondences and, for
the first time, via sketch-to-photo synthesis by universal replacement of all
matched photo patches. Code and model are available at
https://github.com/buptLinfy/ZSE-SBIR.
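As a rough illustration of the pipeline the abstract describes, the PyTorch sketch below wires a learnable tokenizer, a cross-modal cross-attention block, and a kernel-based relation head into a single sketch-photo scorer. The module names, dimensions, the shared tokenizer, and the Gaussian (RBF) kernel are illustrative assumptions, not the authors' implementation; the official code is in the linked repository.

```python
# Minimal, illustrative sketch of the three components named in the abstract.
import torch
import torch.nn as nn


class LearnableTokenizer(nn.Module):
    """Learnable query tokens attend over backbone patch features, keeping a
    small set of visual tokens for the most informative local regions."""

    def __init__(self, dim=256, num_tokens=16, num_heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_feats):                    # (B, N_patches, dim)
        q = self.queries.expand(patch_feats.size(0), -1, -1)
        tokens, _ = self.attn(q, patch_feats, patch_feats)
        return tokens                                  # (B, num_tokens, dim)


class CrossAttention(nn.Module):
    """Sketch tokens attend to photo tokens, producing local putative
    correspondences plus attention weights that can be visualized."""

    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens_a, tokens_b):
        matched, weights = self.attn(tokens_a, tokens_b, tokens_b)
        return matched, weights                        # (B, Ta, dim), (B, Ta, Tb)


class KernelRelation(nn.Module):
    """Scores each token/correspondence pair with an RBF kernel and averages
    the per-token relations into one sketch-photo similarity."""

    def __init__(self, gamma=1.0):
        super().__init__()
        self.log_gamma = nn.Parameter(torch.tensor(gamma).log())

    def forward(self, tokens, matched):
        sq_dist = ((tokens - matched) ** 2).sum(-1)    # (B, num_tokens)
        relation = torch.exp(-self.log_gamma.exp() * sq_dist)
        return relation.mean(-1)                       # (B,) overall similarity


class ZSSBIRMatcher(nn.Module):
    """One shared network scoring a sketch-photo pair from patch features."""

    def __init__(self, dim=256):
        super().__init__()
        self.tokenize = LearnableTokenizer(dim)        # shared across modalities
        self.cross = CrossAttention(dim)
        self.relation = KernelRelation()

    def forward(self, sketch_patches, photo_patches):
        s = self.tokenize(sketch_patches)
        p = self.tokenize(photo_patches)
        matched, weights = self.cross(s, p)            # sketch -> photo matches
        return self.relation(s, matched), weights


if __name__ == "__main__":
    # Toy patch features standing in for a ViT backbone's patch embeddings.
    sketch = torch.randn(2, 196, 256)
    photo = torch.randn(2, 196, 256)
    score, weights = ZSSBIRMatcher()(sketch, photo)
    print(score.shape, weights.shape)                  # (2,) and (2, 16, 16)
```

The cross-attention weights returned here are the kind of token-level correspondences that can be visualized to serve the explainability goal.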
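One plausible reading of the "sketch-to-photo synthesis by universal replacement of all matched photo patches" result is sketched below: each sketch patch is swapped for the photo patch it matches most strongly, so the correspondences themselves become the explanation. The function name, the patch-level attention map, and the same-size-image assumption are all hypothetical, not the paper's procedure.

```python
# Conceptual sketch only: rebuild an image on the sketch's patch grid using,
# for each sketch patch, its best-matched photo patch.
import torch
import torch.nn.functional as F


def synthesize_by_patch_replacement(sketch_img, photo_img, attn_weights, patch=16):
    """sketch_img, photo_img: (3, H, W) tensors of the same size;
    attn_weights: (N_sketch_patches, N_photo_patches) cross-attention map."""
    _, h, w = sketch_img.shape
    gh, gw = h // patch, w // patch
    assert attn_weights.shape == (gh * gw, gh * gw)
    # Unfold the photo into non-overlapping patch columns: (1, 3*patch*patch, N).
    p_patches = F.unfold(photo_img.unsqueeze(0), patch, stride=patch)
    # For every sketch patch, pick the index of its best-matching photo patch.
    best = attn_weights.argmax(dim=1)                  # (N_sketch_patches,)
    replaced = p_patches[:, :, best]                   # gather matched photo patches
    # Fold the gathered patches back into an image laid out on the sketch grid.
    out = F.fold(replaced, (h, w), patch, stride=patch)
    return out.squeeze(0)                              # (3, H, W) synthesized image


if __name__ == "__main__":
    sketch, photo = torch.rand(3, 224, 224), torch.rand(3, 224, 224)
    weights = torch.rand(196, 196)                     # toy patch-level attention
    print(synthesize_by_patch_replacement(sketch, photo, weights).shape)
```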
Related papers
- Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval [10.202562518113677]
We propose an approach called Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval.
Our key innovation lies in the usage of text data as auxiliary information for images, thus leveraging the inherent zero-shot generalization ability that language offers.
arXiv Detail & Related papers (2024-07-01T05:32:06Z)
- Towards Self-Supervised FG-SBIR with Unified Sample Feature Alignment and Multi-Scale Token Recycling [11.129453244307369]
FG-SBIR aims to minimize the distance between sketches and corresponding images in the embedding space.
We propose an effective approach to narrow the gap between the two domains.
It mainly facilitates unified mutual information sharing both within and across samples.
arXiv Detail & Related papers (2024-06-17T13:49:12Z)
- Modality-Aware Representation Learning for Zero-shot Sketch-based Image Retrieval [10.568851068989973]
Zero-shot learning offers an efficient solution for a machine learning model to treat unseen categories.
We propose a novel framework that indirectly aligns sketches and photos by contrasting them through texts.
With an explicit modality encoding learned from data, our approach disentangles modality-agnostic semantics from modality-specific information.
arXiv Detail & Related papers (2024-01-10T00:39:03Z)
- Symmetrical Bidirectional Knowledge Alignment for Zero-Shot Sketch-Based Image Retrieval [69.46139774646308]
This paper studies the problem of zero-shot sketch-based image retrieval (ZS-SBIR).
It aims to use sketches from unseen categories as queries to match the images of the same category.
We propose a novel Symmetrical Bidirectional Knowledge Alignment for zero-shot sketch-based image retrieval (SBKA).
arXiv Detail & Related papers (2023-12-16T04:50:34Z)
- Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval [55.21569389894215]
We propose a cross-attention framework for Vision Transformers (XModalViT) that fuses modality-specific information instead of discarding it.
Our framework first maps paired datapoints from the individual photo and sketch modalities to fused representations that unify information from both modalities.
We then decouple the input space of the aforementioned modality fusion network into independent encoders of the individual modalities via contrastive and relational cross-modal knowledge distillation.
arXiv Detail & Related papers (2022-10-19T11:50:14Z)
- Domain-Smoothing Network for Zero-Shot Sketch-Based Image Retrieval [66.37346493506737]
Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) is a novel cross-modal retrieval task.
We propose a novel Domain-Smoothing Network (DSN) for ZS-SBIR.
Our approach notably outperforms the state-of-the-art methods on both the Sketchy and TU-Berlin datasets.
arXiv Detail & Related papers (2021-06-22T14:58:08Z)
- CrossATNet - A Novel Cross-Attention Based Framework for Sketch-Based Image Retrieval [30.249581102239645]
We propose a novel framework for cross-modal zero-shot learning (ZSL) in the context of sketch-based image retrieval (SBIR).
While we define a cross-modal triplet loss to ensure the discriminative nature of the shared space, an innovative cross-modal attention learning strategy is also proposed to guide feature extraction from the image domain.
arXiv Detail & Related papers (2021-04-20T12:11:12Z)
- StyleMeUp: Towards Style-Agnostic Sketch-Based Image Retrieval [119.03470556503942]
The cross-modal matching problem is typically solved by learning a joint embedding space where the semantic content shared between the photo and sketch modalities is preserved.
An effective model needs to explicitly account for this style diversity and, crucially, adapt to unseen user styles.
Our model can not only disentangle the cross-modal shared semantic content, but can adapt the disentanglement to any unseen user style as well, making the model truly agnostic.
arXiv Detail & Related papers (2021-03-29T15:44:19Z)
- Semantically Tied Paired Cycle Consistency for Any-Shot Sketch-based Image Retrieval [55.29233996427243]
Low-shot sketch-based image retrieval is an emerging task in computer vision.
In this paper, we address any-shot, i.e. zero-shot and few-shot, sketch-based image retrieval (SBIR) tasks.
For solving these tasks, we propose a semantically aligned cycle-consistent generative adversarial network (SEM-PCYC).
Our results demonstrate a significant boost in any-shot performance over the state-of-the-art on the extended versions of the Sketchy, TU-Berlin and QuickDraw datasets.
arXiv Detail & Related papers (2020-06-20T22:43:53Z)