Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image
Retrieval
- URL: http://arxiv.org/abs/2210.10486v1
- Date: Wed, 19 Oct 2022 11:50:14 GMT
- Title: Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image
Retrieval
- Authors: Abhra Chaudhuri, Massimiliano Mancini, Yanbei Chen, Zeynep Akata,
Anjan Dutta
- Abstract summary: We propose a cross-attention framework for Vision Transformers (XModalViT) that fuses modality-specific information instead of discarding it.
Our framework first maps paired datapoints from the individual photo and sketch modalities to fused representations that unify information from both modalities.
We then decouple the input space of the aforementioned modality fusion network into independent encoders of the individual modalities via contrastive and relational cross-modal knowledge distillation.
- Score: 55.21569389894215
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Representation learning for sketch-based image retrieval has mostly been
tackled by learning embeddings that discard modality-specific information. As
instances from different modalities can often provide complementary information
describing the underlying concept, we propose a cross-attention framework for
Vision Transformers (XModalViT) that fuses modality-specific information
instead of discarding it. Our framework first maps paired datapoints from the
individual photo and sketch modalities to fused representations that unify
information from both modalities. We then decouple the input space of the
aforementioned modality fusion network into independent encoders of the
individual modalities via contrastive and relational cross-modal knowledge
distillation. Such encoders can then be applied to downstream tasks like
cross-modal retrieval. We demonstrate the expressive capacity of the learned
representations by performing a wide range of experiments and achieving
state-of-the-art results on three fine-grained sketch-based image retrieval
benchmarks: Shoe-V2, Chair-V2 and Sketchy. Implementation is available at
https://github.com/abhrac/xmodal-vit.
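As a rough illustration of the two stages the abstract describes (cross-attention fusion of paired photo/sketch inputs, followed by distillation into independent unimodal encoders), the PyTorch sketch below uses bidirectional cross-attention as the fusion (teacher) network, and an InfoNCE-style contrastive term plus a similarity-matching relational term for distillation. The module shapes, pooling, and exact loss forms are assumptions, not the paper's specification; the authors' actual implementation is at the repository above.

```python
# Minimal, illustrative reconstruction of the fusion-then-distillation idea.
# Dimensions, pooling, and the exact loss forms are assumptions; see
# https://github.com/abhrac/xmodal-vit for the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttentionFusion(nn.Module):
    """Teacher: fuses paired photo/sketch token sequences via bidirectional
    cross-attention into a single unified representation."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.photo_to_sketch = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sketch_to_photo = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, photo_tokens, sketch_tokens):
        # Each modality attends to the other; pooled outputs are then merged.
        p2s, _ = self.photo_to_sketch(photo_tokens, sketch_tokens, sketch_tokens)
        s2p, _ = self.sketch_to_photo(sketch_tokens, photo_tokens, photo_tokens)
        return self.proj(torch.cat([p2s.mean(1), s2p.mean(1)], dim=-1))


def contrastive_distill(student, teacher, temperature: float = 0.07):
    """InfoNCE-style distillation: each student embedding must match the fused
    teacher embedding of its own photo-sketch pair against in-batch negatives."""
    s = F.normalize(student, dim=-1)
    t = F.normalize(teacher, dim=-1)
    logits = s @ t.t() / temperature
    labels = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, labels)


def relational_distill(student, teacher):
    """Relational distillation: the student's in-batch similarity structure
    should mimic the teacher's (one common way to transfer relations)."""
    s_sim = F.normalize(student, dim=-1) @ F.normalize(student, dim=-1).t()
    t_sim = F.normalize(teacher, dim=-1) @ F.normalize(teacher, dim=-1).t()
    return F.mse_loss(s_sim, t_sim)


if __name__ == "__main__":
    B, N, D = 8, 16, 256  # batch size, tokens per instance, embedding dim
    photo_tokens = torch.randn(B, N, D)   # stand-ins for ViT patch tokens
    sketch_tokens = torch.randn(B, N, D)

    fusion = CrossAttentionFusion(D)
    photo_student = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, D))
    sketch_student = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, D))

    # In the paper's pipeline the fusion (teacher) network is trained first;
    # here it is simply frozen to illustrate the distillation step.
    with torch.no_grad():
        fused = fusion(photo_tokens, sketch_tokens)

    p_emb = photo_student(photo_tokens.mean(1))
    s_emb = sketch_student(sketch_tokens.mean(1))
    loss = (contrastive_distill(p_emb, fused) + contrastive_distill(s_emb, fused)
            + relational_distill(p_emb, fused) + relational_distill(s_emb, fused))
    print(f"distillation loss: {loss.item():.4f}")
```

At retrieval time only the distilled unimodal encoders are needed: a query sketch and the gallery photos are embedded independently and ranked, e.g. by cosine similarity, which is what makes the decoupled encoders usable for downstream cross-modal retrieval.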
Related papers
- DAE-Fuse: An Adaptive Discriminative Autoencoder for Multi-Modality Image Fusion [10.713089596405053]
A two-phase discriminative autoencoder framework, DAE-Fuse, generates sharp and natural fused images.
Experiments on public infrared-visible, medical image fusion, and downstream object detection datasets demonstrate our method's superiority and generalizability.
arXiv Detail & Related papers (2024-09-16T08:37:09Z) - Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval [10.202562518113677]
We propose an approach called Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval.
Our key innovation lies in the usage of text data as auxiliary information for images, thus leveraging the inherent zero-shot generalization ability that language offers.
arXiv Detail & Related papers (2024-07-01T05:32:06Z) - PV2TEA: Patching Visual Modality to Textual-Established Information
Extraction [59.76117533540496]
We patch the visual modality to the textual-established attribute information extractor.
PV2TEA is an encoder-decoder architecture equipped with three bias reduction schemes.
Empirical results on real-world e-Commerce datasets demonstrate up to 11.74% absolute (20.97% relative) F1 increase over unimodal baselines.
arXiv Detail & Related papers (2023-06-01T05:39:45Z) - TriPINet: Tripartite Progressive Integration Network for Image
Manipulation Localization [3.7359400978194675]
We propose a tripartite progressive integration network (TriPINet) for end-to-end image manipulation localization.
We develop a guided cross-modality dual-attention (gCMDA) module to fuse different types of forged clues.
Extensive experiments are conducted to compare our method with state-of-the-art image forensics approaches.
arXiv Detail & Related papers (2022-12-25T02:27:58Z) - Single Stage Virtual Try-on via Deformable Attention Flows [51.70606454288168]
Virtual try-on aims to generate a photo-realistic fitting result given an in-shop garment and a reference person image.
We develop a novel Deformable Attention Flow (DAFlow) which applies the deformable attention scheme to multi-flow estimation.
Our proposed method achieves state-of-the-art performance both qualitatively and quantitatively.
arXiv Detail & Related papers (2022-07-19T10:01:31Z) - Multimodal Masked Autoencoders Learn Transferable Representations [127.35955819874063]
We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE).
M3AE learns a unified encoder for both vision and language data via masked token prediction.
We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks.
arXiv Detail & Related papers (2022-05-27T19:09:42Z) - SceneTrilogy: On Human Scene-Sketch and its Complementarity with Photo
and Text [109.69076457732632]
In this paper, we extend scene understanding to include that of human sketch.
We focus on learning a flexible joint embedding that fully supports the "optionality" that this complementarity brings.
arXiv Detail & Related papers (2022-04-25T20:58:17Z) - Similarity-Aware Fusion Network for 3D Semantic Segmentation [87.51314162700315]
We propose a similarity-aware fusion network (SAFNet) to adaptively fuse 2D images and 3D point clouds for 3D semantic segmentation.
We employ a late fusion strategy where we first learn the geometric and contextual similarities between the input and back-projected (from 2D pixels) point clouds.
We show that SAFNet significantly outperforms existing state-of-the-art fusion-based approaches across various levels of data integrity.
arXiv Detail & Related papers (2021-07-04T09:28:18Z) - Juggling With Representations: On the Information Transfer Between
Imagery, Point Clouds, and Meshes for Multi-Modal Semantics [0.0]
Images and Point Clouds (PCs) are fundamental data representations in urban applications.
We present a mesh-driven methodology that explicitly integrates imagery, PCs, and meshes.
arXiv Detail & Related papers (2021-03-12T15:26:30Z) - Cross-modal Image Retrieval with Deep Mutual Information Maximization [14.778158582349137]
We study cross-modal image retrieval, where the query consists of a source image plus text describing modifications to that image, and the goal is to retrieve the desired target image.
Our method narrows the gap between the text and image modalities by maximizing the mutual information between their representations, which are not exactly semantically identical (see the sketch after this entry).
arXiv Detail & Related papers (2021-03-10T13:08:09Z)
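As a rough sketch of the last entry's setup, the snippet below composes a source-image embedding with a modification-text embedding and maximizes an InfoNCE lower bound on the mutual information between the composed query and the target-image embedding. The InfoNCE estimator, the composition module, and all dimensions are illustrative assumptions; the summary above does not specify the paper's exact estimator.

```python
# Illustrative composed-retrieval sketch (assumptions noted in the text above):
# maximize an InfoNCE lower bound on the mutual information between the
# composed (source image + modification text) query and the target image.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ComposedQueryEncoder(nn.Module):
    """Merges a source-image embedding with a modification-text embedding."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.mix = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, img_emb, txt_emb):
        return self.mix(torch.cat([img_emb, txt_emb], dim=-1))


def infonce_mi_bound(query, target, temperature: float = 0.1):
    """Negative cross-entropy over in-batch negatives; maximizing this quantity
    tightens a lower bound on I(query; target)."""
    q = F.normalize(query, dim=-1)
    t = F.normalize(target, dim=-1)
    logits = q @ t.t() / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return -F.cross_entropy(logits, labels)


if __name__ == "__main__":
    B, D = 16, 256
    src_img = torch.randn(B, D)   # embeddings from some image backbone
    mod_txt = torch.randn(B, D)   # embeddings from some text encoder
    tgt_img = torch.randn(B, D)   # embeddings of the desired target images

    composer = ComposedQueryEncoder(D)
    bound = infonce_mi_bound(composer(src_img, mod_txt), tgt_img)
    loss = -bound  # minimizing the negative bound maximizes mutual information
    print(f"InfoNCE MI bound estimate: {bound.item():.4f}")
```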
This list is automatically generated from the titles and abstracts of the papers listed on this site.