Is Cross-modal Information Retrieval Possible without Training?
- URL: http://arxiv.org/abs/2304.11095v1
- Date: Thu, 20 Apr 2023 02:36:18 GMT
- Title: Is Cross-modal Information Retrieval Possible without Training?
- Authors: Hyunjin Choi, Hyunjae Lee, Seongho Joe, Youngjune L. Gwon
- Abstract summary: We take a simple mapping computed from least squares and the singular value decomposition (SVD) as a solution to the Procrustes problem.
That is, given information in one modality, such as text, the mapping helps us locate a semantically equivalent data item in another modality, such as an image.
Using off-the-shelf pretrained deep learning models, we have experimented with these simple cross-modal mappings on text-to-image and image-to-text retrieval tasks.
- Score: 4.616703548353372
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Encoded representations from a pretrained deep learning model (e.g., BERT text embeddings, penultimate CNN layer activations of an image) convey a rich set of features beneficial for information retrieval. Embeddings for a particular modality of data occupy a high-dimensional space of their own, but they can be semantically aligned with another modality's space by a simple mapping, without training a deep neural net. In this paper, we take a simple mapping computed from least squares and the singular value decomposition (SVD), as a solution to the Procrustes problem, to serve as a means of cross-modal information retrieval. That is, given information in one modality such as text, the mapping helps us locate a semantically equivalent data item in another modality such as an image. Using off-the-shelf pretrained deep learning models, we have experimented with these simple cross-modal mappings on text-to-image and image-to-text retrieval tasks. Despite their simplicity, our mappings perform reasonably well, reaching a top recall@10 of 77%, which is comparable to approaches that require costly neural net training and fine-tuning. We have improved the simple mappings by applying contrastive learning to the pretrained models. Contrastive learning can be thought of as properly biasing the pretrained encoders to enhance the cross-modal mapping quality. We have further improved performance with a gated multilayer perceptron (gMLP), a simple neural architecture.
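As an illustration of the kind of mapping the abstract describes, the sketch below fits both a least-squares mapping and the SVD-based orthogonal Procrustes solution from paired text and image embeddings, then ranks an image gallery by cosine similarity. This is a minimal sketch, not the paper's exact procedure: the function names are illustrative, and the Procrustes variant assumes both embedding sets have already been brought to a common dimension (e.g., via PCA), which is an assumption made here.

```python
import numpy as np

def fit_least_squares(X_text, Y_image):
    # Least-squares mapping W minimizing ||X_text @ W - Y_image||_F.
    # X_text: (n, d_t) paired text embeddings; Y_image: (n, d_i) image embeddings.
    W, *_ = np.linalg.lstsq(X_text, Y_image, rcond=None)
    return W  # shape (d_t, d_i)

def fit_procrustes(X_text, Y_image):
    # Orthogonal Procrustes solution via SVD of X^T Y; assumes both modalities
    # share a common dimension d (an assumption of this sketch).
    U, _, Vt = np.linalg.svd(X_text.T @ Y_image)
    return U @ Vt  # orthogonal (d, d) mapping

def retrieve_images(text_query_emb, W, image_gallery, k=10):
    # Map the text embedding into image space and rank gallery items
    # by cosine similarity; returns indices of the top-k candidates.
    q = text_query_emb @ W
    q = q / np.linalg.norm(q)
    g = image_gallery / np.linalg.norm(image_gallery, axis=1, keepdims=True)
    return np.argsort(-(g @ q))[:k]
```

Recall@10 then simply checks whether the ground-truth image for a query appears among the ten returned candidates.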
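The contrastive-learning refinement is only summarized in the abstract; a generic, CLIP-style symmetric InfoNCE objective of the sort commonly used to bias a pair of encoders toward alignment might look like the following sketch. The temperature value and the in-batch negative construction are assumptions here, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(text_emb, image_emb, temperature=0.07):
    # Generic CLIP-style InfoNCE sketch: matched (text, image) pairs lie on
    # the diagonal of the similarity matrix; all other pairs act as negatives.
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature      # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_t2i = F.cross_entropy(logits, targets)          # text -> image
    loss_i2t = F.cross_entropy(logits.t(), targets)      # image -> text
    return 0.5 * (loss_t2i + loss_i2t)
```

The symmetric form penalizes both retrieval directions, which matches the text-to-image and image-to-text evaluation described in the abstract.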
Related papers
- Self-Supervised Pre-Training with Contrastive and Masked Autoencoder Methods for Dealing with Small Datasets in Deep Learning for Medical Imaging [8.34398674359296]
Deep learning in medical imaging has the potential to minimize the risk of diagnostic errors, reduce radiologist workload, and accelerate diagnosis.
Training such deep learning models requires large and accurate datasets, with annotations for all training samples.
To address this challenge, deep learning models can be pre-trained on large image datasets without annotations using methods from the field of self-supervised learning.
arXiv Detail & Related papers (2023-08-12T11:31:01Z)
- PRSNet: A Masked Self-Supervised Learning Pedestrian Re-Identification Method [2.0411082897313984]
This paper designs a mask-reconstruction pretext task to obtain a pre-trained model with strong robustness.
The network is then optimized with an improved, centroid-based triplet loss.
This method achieves about 5% higher mAP on the Market-1501 and CUHK03 datasets than existing self-supervised pedestrian re-identification methods.
arXiv Detail & Related papers (2023-03-11T07:20:32Z)
- ALSO: Automotive Lidar Self-supervision by Occupancy estimation [70.70557577874155]
We propose a new self-supervised method for pre-training the backbone of deep perception models operating on point clouds.
The core idea is to train the model on a pretext task which is the reconstruction of the surface on which the 3D points are sampled.
The intuition is that if the network is able to reconstruct the scene surface, given only sparse input points, then it probably also captures some fragments of semantic information.
arXiv Detail & Related papers (2022-12-12T13:10:19Z)
- ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training [29.240131406803794]
We show that a common space can be created without any training at all, using single-domain encoders and a much smaller number of image-text pairs.
Our model has unique properties, most notably, deploying a new version with updated training samples can be done in a matter of seconds.
arXiv Detail & Related papers (2022-10-04T16:56:22Z)
- Adaptive Convolutional Dictionary Network for CT Metal Artifact Reduction [62.691996239590125]
We propose an adaptive convolutional dictionary network (ACDNet) for metal artifact reduction.
Our ACDNet can automatically learn the prior for artifact-free CT images via training data and adaptively adjust the representation kernels for each input CT image.
Our method inherits the clear interpretability of model-based methods and maintains the powerful representation ability of learning-based methods.
arXiv Detail & Related papers (2022-05-16T06:49:36Z)
- Neural Maximum A Posteriori Estimation on Unpaired Data for Motion Deblurring [87.97330195531029]
We propose a Neural Maximum A Posteriori (NeurMAP) estimation framework for training neural networks to recover blind motion information and sharp content from unpaired data.
The proposed NeurMAP can be applied to existing deblurring neural networks, and is the first framework that enables training image deblurring networks on unpaired datasets.
arXiv Detail & Related papers (2022-04-26T08:09:47Z)
- Is Deep Image Prior in Need of a Good Education? [57.3399060347311]
Deep image prior was introduced as an effective prior for image reconstruction.
Despite its impressive reconstructive properties, the approach is slow when compared to learned or traditional reconstruction techniques.
We develop a two-stage learning paradigm to address the computational challenge.
arXiv Detail & Related papers (2021-11-23T15:08:26Z)
- Predicting What You Already Know Helps: Provable Self-Supervised Learning [60.27658820909876]
Self-supervised representation learning solves auxiliary prediction tasks (known as pretext tasks) without requiring labeled data.
We show a mechanism that exploits the statistical connections between certain reconstruction-based pretext tasks to guarantee learning a good representation.
We prove that a linear layer yields a small approximation error even for a complex ground-truth function class.
arXiv Detail & Related papers (2020-08-03T17:56:13Z)
- Adversarially-Trained Deep Nets Transfer Better: Illustration on Image Classification [53.735029033681435]
Transfer learning is a powerful methodology for adapting pre-trained deep neural networks on image recognition tasks to new domains.
In this work, we demonstrate that adversarially-trained models transfer better than non-adversarially-trained models.
arXiv Detail & Related papers (2020-07-11T22:48:42Z)