Automatic Creative Selection with Cross-Modal Matching
- URL: http://arxiv.org/abs/2405.00029v1
- Date: Wed, 28 Feb 2024 22:05:38 GMT
- Title: Automatic Creative Selection with Cross-Modal Matching
- Authors: Alex Kim, Jia Huang, Rob Monarch, Jerry Kwac, Anikesh Kamath, Parmeshwar Khurd, Kailash Thiyagarajan, Goodman Gu,
- Abstract summary: We present a novel approach to matching an App image to search terms based on fine-tuning a pre-trained LXMERT model.
We evaluate our approach using two sets of labels: advertiser associated (image, search term) pairs for a given application, and human ratings for the relevance between (image, search term) pairs.
- Score: 0.4215938932388723
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Application developers advertise their Apps by creating product pages with App images, and bidding on search terms. It is then crucial for App images to be highly relevant with the search terms. Solutions to this problem require an image-text matching model to predict the quality of the match between the chosen image and the search terms. In this work, we present a novel approach to matching an App image to search terms based on fine-tuning a pre-trained LXMERT model. We show that compared to the CLIP model and a baseline using a Transformer model for search terms, and a ResNet model for images, we significantly improve the matching accuracy. We evaluate our approach using two sets of labels: advertiser associated (image, search term) pairs for a given application, and human ratings for the relevance between (image, search term) pairs. Our approach achieves 0.96 AUC score for advertiser associated ground truth, outperforming the transformer+ResNet baseline and the fine-tuned CLIP model by 8% and 14%. For human labeled ground truth, our approach achieves 0.95 AUC score, outperforming the transformer+ResNet baseline and the fine-tuned CLIP model by 16% and 17%.
Related papers
- Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment [0.7499722271664144]
Contrastive Language and Image Pairing (CLIP) is a transformative method in multimedia retrieval.
CLIP typically trains two neural networks concurrently to generate joint embeddings for text and image pairs.
This paper addresses the challenge of optimizing CLIP models for various image-based similarity search scenarios.
arXiv Detail & Related papers (2024-09-03T14:33:01Z) - Few-Shot Anomaly Detection via Category-Agnostic Registration Learning [65.64252994254268]
Most existing anomaly detection methods require a dedicated model for each category.
This article proposes a novel few-shot AD (FSAD) framework.
It is the first FSAD method that requires no model fine-tuning for novel categories.
arXiv Detail & Related papers (2024-06-13T05:01:13Z) - Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation [52.509092010267665]
We introduce LlamaGen, a new family of image generation models that apply original next-token prediction'' paradigm of large language models to visual generation domain.
It is an affirmative answer to whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals can achieve state-of-the-art image generation performance if scaling properly.
arXiv Detail & Related papers (2024-06-10T17:59:52Z) - Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval [55.90407811819347]
We consider the task of paraphrased text-to-image retrieval where a model aims to return similar results given a pair of paraphrased queries.
We train a dual-encoder model starting from a language model pretrained on a large text corpus.
Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries.
arXiv Detail & Related papers (2024-05-06T06:30:17Z) - Multi-method Integration with Confidence-based Weighting for Zero-shot Image Classification [1.7265013728931]
This paper introduces a novel framework for zero-shot learning (ZSL) to recognize new categories that are unseen during training.
We propose three strategies to enhance the model's performance to handle ZSL.
arXiv Detail & Related papers (2024-05-03T15:02:41Z) - Fine-tuning CLIP Text Encoders with Two-step Paraphrasing [83.3736789315201]
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z) - Improving Zero-shot Generalization and Robustness of Multi-modal Models [70.14692320804178]
Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks.
We investigate the reasons for this performance gap and find that many of the failure cases are caused by ambiguity in the text prompts.
We propose a simple and efficient way to improve accuracy on such uncertain images by making use of the WordNet hierarchy.
arXiv Detail & Related papers (2022-12-04T07:26:24Z) - One-Time Model Adaptation to Heterogeneous Clients: An Intra-Client and
Inter-Image Attention Design [40.97593636235116]
We propose a new intra-client and inter-image attention (ICIIA) module into existing backbone recognition models.
In particular, given a target image from a certain client, ICIIA introduces multi-head self-attention to retrieve relevant images from the client's historical unlabeled images.
We evaluate ICIIA using 3 different recognition tasks with 9 backbone models over 5 representative datasets.
arXiv Detail & Related papers (2022-11-11T15:33:21Z) - Identical Image Retrieval using Deep Learning [0.0]
We are using the BigTransfer Model, which is a state-of-art model itself.
We extract the key features and train on the K-Nearest Neighbor model to obtain the nearest neighbor.
The application of our model is to find similar images, which are hard to achieve through text queries within a low inference time.
arXiv Detail & Related papers (2022-05-10T13:34:41Z) - Progressive Learning for Image Retrieval with Hybrid-Modality Queries [48.79599320198615]
Image retrieval with hybrid-modality queries, also known as composing text and image for image retrieval (CTI-IR)
We decompose the CTI-IR task into a three-stage learning problem to progressively learn the complex knowledge for image retrieval with hybrid-modality queries.
Our proposed model significantly outperforms state-of-the-art methods in the mean of Recall@K by 24.9% and 9.5% on the Fashion-IQ and Shoes benchmark datasets respectively.
arXiv Detail & Related papers (2022-04-24T08:10:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.