KAFA: Rethinking Image Ad Understanding with Knowledge-Augmented Feature
Adaptation of Vision-Language Models
- URL: http://arxiv.org/abs/2305.18373v1
- Date: Sun, 28 May 2023 04:49:01 GMT
- Title: KAFA: Rethinking Image Ad Understanding with Knowledge-Augmented Feature
Adaptation of Vision-Language Models
- Authors: Zhiwei Jia and Pradyumna Narayana and Arjun R. Akula and Garima Pruthi
and Hao Su and Sugato Basu and Varun Jampani
- Abstract summary: We perform the first empirical study of image ad understanding through the lens of pre-trained vision-language models (VLMs).
We propose a simple feature adaptation strategy to effectively fuse multimodal information for image ads and further empower it with knowledge of real-world entities.
- Score: 40.54372699488922
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image ad understanding is a crucial task with wide real-world applications.
Although highly challenging, involving diverse atypical scenes, real-world
entities, and reasoning over scene text, the interpretation of image ads
remains relatively under-explored, especially in the era of foundational
vision-language models (VLMs) featuring impressive generalizability and
adaptability. In this paper, we perform the first empirical study of image ad
understanding through the lens of pre-trained VLMs. We benchmark and reveal
practical challenges in adapting these VLMs to image ad understanding. We
propose a simple feature adaptation strategy to effectively fuse multimodal
information for image ads and further empower it with knowledge of real-world
entities. We hope our study draws more attention to image ad understanding
which is broadly relevant to the advertising industry.
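The abstract gives no implementation details, so purely as an illustration of what a lightweight feature-adaptation layer over a frozen VLM could look like, the sketch below lets a pooled image embedding attend over scene-text and entity-knowledge embeddings, then matches the fused representation against candidate ad messages by cosine similarity. All names here (e.g. KnowledgeAugmentedAdapter, rank_ad_messages) are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeAugmentedAdapter(nn.Module):
    """Hypothetical adapter: fuses a frozen VLM image embedding with
    scene-text and entity-knowledge embeddings via cross-attention."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # The image embedding is the query; OCR'd scene-text and retrieved
        # entity-knowledge embeddings are the keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))

    def forward(self, img_feat: torch.Tensor, aux_tokens: torch.Tensor) -> torch.Tensor:
        # img_feat:   (B, D)    pooled image embedding from a frozen VLM
        # aux_tokens: (B, N, D) scene-text + knowledge embeddings
        q = img_feat.unsqueeze(1)                           # (B, 1, D)
        fused, _ = self.cross_attn(q, aux_tokens, aux_tokens)
        fused = self.proj(fused.squeeze(1)) + img_feat      # residual fusion
        return F.normalize(fused, dim=-1)

def rank_ad_messages(adapter, img_feat, aux_tokens, text_feats):
    """Score candidate ad-message embeddings (B, K, D) against the adapted
    image representation, CLIP-style (cosine similarity)."""
    ad_repr = adapter(img_feat, aux_tokens)                 # (B, D)
    text_feats = F.normalize(text_feats, dim=-1)
    return torch.einsum("bd,bkd->bk", ad_repr, text_feats)  # (B, K)
```

Keeping the VLM frozen and training only a small adapter like this is the usual appeal of feature adaptation over full fine-tuning: few trainable parameters and no risk of degrading the pre-trained representations.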
Related papers
- Improving Fine-grained Visual Understanding in VLMs through Text-Only Training [0.0]
We investigate the feasibility of enhancing fine-grained visual understanding in Vision-Language Models (VLMs) through text-only training.
We conduct comprehensive experiments on two distinct domains: fine-grained species classification and cultural visual understanding tasks.
Our findings demonstrate that text-only training can be comparable to conventional image-text training while significantly reducing computational costs.
arXiv Detail & Related papers (2024-12-17T14:18:50Z)
- COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training [49.2684130383925]
We propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training.
It integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework.
It consistently outperforms previous strong baselines on various zero-shot downstream tasks.
arXiv Detail & Related papers (2024-12-02T18:56:06Z)
- FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity [68.15983300711355]
FineCAPTION is a novel VLM that can recognize arbitrary masks as referential inputs and process high-resolution images for compositional image captioning at different levels.
We introduce COMPOSITIONCAP, a new dataset for multi-grained region compositional image captioning, which introduces the task of compositional attribute-aware regional image captioning.
arXiv Detail & Related papers (2024-11-23T02:20:32Z)
- OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding [112.87441334765693]
OMG-LLaVA is a new framework combining powerful pixel-level vision understanding with reasoning abilities.
It can accept various visual and text prompts for flexible user interaction.
OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model.
arXiv Detail & Related papers (2024-06-27T17:59:01Z)
- An Introduction to Vision-Language Modeling [128.6223984157515]
Vision-language model (VLM) applications will significantly impact our relationship with technology.
We introduce what VLMs are, how they work, and how to train them.
Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.
arXiv Detail & Related papers (2024-05-27T15:01:23Z)
- Enhancing Image Retrieval : A Comprehensive Study on Photo Search using the CLIP Mode [0.27195102129095]
Photo search has witnessed significant advancements with the introduction of the CLIP (Contrastive Language-Image Pretraining) model.
This abstract summarizes the foundational principles of CLIP and highlights its potential impact on advancing the field of photo search.
arXiv Detail & Related papers (2024-01-24T17:35:38Z)
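As a concrete, hedged illustration of the CLIP-based photo-search pattern this paper surveys, the sketch below embeds a few local photos and a text query with off-the-shelf CLIP weights and ranks the photos by cosine similarity; the checkpoint name, query, and file paths are placeholders, and this is not the paper's own pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Off-the-shelf CLIP from Hugging Face; any CLIP checkpoint would do.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_paths = ["photo1.jpg", "photo2.jpg", "photo3.jpg"]  # placeholder files
images = [Image.open(p) for p in image_paths]

with torch.no_grad():
    # Embed the photos and the query into CLIP's shared space.
    img_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
    txt_emb = model.get_text_features(
        **processor(text=["a dog playing on the beach"], return_tensors="pt", padding=True)
    )

# L2-normalize so the dot product is cosine similarity, then rank the photos.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
scores = (txt_emb @ img_emb.T).squeeze(0)

for path, score in sorted(zip(image_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {path}")
```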
- VLMAE: Vision-Language Masked Autoencoder [21.97700040013084]
We propose a vision-language masked autoencoder framework (VLMAE) for vision-language pre-training.
VLMAE employs visual generative learning, which helps the model acquire fine-grained and unbiased features.
arXiv Detail & Related papers (2022-08-19T14:39:18Z)
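The summary above does not spell out VLMAE's architecture; as a rough, hypothetical sketch of the "visual generative learning" idea, the snippet below computes a simplified masked-image-modeling loss that reconstructs the pixels of randomly selected patches, which is one common way such fine-grained visual supervision is implemented. None of the names come from the paper.

```python
import torch
import torch.nn as nn

def masked_patch_reconstruction_loss(patch_tokens, patch_pixels, decoder, mask_ratio=0.6):
    """Simplified masked-image-modeling loss: pick a random subset of patches
    and penalize pixel reconstruction error only there. Purely illustrative.

    patch_tokens: (B, N, D) encoder outputs for N image patches
    patch_pixels: (B, N, P) ground-truth flattened pixels per patch
    decoder:      maps D-dim tokens back to P-dim pixel patches
    """
    B, N, _ = patch_tokens.shape
    mask = torch.rand(B, N, device=patch_tokens.device) < mask_ratio  # True = masked
    pred = decoder(patch_tokens)                      # (B, N, P)
    loss = ((pred - patch_pixels) ** 2).mean(dim=-1)  # per-patch MSE
    return (loss * mask).sum() / mask.sum().clamp(min=1)

# Example usage with a toy linear decoder (dimensions are arbitrary).
decoder = nn.Linear(512, 16 * 16 * 3)
tokens = torch.randn(2, 196, 512)
pixels = torch.randn(2, 196, 16 * 16 * 3)
print(masked_patch_reconstruction_loss(tokens, pixels, decoder).item())
```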
- SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning [61.57887011165744]
Multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
arXiv Detail & Related papers (2021-12-16T03:16:30Z)
- Interpretable Visual Understanding with Cognitive Attention Network [20.991018495051623]
We propose a novel Cognitive Attention Network (CAN) for visual commonsense reasoning.
We first introduce an image-text fusion module to fuse information from images and text collectively.
Second, a novel inference module is designed to encode commonsense among the image, query, and response.
arXiv Detail & Related papers (2021-08-06T02:57:43Z)