KAFA: Rethinking Image Ad Understanding with Knowledge-Augmented Feature
Adaptation of Vision-Language Models
- URL: http://arxiv.org/abs/2305.18373v1
- Date: Sun, 28 May 2023 04:49:01 GMT
- Title: KAFA: Rethinking Image Ad Understanding with Knowledge-Augmented Feature
Adaptation of Vision-Language Models
- Authors: Zhiwei Jia and Pradyumna Narayana and Arjun R. Akula and Garima Pruthi
and Hao Su and Sugato Basu and Varun Jampani
- Abstract summary: We perform the first empirical study of image ad understanding through the lens of pre-trained vision-language models (VLMs)
We propose a simple feature adaptation strategy to effectively fuse multimodal information for image ads and further empower it with knowledge of real-world entities.
- Score: 40.54372699488922
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image ad understanding is a crucial task with wide real-world applications.
Although highly challenging with the involvement of diverse atypical scenes,
real-world entities, and reasoning over scene-texts, how to interpret image ads
is relatively under-explored, especially in the era of foundational
vision-language models (VLMs) featuring impressive generalizability and
adaptability. In this paper, we perform the first empirical study of image ad
understanding through the lens of pre-trained VLMs. We benchmark and reveal
practical challenges in adapting these VLMs to image ad understanding. We
propose a simple feature adaptation strategy to effectively fuse multimodal
information for image ads and further empower it with knowledge of real-world
entities. We hope our study draws more attention to image ad understanding
which is broadly relevant to the advertising industry.
Related papers
- What's in the Image? A Deep-Dive into the Vision of Vision Language Models [20.669971132114195]
Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in comprehending complex visual content.
In this paper, we conduct a thorough empirical analysis, focusing on attention modules across layers.
We reveal several key insights about how these models process visual data.
arXiv Detail & Related papers (2024-11-26T14:59:06Z) - FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity [68.15983300711355]
FineCAPTION is a novel VLM that can recognize arbitrary masks as referential inputs and process high-resolution images for compositional image captioning at different levels.
We introduce COMPOSITIONCAP, a new dataset for multi-grained region compositional image captioning, which introduces the task of compositional attribute-aware regional image captioning.
arXiv Detail & Related papers (2024-11-23T02:20:32Z) - OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding [112.87441334765693]
OMG-LLaVA is a new framework combining powerful pixel-level vision understanding with reasoning abilities.
It can accept various visual and text prompts for flexible user interaction.
OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model.
arXiv Detail & Related papers (2024-06-27T17:59:01Z) - An Introduction to Vision-Language Modeling [128.6223984157515]
The vision-language model (VLM) applications will significantly impact our relationship with technology.
We introduce what VLMs are, how they work, and how to train them.
Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.
arXiv Detail & Related papers (2024-05-27T15:01:23Z) - Enhancing Image Retrieval : A Comprehensive Study on Photo Search using
the CLIP Mode [0.27195102129095]
Photo search has witnessed significant advancements with the introduction of CLIP (Contrastive Language-Image Pretraining) model.
This abstract summarizes the foundational principles of CLIP and highlights its potential impact on advancing the field of photo search.
arXiv Detail & Related papers (2024-01-24T17:35:38Z) - VLMAE: Vision-Language Masked Autoencoder [21.97700040013084]
We propose a vision-language masked autoencoder framework (VLMAE) for vision-language pre-training.
VLMAE employs visual generative learning, facilitating the model to acquire fine-grained and unbiased features.
arXiv Detail & Related papers (2022-08-19T14:39:18Z) - Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z) - SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense
Reasoning [61.57887011165744]
multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
arXiv Detail & Related papers (2021-12-16T03:16:30Z) - Interpretable Visual Understanding with Cognitive Attention Network [20.991018495051623]
We propose a novel Cognitive Attention Network (CAN) for visual commonsense reasoning.
We first introduce an image-text fusion module to fuse information from images and text collectively.
Second, a novel inference module is designed to encode commonsense among image, query and response.
arXiv Detail & Related papers (2021-08-06T02:57:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.