Improving Contextual Congruence Across Modalities for Effective
Multimodal Marketing using Knowledge-infused Learning
- URL: http://arxiv.org/abs/2402.03607v1
- Date: Tue, 6 Feb 2024 00:51:27 GMT
- Title: Improving Contextual Congruence Across Modalities for Effective
Multimodal Marketing using Knowledge-infused Learning
- Authors: Trilok Padhi, Ugur Kursuncu, Yaman Kumar, Valerie L. Shalin, Lane
Peterson Fronczek
- Abstract summary: Large Language Models (LLMs) and Large Vision Models (LVMs) are still limited in capturing holistic meaning with cross-modal semantic relationships.
We design a framework to couple explicit commonsense knowledge in the form of knowledge graphs with large VLMs to improve the performance of a downstream task.
Our approach enables the early detection of likely persuasive multi-modal campaigns and the assessment and augmentation of marketing theory.
- Score: 3.3281180957341117
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The prevalence of smart devices with the ability to capture moments in
multiple modalities has enabled users to experience multimodal information
online. However, Large Language Models (LLMs) and Large Vision Models (LVMs) are
still limited in capturing holistic meaning with cross-modal semantic relationships.
Without explicit, common sense knowledge (e.g., as a knowledge graph), Visual
Language Models (VLMs) only learn implicit representations by capturing
high-level patterns in vast corpora, missing essential contextual cross-modal
cues. In this work, we design a framework to couple explicit commonsense
knowledge in the form of knowledge graphs with large VLMs to improve the
performance of a downstream task, predicting the effectiveness of multi-modal
marketing campaigns. While the marketing application provides a compelling
metric for assessing our methods, our approach enables the early detection of
likely persuasive multi-modal campaigns and the assessment and augmentation of
marketing theory.
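To make the framework concrete, the following is a minimal, hedged sketch of the general idea: a commonsense knowledge-graph embedding is fused with a VLM's pooled image-text embedding and fed to a small classifier that predicts campaign effectiveness. All dimensions, module names, and the concatenation-based fusion are illustrative assumptions; the abstract does not specify the paper's exact architecture.

```python
# Minimal sketch (not the paper's exact architecture): fuse a VLM's pooled
# image-text embedding with a commonsense knowledge-graph embedding and
# classify campaign effectiveness. Dimensions and fusion-by-concatenation
# are illustrative assumptions.
import torch
import torch.nn as nn

class KnowledgeInfusedClassifier(nn.Module):
    def __init__(self, vlm_dim=768, kg_dim=200, hidden_dim=256, num_classes=2):
        super().__init__()
        # Project both sources into a shared space before fusing.
        self.vlm_proj = nn.Linear(vlm_dim, hidden_dim)
        self.kg_proj = nn.Linear(kg_dim, hidden_dim)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, vlm_emb, kg_emb):
        # vlm_emb: pooled embedding of the post's image and text from a VLM encoder
        # kg_emb: embedding of commonsense-KG entities linked to the post
        fused = torch.cat([self.vlm_proj(vlm_emb), self.kg_proj(kg_emb)], dim=-1)
        return self.classifier(fused)

# Toy usage with random features standing in for real VLM / KG encoders.
model = KnowledgeInfusedClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 200))
print(logits.shape)  # torch.Size([4, 2])
```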
Related papers
- TriMod Fusion for Multimodal Named Entity Recognition in Social Media [0.0]
We propose a novel approach that integrates textual, visual, and hashtag features (TriMod) for effective modality fusion.
We demonstrate the superiority of our approach over existing state-of-the-art methods, achieving significant improvements in precision, recall, and F1 score.
arXiv Detail & Related papers (2025-01-14T17:29:41Z)
- Turbo your multi-modal classification with contrastive learning [17.983460380784337]
In this paper, we propose a novel contrastive learning strategy, called Turbo, to promote multi-modal understanding.
Specifically, multi-modal data pairs are sent through the forward pass twice with different hidden dropout masks to get two different representations for each modality.
With these representations, we obtain multiple in-modal and cross-modal contrastive objectives for training.
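As a hedged illustration of the two-pass dropout idea described above (not the authors' Turbo implementation), each modality is encoded twice in train mode so dropout yields two views, which are then pulled together by in-modal and cross-modal InfoNCE-style objectives; encoder sizes and loss weighting are assumptions.

```python
# Hedged sketch of the two-pass dropout trick; not the authors' Turbo code.
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    # Matching rows of a and b are positives (the diagonal of the similarity matrix).
    logits = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T / temperature
    return F.cross_entropy(logits, torch.arange(a.size(0)))

text_enc = nn.Sequential(nn.Linear(300, 128), nn.Dropout(0.1))
image_enc = nn.Sequential(nn.Linear(512, 128), nn.Dropout(0.1))
text_enc.train(); image_enc.train()  # keep dropout active so each pass differs

text, image = torch.randn(8, 300), torch.randn(8, 512)
t1, t2 = text_enc(text), text_enc(text)      # two dropout views of the text batch
v1, v2 = image_enc(image), image_enc(image)  # two dropout views of the image batch

loss = (info_nce(t1, t2) + info_nce(v1, v2)      # in-modal objectives
        + info_nce(t1, v1) + info_nce(t2, v2))   # cross-modal objectives
loss.backward()
```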
arXiv Detail & Related papers (2024-09-14T03:15:34Z)
- Fine-tuning Multimodal Large Language Models for Product Bundling [53.01642741096356]
We introduce Bundle-MLLM, a novel framework that fine-tunes large language models (LLMs) through a hybrid item tokenization approach.
Specifically, we integrate textual, media, and relational data into a unified tokenization, introducing a soft separation token to distinguish between textual and non-textual tokens.
We propose a progressive optimization strategy that fine-tunes LLMs for disentangled objectives: 1) learning bundle patterns and 2) enhancing multimodal semantic understanding specific to product bundling.
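A rough sketch of the soft separation token idea, under the assumption that it is a learnable embedding placed between textual and non-textual token embeddings before the LLM; Bundle-MLLM's exact hybrid tokenization may differ.

```python
# Rough sketch of a "soft separation token": a learnable embedding inserted
# between textual and non-textual (media / relational item) token embeddings
# before the LLM. Dimensions and the plain concatenation are assumptions.
import torch
import torch.nn as nn

d = 512
sep = nn.Parameter(torch.randn(1, 1, d))  # the soft separation token
text_emb = torch.randn(2, 16, d)          # embedded prompt / textual tokens
item_emb = torch.randn(2, 5, d)           # projected media + relational item tokens

# Sequence handed to the LLM: [text tokens, <sep>, item tokens]
sequence = torch.cat([text_emb, sep.expand(2, -1, -1), item_emb], dim=1)
print(sequence.shape)  # torch.Size([2, 22, 512])
```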
arXiv Detail & Related papers (2024-07-16T13:30:14Z)
- SEED-Story: Multimodal Long Story Generation with Large Language Model [66.37077224696242]
SEED-Story is a novel method that leverages a Multimodal Large Language Model (MLLM) to generate extended multimodal stories.
We propose a multimodal attention sink mechanism to enable the generation of stories with up to 25 sequences (only 10 used in training) in a highly efficient autoregressive manner.
We present a large-scale and high-resolution dataset named StoryStream for training our model and quantitatively evaluating the task of multimodal story generation in various aspects.
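As a hedged, simplified sketch of an attention-sink style cache policy (a generic analogue, not SEED-Story's actual multimodal mechanism): the first few "sink" tokens plus a recent window are kept in the KV cache so long autoregressive generation stays cheap.

```python
# Hedged sketch of an attention-sink cache policy: keep the first few "sink"
# tokens plus a recent window of the KV cache. Window sizes are arbitrary;
# SEED-Story's multimodal attention sink is more involved than this.
import torch

def prune_kv_cache(keys, values, num_sink=4, window=512):
    # keys / values: (batch, heads, seq_len, head_dim)
    seq_len = keys.size(2)
    if seq_len <= num_sink + window:
        return keys, values
    keep = torch.cat([
        torch.arange(num_sink),                   # sink tokens (e.g. BOS / image prefix)
        torch.arange(seq_len - window, seq_len),  # most recent tokens
    ])
    return keys[:, :, keep], values[:, :, keep]

k, v = torch.randn(1, 8, 1000, 64), torch.randn(1, 8, 1000, 64)
k, v = prune_kv_cache(k, v)
print(k.shape)  # torch.Size([1, 8, 516, 64])
```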
arXiv Detail & Related papers (2024-07-11T17:21:03Z)
- Multi-modal Crowd Counting via a Broker Modality [64.5356816448361]
Multi-modal crowd counting involves estimating crowd density from both visual and thermal/depth images.
We propose a novel approach by introducing an auxiliary broker modality and frame the task as a triple-modal learning problem.
We devise a fusion-based method to generate this broker modality, leveraging a non-diffusion, lightweight counterpart of modern denoising diffusion-based fusion models.
arXiv Detail & Related papers (2024-07-10T10:13:11Z)
- SoMeLVLM: A Large Vision Language Model for Social Media Processing [78.47310657638567]
We introduce a Large Vision Language Model for Social Media Processing (SoMeLVLM).
SoMeLVLM is a cognitive framework equipped with five key capabilities including knowledge & comprehension, application, analysis, evaluation, and creation.
Our experiments demonstrate that SoMeLVLM achieves state-of-the-art performance in multiple social media tasks.
arXiv Detail & Related papers (2024-02-20T14:02:45Z)
- DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention [55.2825684201129]
DeepSpeed-VisualChat is designed to optimize Large Language Models (LLMs) by incorporating multi-modal capabilities.
Our framework is notable for (1) its open-source support for multi-round and multi-image dialogues, (2) introducing an innovative multi-modal causal attention mechanism, and (3) utilizing data blending techniques on existing datasets to assure seamless interactions.
arXiv Detail & Related papers (2023-09-25T17:53:29Z)
- CISum: Learning Cross-modality Interaction to Enhance Multimodal Semantic Coverage for Multimodal Summarization [2.461695698601437]
This paper proposes a multi-task cross-modality learning framework (CISum) to improve multimodal semantic coverage.
To obtain the visual semantics, we translate images into visual descriptions based on the correlation with text content.
Then, the visual description and text content are fused to generate the textual summary to capture the semantics of the multimodal content.
arXiv Detail & Related papers (2023-02-20T11:57:23Z)
- Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal Prediction for Multimodal Sentiment Analysis [19.07020276666615]
We propose a novel framework named MultiModal Contrastive Learning (MMCL) for multimodal representation to capture intra- and inter-modality dynamics simultaneously.
We also design two contrastive learning tasks, instance- and sentiment-based contrastive learning, to promote the process of prediction and learn more interactive information related to sentiment.
arXiv Detail & Related papers (2022-10-26T08:24:15Z)
- Visual Persuasion in COVID-19 Social Media Content: A Multi-Modal Characterization [30.710295617831015]
This work proposes a computational approach to analyze the outcome of persuasive information in multi-modal content.
It focuses on two aspects, popularity and reliability, in COVID-19-related news articles shared on Twitter.
arXiv Detail & Related papers (2021-12-05T02:15:01Z)
- MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conversation [32.15124603618625]
We propose a new model based on multimodal fused graph convolutional network, MMGCN, in this work.
MMGCN can not only make use of multimodal dependencies effectively, but also leverage speaker information to model inter-speaker and intra-speaker dependency.
We evaluate our proposed model on two public benchmark datasets, IEMOCAP and MELD, and the results prove the effectiveness of MMGCN.
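A hedged, simplified sketch of one fused-graph convolution step: utterance nodes from the text, audio, and video views share a single graph, speaker identity is appended as a node feature, and one normalized graph-convolution step is applied; the sizes and toy adjacency are assumptions rather than MMGCN's exact design.

```python
# Simplified, illustrative fused-graph convolution (not MMGCN's exact design).
import torch
import torch.nn as nn
import torch.nn.functional as F

num_utt, d = 6, 64
text_x, audio_x, video_x = (torch.randn(num_utt, d) for _ in range(3))
speaker = F.one_hot(torch.tensor([0, 1, 0, 1, 0, 1]), num_classes=2).float()

# 3 * num_utt nodes; edges connect the three modality views of each utterance.
x = torch.cat([torch.cat([m, speaker], dim=-1) for m in (text_x, audio_x, video_x)])
adj = torch.zeros(3 * num_utt, 3 * num_utt)
for i in range(num_utt):
    for a in range(3):
        for b in range(3):
            adj[a * num_utt + i, b * num_utt + i] = 1.0

deg_inv = adj.sum(-1).clamp(min=1).pow(-1).diag()  # D^-1 normalization
gcn = nn.Linear(d + 2, d)
h = torch.relu(gcn(deg_inv @ adj @ x))  # one message-passing step
print(h.shape)  # torch.Size([18, 64])
```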
arXiv Detail & Related papers (2021-07-14T15:37:02Z)
- Cross-Media Keyphrase Prediction: A Unified Framework with Multi-Modality Multi-Head Attention and Image Wordings [63.79979145520512]
We explore the joint effects of texts and images in predicting the keyphrases for a multimedia post.
We propose a novel Multi-Modality Multi-Head Attention (M3H-Att) to capture the intricate cross-media interactions.
Our model significantly outperforms the previous state of the art based on traditional attention networks.
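As a hedged illustration of cross-media attention in the spirit of M3H-Att, text-token queries attend over image-region features via standard multi-head attention; the shapes are arbitrary and the real model adds further interaction directions and image wordings.

```python
# Hedged sketch of cross-media attention: text queries attend to image regions.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
text_tokens = torch.randn(2, 20, 256)    # (batch, text_len, dim)
image_regions = torch.randn(2, 36, 256)  # (batch, num_regions, dim)

# Text attends to image regions; the pooled output would feed a keyphrase predictor.
fused, _ = attn(query=text_tokens, key=image_regions, value=image_regions)
pooled = fused.mean(dim=1)
print(pooled.shape)  # torch.Size([2, 256])
```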
arXiv Detail & Related papers (2020-11-03T08:44:18Z)