Enhancing Cross-Modal Contextual Congruence for Crowdfunding Success using Knowledge-infused Learning
- URL: http://arxiv.org/abs/2402.03607v2
- Date: Sun, 17 Nov 2024 21:40:50 GMT
- Title: Enhancing Cross-Modal Contextual Congruence for Crowdfunding Success using Knowledge-infused Learning
- Authors: Trilok Padhi, Ugur Kursuncu, Yaman Kumar, Valerie L. Shalin, Lane Peterson Fronczek
- Abstract summary: This work incorporates external commonsense knowledge from knowledge graphs to enhance the representation of multimodal data using compact Visual Language Models (VLMs).
Our results show that external commonsense knowledge bridges the semantic gap between text and image modalities, and that the knowledge-infused representations improve the predictive performance of models for campaign success over baselines without knowledge.
- Score: 3.1021397647755613
- Abstract: The digital landscape continually evolves with multimodality, enriching the online experience for users. Creators and marketers aim to weave subtle contextual cues from various modalities into congruent content to engage users with a harmonious message. This interplay of multimodal cues is often a crucial factor in attracting users' attention. However, this richness of multimodality presents a challenge to computational modeling, as the semantic contextual cues spanning modalities need to be unified to capture the true holistic meaning of the multimodal content. This contextual meaning is critical in attracting user engagement as it conveys the intended message of the brand or the organization. In this work, we incorporate external commonsense knowledge from knowledge graphs to enhance the representation of multimodal data using compact Visual Language Models (VLMs) and predict the success of multimodal crowdfunding campaigns. Our results show that external commonsense knowledge bridges the semantic gap between text and image modalities, and that the enhanced, knowledge-infused representations improve the predictive performance of models for campaign success over baselines without knowledge. Our findings highlight the significance of contextual congruence in online multimodal content for engaging and successful crowdfunding campaigns.
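The abstract does not pin the pipeline to specific components, so the following is a minimal sketch of the kind of knowledge-infused fusion it describes: CLIP is assumed as the compact VLM, a sentence encoder embeds verbalized commonsense facts (e.g., ConceptNet triples retrieved for the campaign text), and a simple concatenation-plus-MLP head predicts success. All model names and the fusion scheme are illustrative assumptions, not the authors' architecture.

```python
# Sketch of knowledge-infused multimodal fusion for campaign success prediction.
# Assumptions (not from the paper): CLIP as the compact VLM, a SentenceTransformer
# to embed retrieved commonsense facts, and concatenation + MLP as the fusion head.
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor
from sentence_transformers import SentenceTransformer

class KnowledgeInfusedCampaignClassifier(nn.Module):
    def __init__(self, clip_name="openai/clip-vit-base-patch32",
                 kg_encoder_name="all-MiniLM-L6-v2", hidden=256):
        super().__init__()
        self.vlm = CLIPModel.from_pretrained(clip_name)          # compact VLM
        self.processor = CLIPProcessor.from_pretrained(clip_name)
        self.kg_encoder = SentenceTransformer(kg_encoder_name)   # embeds verbalized KG facts
        dim = self.vlm.config.projection_dim                     # CLIP projection size
        kg_dim = self.kg_encoder.get_sentence_embedding_dimension()
        self.classifier = nn.Sequential(                         # fused representation -> success / failure
            nn.Linear(2 * dim + kg_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, images, texts, kg_facts):
        # Encode the campaign image and text with the VLM.
        inputs = self.processor(text=texts, images=images,
                                return_tensors="pt", padding=True, truncation=True)
        img_emb = self.vlm.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = self.vlm.get_text_features(input_ids=inputs["input_ids"],
                                             attention_mask=inputs["attention_mask"])
        # Average the embeddings of the retrieved commonsense facts (one list of
        # verbalized triples per campaign) into a single knowledge vector.
        kg_emb = torch.stack([
            torch.tensor(self.kg_encoder.encode(facts)).mean(dim=0) for facts in kg_facts])
        fused = torch.cat([img_emb, txt_emb, kg_emb], dim=-1)    # knowledge-infused representation
        return self.classifier(fused)                             # logits for campaign success
```

In practice, the per-campaign facts could be gathered by linking entities in the campaign text to a commonsense knowledge graph and verbalizing neighboring triples; the abstract leaves this retrieval step unspecified.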
Related papers
- TriMod Fusion for Multimodal Named Entity Recognition in Social Media [0.0]
We propose a novel approach that integrates textual, visual, and hashtag features (TriMod) for effective modality fusion.
We demonstrate the superiority of our approach over existing state-of-the-art methods, achieving significant improvements in precision, recall, and F1 score.
arXiv Detail & Related papers (2025-01-14T17:29:41Z)
- Turbo your multi-modal classification with contrastive learning [17.983460380784337]
In this paper, we propose a novel contrastive learning strategy, called $Turbo$, to promote multi-modal understanding.
Specifically, multi-modal data pairs are sent through the forward pass twice with different hidden dropout masks to get two different representations for each modality.
With these representations, we obtain multiple in-modal and cross-modal contrastive objectives for training (a minimal code sketch of this setup appears after this list).
arXiv Detail & Related papers (2024-09-14T03:15:34Z)
- Fine-tuning Multimodal Large Language Models for Product Bundling [53.01642741096356]
We introduce Bundle-MLLM, a novel framework that fine-tunes large language models (LLMs) through a hybrid item tokenization approach.
Specifically, we integrate textual, media, and relational data into a unified tokenization, introducing a soft separation token to distinguish between textual and non-textual tokens.
We propose a progressive optimization strategy that fine-tunes LLMs for disentangled objectives: 1) learning bundle patterns and 2) enhancing multimodal semantic understanding specific to product bundling.
arXiv Detail & Related papers (2024-07-16T13:30:14Z)
- SEED-Story: Multimodal Long Story Generation with Large Language Model [66.37077224696242]
SEED-Story is a novel method that leverages a Multimodal Large Language Model (MLLM) to generate extended multimodal stories.
We propose a multimodal attention sink mechanism to enable the generation of stories with up to 25 sequences (only 10 for training) in a highly efficient autoregressive manner.
We present a large-scale, high-resolution dataset named StoryStream for training our model and quantitatively evaluating the task of multimodal story generation in various aspects.
arXiv Detail & Related papers (2024-07-11T17:21:03Z)
- Multi-modal Crowd Counting via a Broker Modality [64.5356816448361]
Multi-modal crowd counting involves estimating crowd density from both visual and thermal/depth images.
We propose a novel approach by introducing an auxiliary broker modality and framing the task as a triple-modal learning problem.
We devise a fusion-based method to generate this broker modality, leveraging a non-diffusion, lightweight counterpart of modern denoising diffusion-based fusion models.
arXiv Detail & Related papers (2024-07-10T10:13:11Z)
- SoMeLVLM: A Large Vision Language Model for Social Media Processing [78.47310657638567]
We introduce a Large Vision Language Model for Social Media Processing (SoMeLVLM).
SoMeLVLM is a cognitive framework equipped with five key capabilities: knowledge & comprehension, application, analysis, evaluation, and creation.
Our experiments demonstrate that SoMeLVLM achieves state-of-the-art performance in multiple social media tasks.
arXiv Detail & Related papers (2024-02-20T14:02:45Z)
- DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention [55.2825684201129]
DeepSpeed-VisualChat is designed to optimize Large Language Models (LLMs) by incorporating multi-modal capabilities.
Our framework is notable for (1) its open-source support for multi-round and multi-image dialogues, (2) introducing an innovative multi-modal causal attention mechanism, and (3) utilizing data blending techniques on existing datasets to assure seamless interactions.
arXiv Detail & Related papers (2023-09-25T17:53:29Z)
- Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal Prediction for Multimodal Sentiment Analysis [19.07020276666615]
We propose a novel framework named MultiModal Contrastive Learning (MMCL) for multimodal representation to capture intra- and inter-modality dynamics simultaneously.
We also design two contrastive learning tasks, instance- and sentiment-based contrastive learning, to promote the prediction process and learn more interactive information related to sentiment.
arXiv Detail & Related papers (2022-10-26T08:24:15Z)
- Visual Persuasion in COVID-19 Social Media Content: A Multi-Modal Characterization [30.710295617831015]
This work proposes a computational approach to analyze the outcome of persuasive information in multi-modal content.
It focuses on two aspects, popularity and reliability, in COVID-19-related news articles shared on Twitter.
arXiv Detail & Related papers (2021-12-05T02:15:01Z)
- MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conversation [32.15124603618625]
In this work, we propose MMGCN, a new model based on a multimodal fused graph convolutional network.
MMGCN not only makes effective use of multimodal dependencies but also leverages speaker information to model inter-speaker and intra-speaker dependencies.
We evaluate our proposed model on two public benchmark datasets, IEMOCAP and MELD, and the results prove the effectiveness of MMGCN.
arXiv Detail & Related papers (2021-07-14T15:37:02Z)
- Cross-Media Keyphrase Prediction: A Unified Framework with Multi-Modality Multi-Head Attention and Image Wordings [63.79979145520512]
We explore the joint effects of texts and images in predicting the keyphrases for a multimedia post.
We propose a novel Multi-Modality Multi-Head Attention (M3H-Att) mechanism to capture the intricate cross-media interactions.
Our model significantly outperforms the previous state of the art based on traditional attention networks.
arXiv Detail & Related papers (2020-11-03T08:44:18Z)
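Of the entries above, the Turbo abstract describes its mechanism concretely enough for a small illustration: each modality is encoded twice with dropout active, giving two views per modality, and contrastive objectives are formed both within and across modalities. The sketch below is a minimal rendering of that idea, assuming generic text/image encoders that return one embedding per example and an InfoNCE-style loss; these choices are assumptions, not details taken from that paper.

```python
# Minimal sketch of a double-forward contrastive setup: two dropout-perturbed
# views per modality, with in-modal and cross-modal InfoNCE-style objectives.
# Encoders are assumed to map a batch of inputs to a [batch, dim] tensor.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """InfoNCE loss treating matching rows of a and b as positive pairs."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

def turbo_style_losses(text_encoder, image_encoder, text_inputs, image_inputs):
    # Keep dropout active so the two passes yield different representations.
    text_encoder.train()
    image_encoder.train()
    t1, t2 = text_encoder(text_inputs), text_encoder(text_inputs)
    v1, v2 = image_encoder(image_inputs), image_encoder(image_inputs)
    in_modal = info_nce(t1, t2) + info_nce(v1, v2)      # same modality, different dropout masks
    cross_modal = info_nce(t1, v1) + info_nce(t2, v2)   # paired text/image views
    return in_modal + cross_modal
```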