Related papers: Cross-Modal Prototype Augmentation and Dual-Grained Prompt Learning for Social Media Popularity Prediction

URL: http://arxiv.org/abs/2508.16147v1
Date: Fri, 22 Aug 2025 07:16:47 GMT
Title: Cross-Modal Prototype Augmentation and Dual-Grained Prompt Learning for Social Media Popularity Prediction
Authors: Ao Zhou, Mingsheng Tu, Luping Wang, Tenghao Sun, Zifeng Cheng, Yafeng Yin, Zhiwei Jiang, Qing Gu
Abstract summary: Social Media Popularity Prediction is a complex task that requires effective integration of images, text, and structured information. We introduce hierarchical prototypes for structural enhancement and contrastive learning for improved vision-text alignment. We propose a feature-enhanced framework integrating dual-grained prompt learning and cross-modal attention mechanisms.
Score: 16.452218354378452
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Social Media Popularity Prediction is a complex multimodal task that requires effective integration of images, text, and structured information. However, current approaches suffer from inadequate visual-textual alignment and fail to capture the inherent cross-content correlations and hierarchical patterns in social media data. To overcome these limitations, we establish a multi-class framework, introducing hierarchical prototypes for structural enhancement and contrastive learning for improved vision-text alignment. Furthermore, we propose a feature-enhanced framework integrating dual-grained prompt learning and cross-modal attention mechanisms, achieving precise multimodal representation through fine-grained category modeling. Experimental results demonstrate state-of-the-art performance on benchmark metrics, establishing new reference standards for multimodal social media analysis.

Related papers

UniG2U-Bench: Do Unified Models Advance Multimodal Understanding? [50.92401586025528]
Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. We introduce UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks.
arXiv Detail & Related papers (2026-03-03T18:36:16Z)
Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery [15.933337984000346]
Generalized Category Discovery (GCD) aims to identify both known and unknown categories. We propose a novel and effective multi-modal representation learning framework for GCD via Semi-Supervised Rate Reduction. We conduct extensive experiments on generic and fine-grained benchmark datasets demonstrating superior performance of our approach.
arXiv Detail & Related papers (2026-02-23T14:51:09Z)
Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents [99.62178668680578]
We propose Vision-Centric Contrastive Learning (VC2L), a unified framework that models text, images, and their combinations using a single vision transformer. VC2L operates entirely in pixel space by rendering all inputs, whether textual, visual, or combined, as images. To capture complex cross-modal relationships in web documents, VC2L employs a snippet-level contrastive learning objective that aligns consecutive multimodal segments.
arXiv Detail & Related papers (2025-10-21T14:59:29Z)
Query-Kontext: An Unified Multimodal Model for Image Generation and Editing [53.765351127477224]
Unified Multimodal Models (UMMs) have demonstrated remarkable performance in text-to-image generation (T2I) and editing (TI2I). We introduce Query-Kontext, a novel approach that bridges the VLM and diffusion model via a multimodal "kontext" composed of semantic cues and coarse-grained image conditions encoded from multimodal inputs. Experiments show that our approach matches strong unified baselines and even outperforms task-specific state-of-the-art methods in several cases.
arXiv Detail & Related papers (2025-09-30T17:59:46Z)
UniAlignment: Semantic Alignment for Unified Image Generation, Understanding, Manipulation and Perception [54.53657134205492]
UniAlignment is a unified multimodal generation framework within a single diffusion transformer. It incorporates both intrinsic-modal semantic alignment and cross-modal semantic alignment, thereby enhancing the model's cross-modal consistency and instruction-following robustness. We present SemGen-Bench, a new benchmark specifically designed to evaluate multimodal semantic consistency under complex textual instructions.
arXiv Detail & Related papers (2025-09-28T09:11:30Z)
Structures Meet Semantics: Multimodal Fusion via Graph Contrastive Learning [8.187594234413568]
We propose a novel framework called the Structural-Semantic Unifier (SSU). SSU integrates modality-specific structural information and cross-modal semantic grounding for enhanced multimodal representations. SSU consistently achieves state-of-the-art performance while significantly reducing computational overhead.
arXiv Detail & Related papers (2025-08-24T13:44:54Z)
Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection [0.0]
Multi-modal learning has emerged as a crucial research direction. Existing approaches often suffer from insufficient cross-modal interactions and rigid fusion strategies. We propose Co-AttenDWG, co-attention with dimension-wise gating, and expert fusion. We show that Co-AttenDWG achieves state-of-the-art performance and superior cross-modal alignment.
arXiv Detail & Related papers (2025-05-25T07:26:00Z)
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation [54.588082888166504]
We present Mogao, a unified framework that enables interleaved multi-modal generation through a causal approach. Mogao integrates a set of key technical improvements in architecture design, including a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance. Experiments show that Mogao not only achieves state-of-the-art performance in multi-modal understanding and text-to-image generation, but also excels in producing high-quality, coherent interleaved outputs.
arXiv Detail & Related papers (2025-05-08T17:58:57Z)
A Multi-Granularity Retrieval Framework for Visually-Rich Documents [4.804551482123172]
We propose a unified multi-granularity multimodal retrieval framework tailored for two benchmark tasks: MMDocIR and M2KR. Our approach integrates hierarchical encoding strategies, modality-aware retrieval mechanisms, and vision-language model (VLM)-based candidate filtering. Our framework demonstrates robust performance without the need for task-specific fine-tuning.
arXiv Detail & Related papers (2025-05-01T02:40:30Z)
Unified Generative and Discriminative Training for Multi-modal Large Language Models [88.84491005030316]
Generative training has enabled Vision-Language Models (VLMs) to tackle various complex tasks. Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval. This paper proposes a unified approach that integrates the strengths of both paradigms.
arXiv Detail & Related papers (2024-11-01T01:51:31Z)
Advanced Multimodal Deep Learning Architecture for Image-Text Matching [33.8315200009152]
Image-text matching is a key multimodal task that aims to model the semantic association between images and text as a matching relationship. We introduce an advanced multimodal deep learning architecture, which combines the high-level abstract representation ability of deep neural networks for visual information with the advantages of natural language processing models for text semantic understanding. Experiments show that compared with existing image-text matching models, the optimized new model has significantly improved performance on a series of benchmark data sets.
arXiv Detail & Related papers (2024-06-13T08:32:24Z)
Multi-Modal Prototypes for Open-World Semantic Segmentation [37.84805778548119]
We propose to encompass textual and visual clues as multi-modal prototypes to allow more comprehensive support for semantic segmentation. We decompose the high-level language information as multi-aspect prototypes and aggregate the low-level visual information as more semantic prototypes. Based on an elastic mask prediction module, we are able to solve the zero-shot, few-shot and generalized counterpart tasks in one architecture.
arXiv Detail & Related papers (2023-07-05T03:27:31Z)
Cross-Media Keyphrase Prediction: A Unified Framework with Multi-Modality Multi-Head Attention and Image Wordings [63.79979145520512]
We explore the joint effects of texts and images in predicting the keyphrases for a multimedia post. We propose a novel Multi-Modality Multi-Head Attention (M3H-Att) to capture the intricate cross-media interactions. Our model significantly outperforms the previous state of the art based on traditional attention networks.
arXiv Detail & Related papers (2020-11-03T08:44:18Z)
BriNet: Towards Bridging the Intra-class and Inter-class Gaps in One-Shot Segmentation [84.2925550033094]
Few-shot segmentation focuses on the generalization of models to segment unseen object instances with limited training samples. We propose a framework, BriNet, to bridge the gaps between the extracted features of the query and support images. The effectiveness of our framework is demonstrated by experimental results, which outperforms other competitive methods.
arXiv Detail & Related papers (2020-08-14T07:45:50Z)
Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation. Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning. During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)

This list is automatically generated from the titles and abstracts of the papers on this site.