Higher-order interactions of multi-layer prompt
- URL: http://arxiv.org/abs/2510.09394v2
- Date: Thu, 16 Oct 2025 15:48:27 GMT
- Title: Higher-order interactions of multi-layer prompt
- Authors: Ziyu Zheng, Yaming Yang, Ziyu Guan, Wei Zhao, Xinyan Huang, Weigang Lu
- Abstract summary: The "pre-train, prompt" paradigm has successfully evolved in representation learning. Current prompt-tuning methods treat prompts as isolated, independent components across different network layers. We propose a novel framework that explicitly models the Higher-order Interactions of Multi-layer Prompt.
- Score: 22.205298192935313
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The "pre-train, prompt" paradigm has successfully evolved in representation learning. While current prompt-tuning methods often introduce learnable prompts, they predominantly treat prompts as isolated, independent components across different network layers. This overlooks the complex and synergistic higher-order interactions that exist between prompts at various hierarchical depths, consequently limiting the expressive power and semantic richness of the prompted model. To address this fundamental gap, we propose a novel framework that explicitly models the Higher-order Interactions of Multi-layer Prompt. Our approach conceptualizes prompts from different layers not as separate entities, but as a cohesive system where their inter-relationships are critical. We design an innovative interaction module that captures these sophisticated, non-linear correlations among multi-layer prompts, effectively modeling their cooperative effects. This allows the model to dynamically aggregate and refine prompt information across the network's depth, leading to a more integrated and powerful prompting strategy. Extensive experiments on eight benchmark datasets demonstrate that our method, by leveraging these higher-order interactions, consistently surpasses state-of-the-art prompt-tuning baselines. The performance advantage is particularly pronounced in few-shot scenarios, validating that capturing the intricate interplay between multi-layer prompts is key to unlocking more robust and generalizable representation learning.
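To make the idea concrete, below is a minimal PyTorch sketch of one way such a cross-layer interaction module could look: one learnable prompt per backbone layer is stacked with the others and refined jointly via self-attention plus a non-linear mixing step, so each layer's prompt reflects information from every other depth. The class name, dimensions, and architecture here are illustrative assumptions, not the authors' actual design.

```python
import torch
import torch.nn as nn

class MultiLayerPromptInteraction(nn.Module):
    """Illustrative sketch (not the paper's architecture): jointly refines
    one learnable prompt per backbone layer by letting all prompts attend
    to each other, capturing cross-layer correlations."""

    def __init__(self, num_layers: int, dim: int, num_heads: int = 4):
        super().__init__()
        # One learnable prompt vector per network layer.
        self.prompts = nn.Parameter(torch.randn(num_layers, dim) * 0.02)
        # Self-attention over the prompt stack models cross-layer
        # (and, through the non-linearity, higher-order) interactions.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mix = nn.Linear(dim, dim)   # non-linear refinement
        self.norm = nn.LayerNorm(dim)

    def forward(self) -> torch.Tensor:
        p = self.prompts.unsqueeze(0)    # (1, L, dim)
        out, _ = self.attn(p, p, p)      # prompts attend across layers
        refined = self.norm(p + torch.tanh(self.mix(out)))
        return refined.squeeze(0)        # (L, dim): one refined prompt per layer

# Usage: inject refined[i] as the prompt for layer i of a frozen backbone.
module = MultiLayerPromptInteraction(num_layers=12, dim=768)
print(module().shape)  # torch.Size([12, 768])
```

In standard deep prompt tuning, each layer's prompt is optimized independently; a module along these lines would replace those isolated prompts with jointly refined ones, which is the contrast the abstract draws.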
Related papers
- Agentic Conversational Search with Contextualized Reasoning via Reinforcement Learning [66.52010873968383]
We introduce a conversational agent that interleaves search and reasoning across turns, enabling exploratory and adaptive behaviors learned through reinforcement learning (RL) training. The experimental results across four widely used conversational benchmarks demonstrate the effectiveness of our methods.
arXiv Detail & Related papers (2026-01-19T14:55:54Z) - Enhancing Visual In-Context Learning by Multi-Faceted Fusion [6.852150407828682]
We introduce a novel framework that moves beyond single-prompt fusion toward a multi-combination collaborative fusion. Our method generates three contextual representation branches, each formed by integrating information from different combinations of top-quality prompts. Experiments on diverse tasks, including foreground segmentation, single-object detection, and image colorization, highlight its strong cross-task generalization.
arXiv Detail & Related papers (2026-01-15T06:25:09Z) - Guiding Mixture-of-Experts with Temporal Multimodal Interactions [30.728093182390364]
We propose a novel framework that guides MoE routing using quantified temporal interaction. A multimodal interaction-aware router learns to dispatch tokens to experts based on the nature of their interactions.
arXiv Detail & Related papers (2025-09-30T02:26:31Z) - UniAlignment: Semantic Alignment for Unified Image Generation, Understanding, Manipulation and Perception [54.53657134205492]
UniAlignment is a unified multimodal generation framework within a single diffusion transformer. It incorporates both intrinsic-modal semantic alignment and cross-modal semantic alignment, thereby enhancing the model's cross-modal consistency and instruction-following robustness. We present SemGen-Bench, a new benchmark specifically designed to evaluate multimodal semantic consistency under complex textual instructions.
arXiv Detail & Related papers (2025-09-28T09:11:30Z) - HeLoFusion: An Efficient and Scalable Encoder for Modeling Heterogeneous and Multi-Scale Interactions in Trajectory Prediction [11.30785902722196]
HeLoFusion is an efficient and scalable encoder for modeling heterogeneous and multi-scale agent interactions. Our work demonstrates that a locality-grounded architecture, which explicitly models multi-scale and heterogeneous interactions, is a highly effective strategy for advancing motion forecasting.
arXiv Detail & Related papers (2025-09-15T09:19:41Z) - Boosting Neural Language Inference via Cascaded Interactive Reasoning [38.125341836302525]
Natural Language Inference (NLI) focuses on ascertaining the logical relationship between a given premise and hypothesis. This task presents significant challenges due to inherent linguistic features such as diverse phrasing, semantic complexity, and contextual nuances. We introduce the Cascaded Interactive Reasoning Network (CIRN), a novel architecture designed for deeper semantic comprehension in NLI.
arXiv Detail & Related papers (2025-05-10T11:37:15Z) - Hierarchical Banzhaf Interaction for General Video-Language Representation Learning [60.44337740854767]
Multimodal representation learning plays an important role in the artificial intelligence domain. We introduce a new approach that models video-text as game players using multivariate cooperative game theory. We extend our original structure into a flexible encoder-decoder framework, enabling the model to adapt to various downstream tasks.
arXiv Detail & Related papers (2024-12-30T14:09:15Z) - Can Graph Neural Networks Learn Language with Extremely Weak Text Supervision? [62.12375949429938]
We propose a multi-modal prompt learning paradigm to adapt pre-trained Graph Neural Networks to downstream tasks and data. Our new paradigm embeds the graphs directly in the same space as the Large Language Models (LLMs) by learning both graph prompts and text prompts simultaneously. We build the first CLIP-style zero-shot classification prototype that can generalize GNNs to unseen classes with extremely weak text supervision.
arXiv Detail & Related papers (2024-12-11T08:03:35Z) - Instance-Aware Graph Prompt Learning [71.26108600288308]
We introduce Instance-Aware Graph Prompt Learning (IA-GPL) in this paper.
The process involves generating intermediate prompts for each instance using a lightweight architecture.
Experiments conducted on multiple datasets and settings showcase the superior performance of IA-GPL compared to state-of-the-art baselines.
arXiv Detail & Related papers (2024-11-26T18:38:38Z) - DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout. DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder. Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z) - A Pure Transformer Pretraining Framework on Text-attributed Graphs [50.833130854272774]
We introduce a feature-centric pretraining perspective by treating graph structure as a prior.
Our framework, Graph Sequence Pretraining with Transformer (GSPT), samples node contexts through random walks.
GSPT can be easily adapted to both node classification and link prediction, demonstrating promising empirical success on various datasets.
arXiv Detail & Related papers (2024-06-19T22:30:08Z) - Unified Multi-modal Unsupervised Representation Learning for
Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - ULTRA-DP: Unifying Graph Pre-training with Multi-task Graph Dual Prompt [67.8934749027315]
We propose a unified framework for graph hybrid pre-training which injects the task identification and position identification into GNNs.
We also propose a novel pre-training paradigm based on a group of $k$-nearest neighbors.
arXiv Detail & Related papers (2023-10-23T12:11:13Z) - Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.