Related papers: Concept Drift Guided LayerNorm Tuning for Efficient Multimodal Metaphor Identification

URL: http://arxiv.org/abs/2505.11237v1
Date: Fri, 16 May 2025 13:27:57 GMT
Title: Concept Drift Guided LayerNorm Tuning for Efficient Multimodal Metaphor Identification
Authors: Wenhao Qian, Zhenzhen Hu, Zijie Song, Jia Li
Abstract summary: This paper introduces Concept Drift Guided LayerNorm Tuning (CDGLT), a novel and training-efficient framework for multimodal metaphor identification. CDGLT achieves state-of-the-art performance on the MET-Meme benchmark while significantly reducing training costs compared to existing generative methods.
Score: 14.958038983995008
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Metaphorical imagination, the ability to connect seemingly unrelated concepts, is fundamental to human cognition and communication. While understanding linguistic metaphors has advanced significantly, grasping multimodal metaphors, such as those found in internet memes, presents unique challenges due to their unconventional expressions and implied meanings. Existing methods for multimodal metaphor identification often struggle to bridge the gap between literal and figurative interpretations. Additionally, generative approaches that utilize large language models or text-to-image models, while promising, suffer from high computational costs. This paper introduces Concept Drift Guided LayerNorm Tuning (CDGLT), a novel and training-efficient framework for multimodal metaphor identification. CDGLT incorporates two key innovations: (1) Concept Drift, a mechanism that leverages Spherical Linear Interpolation (SLERP) of cross-modal embeddings from a CLIP encoder to generate a new, divergent concept embedding. This drifted concept helps to alleviate the gap between literal features and the figurative task. (2) A prompt construction strategy that adapts the method of feature extraction and fusion using pre-trained language models for the multimodal metaphor identification task. CDGLT achieves state-of-the-art performance on the MET-Meme benchmark while significantly reducing training costs compared to existing generative methods. Ablation studies demonstrate the effectiveness of both Concept Drift and our adapted LN Tuning approach. Our method represents a significant step towards efficient and accurate multimodal metaphor understanding. The code is available at https://github.com/Qianvenh/CDGLT.
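The abstract names Spherical Linear Interpolation (SLERP) between cross-modal embeddings as the basis of Concept Drift. The sketch below shows only the generic SLERP formula on two vectors; it is not the authors' implementation. The function name `slerp`, the interpolation weight `t`, the 512-dimensional size, and the random vectors standing in for CLIP image and text embeddings are all assumptions made for illustration.

```python
import numpy as np


def slerp(a, b, t, eps=1e-8):
    """Generic spherical linear interpolation between two vectors.

    Both inputs are normalized to unit length first, so the result lies
    on the unit sphere. `t` = 0 returns the direction of `a`, `t` = 1
    the direction of `b`. Hypothetical helper, not taken from CDGLT.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)

    # Angle between the two unit vectors.
    cos_omega = np.clip(np.dot(a, b), -1.0, 1.0)
    omega = np.arccos(cos_omega)
    sin_omega = np.sin(omega)

    if sin_omega < eps:
        if cos_omega < 0:
            # Antipodal vectors: the great-circle path is not unique.
            raise ValueError("SLERP is undefined for antiparallel vectors")
        # Nearly parallel: fall back to linear interpolation, renormalized.
        out = (1.0 - t) * a + t * b
        return out / np.linalg.norm(out)

    return (np.sin((1.0 - t) * omega) * a + np.sin(t * omega) * b) / sin_omega


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random stand-ins for an image embedding and a text embedding.
    image_emb = rng.normal(size=512)
    text_emb = rng.normal(size=512)
    drifted = slerp(image_emb, text_emb, t=0.5)
    print(drifted.shape, np.linalg.norm(drifted))
```

Unlike plain linear interpolation, SLERP keeps the interpolated vector at unit norm, which matters when embeddings are compared by cosine similarity. How CDGLT chooses the interpolation weight and which embeddings it combines is described in the paper and its repository, not here.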

Related papers

Beyond Pixels: Visual Metaphor Transfer via Schema-Driven Agentic Reasoning [56.24016465596292]
A visual metaphor constitutes a high-order form of human creativity, employing cross-domain semantic fusion to transform abstract concepts into impactful visual rhetoric. We introduce the task of Visual Metaphor Transfer (VMT), which challenges models to autonomously decouple the "creative essence" from a reference image and re-materialize that abstract logic onto a user-specified subject. Our method significantly outperforms SOTA baselines in metaphor consistency, analogy appropriateness, and visual creativity, paving the way for automated high-impact creative applications in advertising and media.
arXiv Detail & Related papers (2026-02-01T17:01:36Z)
Query-Kontext: An Unified Multimodal Model for Image Generation and Editing [53.765351127477224]
Unified Multimodal Models (UMMs) have demonstrated remarkable performance in text-to-image generation (T2I) and editing (TI2I). We introduce Query-Kontext, a novel approach that bridges the VLM and diffusion model via a multimodal "kontext" composed of semantic cues and coarse-grained image conditions encoded from multimodal inputs. Experiments show that our approach matches strong unified baselines and even outperforms task-specific state-of-the-art methods in several cases.
arXiv Detail & Related papers (2025-09-30T17:59:46Z)
Metaphor identification using large language models: A comparison of RAG, prompt engineering, and fine-tuning [0.6524460254566904]
This study investigates the potential of large language models (LLMs) to automate metaphor identification in full texts. We compare three methods: (i) retrieval-augmented generation (RAG), where the model is provided with a codebook and instructed to annotate texts based on its rules and examples; (ii) prompt engineering, where we design task-specific verbal instructions; and (iii) fine-tuning, where the model is trained on hand-coded texts to optimize performance.
arXiv Detail & Related papers (2025-09-29T14:50:18Z)
Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations [33.11867433769496]
This paper presents a framework that attempts to unify visual understanding and generation within a shared semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model's (LLM) vocabulary. Experiments across benchmarks show that Tar matches or surpasses existing multimodal LLM methods, achieving faster convergence and greater training efficiency.
arXiv Detail & Related papers (2025-06-23T17:59:14Z)
Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints [15.541287957548771]
We propose a Coarse-to-fine Consistency Constraints Visual Grounding architecture. It integrates implicit and explicit modeling approaches within a two-stage framework. It outperforms state-of-the-art REC and RIS methods by a substantial margin.
arXiv Detail & Related papers (2025-01-12T04:30:13Z)
Towards Multimodal Metaphor Understanding: A Chinese Dataset and Model for Metaphor Mapping Identification [9.08615188602226]
We develop a Chinese multimodal metaphor advertisement dataset (namely CM3D) that includes annotations of specific target and source domains. We propose a Chain-of-Thought (CoT) Prompting-based Metaphor Mapping Identification Model (CPMMIM) which simulates the human cognitive process for identifying these mappings.
arXiv Detail & Related papers (2025-01-05T04:15:03Z)
Unified Generative and Discriminative Training for Multi-modal Large Language Models [88.84491005030316]
Generative training has enabled Vision-Language Models (VLMs) to tackle various complex tasks. Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval. This paper proposes a unified approach that integrates the strengths of both paradigms.
arXiv Detail & Related papers (2024-11-01T01:51:31Z)
Financial Models in Generative Art: Black-Scholes-Inspired Concept Blending in Text-to-Image Diffusion [57.03116054807942]
We introduce a novel approach for concept blending in pretrained text-to-image diffusion models. We derive a robust algorithm for concept blending that capitalizes on the Markovian dynamics of the Black-Scholes framework. Our work shows that financially inspired techniques can enhance text-to-image concept blending in generative AI.
arXiv Detail & Related papers (2024-05-22T14:25:57Z)
Non-confusing Generation of Customized Concepts in Diffusion Models [135.4385383284657]
We tackle the common challenge of inter-concept visual confusion in compositional concept generation using text-guided diffusion models (TGDMs). Existing customized generation methods only focus on fine-tuning the second stage while overlooking the first one. We propose a simple yet effective solution called CLIF: contrastive image-language fine-tuning.
arXiv Detail & Related papers (2024-05-11T05:01:53Z)
Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings [61.04460792203266]
We introduce VCoT, a novel method that leverages chain-of-thought prompting with vision-language grounding to bridge the logical gaps within sequential data. Our method uses visual guidance to generate synthetic multimodal infillings that add consistent and novel information to reduce the logical gaps for downstream tasks.
arXiv Detail & Related papers (2023-05-03T17:58:29Z)
Meta-Learning via Classifier(-free) Guidance [5.812784742024491]
State-of-the-art meta-learning techniques do not optimize for zero-shot adaptation to unseen tasks. We propose meta-learning techniques that use natural language guidance to achieve higher zero-shot performance.
arXiv Detail & Related papers (2022-10-17T11:09:35Z)
Metaphor Generation with Conceptual Mappings [58.61307123799594]
We aim to generate a metaphoric sentence given a literal expression by replacing relevant verbs. We propose to control the generation process by encoding conceptual mappings between cognitive domains. We show that the unsupervised CM-Lex model is competitive with recent deep learning metaphor generation systems.
arXiv Detail & Related papers (2021-06-02T15:27:05Z)
StEP: Style-based Encoder Pre-training for Multi-modal Image Synthesis [68.3787368024951]
We propose a novel approach for multi-modal Image-to-image (I2I) translation. We learn a latent embedding, jointly with the generator, that models the variability of the output domain. Specifically, we pre-train a generic style encoder using a novel proxy task to learn an embedding of images, from arbitrary domains, into a low-dimensional style latent space.
arXiv Detail & Related papers (2021-04-14T19:58:24Z)
MERMAID: Metaphor Generation with Symbolism and Discriminative Decoding [22.756157298168127]
Based on a theoretically-grounded connection between metaphors and symbols, we propose a method to automatically construct a parallel corpus. For the generation task, we incorporate a metaphor discriminator to guide the decoding of a sequence to sequence model fine-tuned on our parallel data. A task-based evaluation shows that human-written poems enhanced with metaphors are preferred 68% of the time compared to poems without metaphors.
arXiv Detail & Related papers (2021-03-11T16:39:19Z)
Linguistic Structure Guided Context Modeling for Referring Image Segmentation [61.701577239317785]
We propose a "gather-propagate-distribute" scheme to model multimodal context by cross-modal interaction. Our LSCM module builds a Dependency Parsing Tree Word Graph (DPT-WG) which guides all the words to include valid multimodal context of the sentence.
arXiv Detail & Related papers (2020-10-01T16:03:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.