Fugu-MT paper translation (abstract): Visual Puns from Idioms: An Iterative LLM-T2IM-MLLM Framework

Paper summary: Visual Puns from Idioms: An Iterative LLM-T2IM-MLLM Framework

arxiv url: http://arxiv.org/abs/2511.22943v1
Date: Fri, 28 Nov 2025 07:30:58 GMT
Status: translation complete
Last updated in system: 2025-12-01 19:47:55.805571
Title: Visual Puns from Idioms: An Iterative LLM-T2IM-MLLM Framework
Title (reference translation): Visual Puns from Idioms: An Iterative LLM-T2IM-MLLM Framework
Authors: Kelaiti Xiao, Liang Yang, Dongyu Zhang, Paerhati Tulajiang, Hongfei Lin
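Abstract summary: We study idiom-based visual puns, images that align an idiom's literal and figurative meanings. We propose an iterative framework that coordinates a large language model (LLM), a text-to-image model (T2IM), and a multimodal LLM (MLLM). Given an idiom, the system iteratively generates detailed visual prompts, synthesizes an image, infers the idiom from the image, and refines the prompt until recognition succeeds or a step limit is reached.
Reference score (independently computed attention score): 22.840166101386625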
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We study idiom-based visual puns--images that align an idiom's literal and figurative meanings--and present an iterative framework that coordinates a large language model (LLM), a text-to-image model (T2IM), and a multimodal LLM (MLLM) for automatic generation and evaluation. Given an idiom, the system iteratively (i) generates detailed visual prompts, (ii) synthesizes an image, (iii) infers the idiom from the image, and (iv) refines the prompt until recognition succeeds or a step limit is reached. Using 1,000 idioms as inputs, we synthesize a corresponding dataset of visual pun images with paired prompts, enabling benchmarking of both generation and understanding. Experiments across 10 LLMs, 10 MLLMs, and one T2IM (Qwen-Image) show that MLLM choice is the primary performance driver: GPT achieves the highest accuracies, Gemini follows, and the best open-source MLLM (Gemma) is competitive with some closed models. On the LLM side, Claude attains the strongest average performance for prompt generation.
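Abstract (reference translation): We study idiom-based visual puns, images that align an idiom's literal and figurative meanings, and propose an iterative framework that coordinates a large language model (LLM), a text-to-image model (T2IM), and a multimodal LLM (MLLM) for automatic generation and evaluation. Given an idiom, the system iteratively (i) generates a detailed visual prompt, (ii) synthesizes an image, (iii) infers the idiom from the image, and (iv) refines the prompt until recognition succeeds or a step limit is reached. Using 1,000 idioms as inputs, we synthesize a corresponding dataset of visual pun images with paired prompts, enabling benchmarking of both generation and understanding. Experiments across 10 LLMs, 10 MLLMs, and one T2IM (Qwen-Image) show that MLLM choice is the primary performance driver. On the LLM side, Claude attains the highest average performance for prompt generation.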
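The four-step loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function and class names (`generate_visual_pun`, `PunResult`, `write_prompt`, `render`, `guess_idiom`), the default step limit of 5, and the case-insensitive exact-match success test are all assumptions made for the example. The three model calls are passed in as plain callables, and the `fake_*` functions are toy stand-ins so the sketch runs without any model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class PunResult:
    """Record of one idiom's generate-and-recognize run (hypothetical structure)."""
    idiom: str
    success: bool
    steps: int
    prompts: List[str] = field(default_factory=list)
    guesses: List[str] = field(default_factory=list)


def normalize(text: str) -> str:
    # Assumed matching rule: ignore case and extra whitespace.
    return " ".join(text.lower().split())


def generate_visual_pun(
    idiom: str,
    write_prompt: Callable[[str, Optional[str], Optional[str]], str],  # LLM role
    render: Callable[[str], bytes],                                    # T2IM role
    guess_idiom: Callable[[bytes], str],                               # MLLM role
    max_steps: int = 5,  # assumed step limit; the abstract gives no value
) -> PunResult:
    result = PunResult(idiom=idiom, success=False, steps=0)
    prev_prompt: Optional[str] = None
    prev_guess: Optional[str] = None
    for step in range(1, max_steps + 1):
        # (i) generate a prompt, or (iv) refine it using the previous attempt
        prompt = write_prompt(idiom, prev_prompt, prev_guess)
        image = render(prompt)      # (ii) synthesize an image
        guess = guess_idiom(image)  # (iii) infer the idiom from the image
        result.prompts.append(prompt)
        result.guesses.append(guess)
        result.steps = step
        if normalize(guess) == normalize(idiom):
            result.success = True
            break
        prev_prompt, prev_guess = prompt, guess
    return result


# Toy stand-ins for the three models, for illustration only.
def fake_llm(idiom: str, prev_prompt: Optional[str], prev_guess: Optional[str]) -> str:
    if prev_prompt is None:
        return f"A literal scene for '{idiom}'"
    return f"{prev_prompt}, revised because the viewer guessed '{prev_guess}'"


def fake_t2im(prompt: str) -> bytes:
    return prompt.encode("utf-8")  # pretend the prompt text is the image


def fake_mllm(image: bytes) -> str:
    # Recognizes the idiom only once the prompt has been revised.
    return "Spill the beans" if b"revised" in image else "knock over a jar"


if __name__ == "__main__":
    run = generate_visual_pun("spill the beans", fake_llm, fake_t2im, fake_mllm)
    print(run.success, run.steps, run.guesses)
```

Passing the models in as callables keeps the loop independent of any particular provider, which fits a setup where several LLMs and MLLMs are compared against a single T2IM.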

Related papers list