Fugu-MT Paper Translation (Abstract): I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors

Paper abstract: I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors

arxiv url: http://arxiv.org/abs/2305.14724v1
Date: Wed, 24 May 2023 05:01:10 GMT
Status: Translation complete
Last updated in system: 2023-05-25 19:40:38.441934
Title: I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors
Title (reference translation): I Spy a Metaphor: Large Language Models and Diffusion Models
Authors: Tuhin Chakrabarty, Arkadiy Saakyan, Olivia Winn, Artemis Panagopoulou, Yue Yang, Marianna Apidianaki, Smaranda Muresan
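Abstract summary: We propose a new task of generating visual metaphors from linguistic metaphors. This is a challenging task for diffusion-based text-to-image models because it requires the ability to model implicit meaning and compositionality. We create a high-quality dataset containing 6,476 visual metaphors for 1,540 linguistic metaphors and their associated visual elaborations.
Reference score (attention score computed independently by this site): 38.70166865926743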
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Visual metaphors are powerful rhetorical devices used to persuade or communicate creative ideas through images. Similar to linguistic metaphors, they convey meaning implicitly through symbolism and juxtaposition of the symbols. We propose a new task of generating visual metaphors from linguistic metaphors. This is a challenging task for diffusion-based text-to-image models, such as DALL$\cdot$E 2, since it requires the ability to model implicit meaning and compositionality. We propose to solve the task through the collaboration between Large Language Models (LLMs) and Diffusion Models: Instruct GPT-3 (davinci-002) with Chain-of-Thought prompting generates text that represents a visual elaboration of the linguistic metaphor containing the implicit meaning and relevant objects, which is then used as input to the diffusion-based text-to-image models. Using a human-AI collaboration framework, where humans interact both with the LLM and the top-performing diffusion model, we create a high-quality dataset containing 6,476 visual metaphors for 1,540 linguistic metaphors and their associated visual elaborations. Evaluation by professional illustrators shows the promise of LLM-Diffusion Model collaboration for this task. To evaluate the utility of our Human-AI collaboration framework and the quality of our dataset, we perform both an intrinsic human-based evaluation and an extrinsic evaluation using visual entailment as a downstream task.
Abstract (reference translation): Visual metaphors are powerful rhetorical devices used to persuade or communicate creative ideas through images. Like linguistic metaphors, they convey meaning implicitly through symbolism and the juxtaposition of symbols. We propose a new task of generating visual metaphors from linguistic metaphors. This is a challenging task for diffusion-based text-to-image models such as DALL$\cdot$E 2, because it requires modeling implicit meaning and compositionality. We propose to solve the task through the collaboration between Large Language Models (LLMs) and Diffusion Models: Instruct GPT-3 (davinci-002) with Chain-of-Thought prompting generates text that represents a visual elaboration of the linguistic metaphor containing the implicit meaning and relevant objects, which is then used as input to the diffusion-based text-to-image models. Using a human-AI collaboration framework, where humans interact both with the LLM and the top-performing diffusion model, we create a high-quality dataset containing 6,476 visual metaphors for 1,540 linguistic metaphors and their associated visual elaborations. Evaluation by professional illustrators shows the promise of LLM-Diffusion Model collaboration for this task. To evaluate the utility of the human-AI collaboration framework and the quality of the dataset, we perform both an intrinsic human-based evaluation and an extrinsic evaluation using visual entailment as a downstream task.
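The two-stage flow described in the abstract (an LLM with Chain-of-Thought prompting produces a visual elaboration, which is then passed to a text-to-image model) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt layout, the few-shot example, and the names `build_cot_prompt`, `extract_elaboration`, and `metaphor_to_image` are all hypothetical, and the LLM and the diffusion model are passed in as plain callables so that no specific API is assumed.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical few-shot example for illustration; not the paper's actual prompt.
FEW_SHOT: List[Dict[str, str]] = [
    {
        "metaphor": "My bedroom is a pigsty.",
        "reasoning": "A pigsty is dirty and messy, so the bedroom is implied to be very messy.",
        "elaboration": "A messy bedroom with clothes and trash scattered across the floor.",
    },
]


def build_cot_prompt(metaphor: str, examples: List[Dict[str, str]] = FEW_SHOT) -> str:
    """Assemble a few-shot prompt that asks for reasoning before the elaboration."""
    parts = [
        f"Metaphor: {ex['metaphor']}\n"
        f"Reasoning: {ex['reasoning']}\n"
        f"Visual elaboration: {ex['elaboration']}\n"
        for ex in examples
    ]
    # Leave the last entry open so the model continues with its own reasoning.
    parts.append(f"Metaphor: {metaphor}\nReasoning:")
    return "\n".join(parts)


def extract_elaboration(completion: str) -> str:
    """Return the text after the 'Visual elaboration:' marker, if present."""
    _, found, tail = completion.partition("Visual elaboration:")
    return (tail if found else completion).strip()


def metaphor_to_image(
    metaphor: str,
    llm: Callable[[str], str],            # assumed: prompt -> completion text
    text_to_image: Callable[[str], object],  # assumed: prompt -> image object
) -> Tuple[str, object]:
    """Stage 1: LLM writes a visual elaboration. Stage 2: diffusion model renders it."""
    completion = llm(build_cot_prompt(metaphor))
    elaboration = extract_elaboration(completion)
    return elaboration, text_to_image(elaboration)
```

In practice, `llm` and `text_to_image` would wrap whichever LLM and diffusion model are available. The human-in-the-loop step mentioned in the abstract would sit between the two stages, for example by letting a person edit the elaboration before it is rendered.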

Related papers

Vision-Speech Models: Teaching Speech Models to Converse about Images [67.62394024470528]
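We introduce MoshiVis, which augments a recent dialogue speech LLM, Moshi, with visual inputs through lightweight adaptation modules. An additional dynamic gating mechanism lets the model switch more easily between the visual inputs and unrelated conversation topics. We evaluate the model on downstream visual understanding tasks with both audio and text prompts, and report qualitative samples of interactions with MoshiVis.
Paper reference translation (metadata) (2025-03-19T18:40:45Z)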
Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
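We introduce DiffusionHOI, a new HOI detector that sheds light on text-to-image diffusion models. First, we devise an inversion-based strategy to learn representations of relation patterns between humans and objects in the embedding space. These learned relation embeddings then serve as textual prompts that steer the diffusion model to generate images depicting specific interactions.
Paper reference translation (metadata) (2024-10-26T12:00:33Z)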
Compositional Entailment Learning for Hyperbolic Vision-Language Models [54.41927525264365]
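We show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs. We propose Compositional Entailment Learning for hyperbolic vision-language models. Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed method outperforms conventional Euclidean CLIP learning.
Paper reference translation (metadata) (2024-10-09T14:12:50Z)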
A framework for annotating and modelling intentions behind metaphor use [12.40493670580608]
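We propose a novel taxonomy of intentions attributed to metaphor use, consisting of nine categories. We also release the first dataset annotated with the intentions behind metaphor use. Using this dataset, we test the ability of large language models (LLMs) to infer the intentions behind metaphor use in zero-shot and in-context few-shot settings.
Paper reference translation (metadata) (2024-07-04T14:13:57Z)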
Unveiling the Invisible: Captioning Videos with Metaphors [43.53477124719281]
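This paper introduces a Vision-Language (VL) task. To facilitate this task, we construct and release a dataset with 705 videos and 2,115 human-written captions. We also propose GIT-LLaVA, a low-resource video metaphor captioning system that achieves performance comparable to SoTA video language models on the proposed task.
Paper reference translation (metadata) (2024-06-07T12:32:44Z)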
In-Context Learning Unlocked for Diffusion Models [163.54453915874402]
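We present Prompt Diffusion, a framework for enabling in-context learning in diffusion-based generative models. We propose a vision-language prompt that can model a wide range of vision-language tasks, and a diffusion model that takes it as input. The resulting Prompt Diffusion model is the first diffusion-based vision-language foundation model capable of in-context learning.
Paper reference translation (metadata) (2023-05-01T23:03:37Z)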
IRFL: Image Recognition of Figurative Language [20.472997304393413]
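Figures of speech are often conveyed through multiple modalities (e.g., both text and images). We develop the Image Recognition of Figurative Language (IRFL) dataset. We introduce two novel tasks as a benchmark for multimodal figurative language understanding.
Paper reference translation (metadata) (2023-03-27T17:59:55Z)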
MetaCLUE: Towards Comprehensive Visual Metaphors Research [43.604408485890275]
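We introduce MetaCLUE, a set of vision tasks on visual metaphor. Based on our annotations, we perform a comprehensive analysis of state-of-the-art models in vision and language. We hope this work provides a concrete step toward developing AI systems with human-like creative capabilities.
Paper reference translation (metadata) (2022-12-19T22:41:46Z)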
Metaphor Generation with Conceptual Mappings [58.61307123799594]
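We aim to generate a metaphoric sentence from a literal expression by replacing relevant verbs. We propose to control the generation process by encoding conceptual mappings between cognitive domains. We show that the unsupervised CM-Lex model is competitive with recent deep learning metaphor generation systems.
Paper reference translation (metadata) (2021-06-02T15:27:05Z)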
MERMAID: Metaphor Generation with Symbolism and Discriminative Decoding [22.756157298168127]
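We propose an approach to automatically construct a parallel corpus based on a theoretically grounded connection between metaphors and symbols. For the generation task, we incorporate a metaphor discriminator to guide the decoding of a sequence-to-sequence model fine-tuned on the parallel data. Task-based evaluation shows that human-written poems enhanced with metaphors are preferred 68% of the time over poems without metaphors.
Paper reference translation (metadata) (2021-03-11T16:39:19Z)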
Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
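We design a probing model to evaluate how effective text-only representations are at distinguishing between matching and non-matching visual representations. Our results show that language representations alone provide a strong signal for retrieving image patches from the correct object categories. Visually grounded language models slightly outperform text-only language models in instance retrieval, but perform far below humans.
Paper reference translation (metadata) (2020-05-01T21:28:28Z)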

The related papers list is generated automatically from the titles and abstracts of papers on this site.

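This page shows information on the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and assumes no responsibility for any results arising from the use of this site (including all information and translations).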