Fugu-MT Paper Translation (Abstract): FlexMUSE: Multimodal Unification and Semantics Enhancement Framework with Flexible interaction for Creative Writing

Paper abstract: FlexMUSE: Multimodal Unification and Semantics Enhancement Framework with Flexible interaction for Creative Writing

arxiv url: http://arxiv.org/abs/2508.16230v1
Date: Fri, 22 Aug 2025 09:01:48 GMT
Status: Translation complete
Last updated in system: 2025-08-25 16:42:36.330078
Title: FlexMUSE: Multimodal Unification and Semantics Enhancement Framework with Flexible interaction for Creative Writing
Title (reference translation): FlexMUSE: A Multimodal Unification and Semantics Enhancement Framework with Flexible Interaction for Creative Writing
Authors: Jiahao Chen, Zhiyong Ma, Wenbiao Du, Qingyuan Chuai,
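Abstract summary: Multi-modal creative writing (MMCW) aims to produce illustrated articles. MMCW is an entirely new and more abstract challenge in which the textual and visual contexts are not strictly related to each other. The authors propose FlexMUSE, which uses a T2I module to enable optional visual input.
Reference score (independently computed attention measure): 4.587146567965601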
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multi-modal creative writing (MMCW) aims to produce illustrated articles. Unlike common multi-modal generative (MMG) tasks such as storytelling or caption generation, MMCW is an entirely new and more abstract challenge where textual and visual contexts are not strictly related to each other. Existing methods for related tasks can be forcibly migrated to this track, but they require specific modality inputs or costly training, and often suffer from semantic inconsistencies between modalities. Therefore, the main challenge lies in economically performing MMCW with flexible interactive patterns, where the semantics of the output modalities are better aligned. In this work, we propose FlexMUSE with a T2I module to enable optional visual input. FlexMUSE promotes creativity and emphasizes the unification between modalities by proposing the modality semantic alignment gating (msaGate) to restrict the textual input. Besides, an attention-based cross-modality fusion is proposed to augment the input features for semantic enhancement. The modality semantic creative direct preference optimization (mscDPO) within FlexMUSE is designed by extending the rejected samples to facilitate writing creativity. Moreover, to advance MMCW, we release a dataset called ArtMUSE, which contains around 3k calibrated text-image pairs. FlexMUSE achieves promising results, demonstrating its consistency, creativity, and coherence.
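Abstract (reference translation): Multi-modal creative writing (MMCW) aims to produce illustrated articles. Unlike common multi-modal generative (MMG) tasks such as storytelling or caption generation, MMCW is an entirely new and more abstract challenge in which the textual and visual contexts are not strictly related. Methods for existing related tasks can be forcibly migrated to this track, but they require specific modality inputs or costly training and often suffer from semantic inconsistencies between modalities. The main challenge is therefore to perform MMCW economically with flexible interaction patterns in which the semantics of the output modalities are better aligned. This paper proposes FlexMUSE, which uses a T2I module to enable optional visual input. FlexMUSE promotes creativity and emphasizes unification between modalities by proposing modality semantic alignment gating (msaGate) to restrict the textual input. In addition, an attention-based cross-modality fusion is proposed to augment the input features for semantic enhancement. The modality semantic creative direct preference optimization (mscDPO) within FlexMUSE is designed by extending the rejected samples to facilitate writing creativity. Furthermore, to advance MMCW, a dataset called ArtMUSE containing around 3k calibrated text-image pairs is released. FlexMUSE achieves promising results, demonstrating consistency, creativity, and coherence.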

Related papers