Fugu-MT Paper Translation (Abstract): mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval

Paper overview: mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval

arxiv url: http://arxiv.org/abs/2604.17054v1
Date: Sat, 18 Apr 2026 16:23:05 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.311599
Title: mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval
Title (reference translation): mEOL: A Training-Free, Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval
Authors: Kyeong Seon Kim, Baek Seong-Eun, Lee Jung-Mok, Tae-Hyun Oh
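Abstract summary: We propose a training-free, instruction-guided multimodal embedding framework. We control the direction of embeddings through modality-specific instructions and structural cues. Using a repurposed VGBench, we build the first text-to-SVG retrieval benchmark.
Reference score (independently computed attention score): 23.372578915400613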
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Scalable Vector Graphics (SVGs) function both as visual images and as structured code that encodes rich geometric and layout information, yet most methods rasterize them and discard this symbolic organization. At the same time, recent sentence embedding methods produce strong text representations but do not naturally extend to visual or structured modalities. We propose a training-free, instruction-guided multimodal embedding framework that uses a Multimodal Large Language Model (MLLM) to map text, raster images, and SVG code into an aligned embedding space. We control the direction of embeddings through modality-specific instructions and structural SVG cues, eliminating the need for learned projection heads or contrastive training. Our method has two key components: (1) Multimodal Explicit One-word Limitation (mEOL), which instructs the MLLM to summarize any multimodal input into a single token whose hidden state serves as a compact semantic embedding; and (2) a semantic SVG rewriting module that assigns meaningful identifiers and simplifies nested SVG elements through visual reasoning over the rendered image, exposing geometric and relational cues hidden in raw code. Using a repurposed VGBench, we build the first text-to-SVG retrieval benchmark and show that our training-free embeddings outperform encoder-based and training-based multimodal baselines. These results highlight prompt-level control as an effective alternative to parameter-level training for structure-aware multimodal retrieval. Project page: https://scene-the-ella.github.io/meol/
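Abstract (reference translation): Scalable Vector Graphics (SVGs) function both as visual images and as structured code that encodes rich geometric and layout information, yet most methods rasterize them and discard this symbolic organization. At the same time, recent sentence embedding methods produce strong text representations but do not naturally extend to visual or structured modalities. We propose a training-free, instruction-guided multimodal embedding framework that uses a Multimodal Large Language Model (MLLM) to map text, raster images, and SVG code into an aligned embedding space. We control the direction of embeddings through modality-specific instructions and structural SVG cues, eliminating the need for learned projection heads or contrastive training. Our method comprises (1) Multimodal Explicit One-word Limitation (mEOL), which instructs the MLLM to summarize any multimodal input into a single token whose hidden state serves as a compact semantic embedding, and (2) a semantic SVG rewriting module that assigns meaningful identifiers and simplifies nested SVG elements. Using a repurposed VGBench, we build the first text-to-SVG retrieval benchmark and show that our training-free embeddings outperform encoder-based and training-based multimodal baselines. These results highlight prompt-level control as an effective alternative to parameter-level training for structure-aware multimodal retrieval. Project page: https://scene-the-ella.github.io/meol/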
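
The retrieval idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The instruction templates, the function names (`build_meol_prompt`, `cosine`, `rank`), and the use of cosine similarity are all assumptions made for illustration, and the three-dimensional vectors are toy stand-ins for the hidden state an MLLM would produce at the single summary token.

```python
import math

# Hypothetical instruction templates. The paper's exact prompt wording is
# not given in the abstract, so these strings are illustrative only.
INSTRUCTIONS = {
    "text": 'This sentence: "{content}" means in one word:',
    "image": "This image means in one word:",
    "svg": 'This SVG code: "{content}" means in one word:',
}


def build_meol_prompt(modality, content=""):
    """Return a one-word-limitation prompt for the given modality."""
    return INSTRUCTIONS[modality].format(content=content)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query_vec, candidates):
    """Sort candidate ids by descending cosine similarity to the query."""
    return sorted(
        candidates,
        key=lambda cid: cosine(query_vec, candidates[cid]),
        reverse=True,
    )


# Toy vectors standing in for MLLM hidden states (assumed values).
svg_embeddings = {
    "house.svg": [0.9, 0.1, 0.0],
    "tree.svg": [0.1, 0.9, 0.2],
    "car.svg": [0.0, 0.2, 0.9],
}
query_embedding = [0.8, 0.2, 0.1]  # stand-in for an embedded text query

ranking = rank(query_embedding, svg_embeddings)
print(build_meol_prompt("text", "a small red house"))
print(ranking)
```

In a real pipeline, each prompt would be passed to the MLLM and the embedding read from the model's hidden state. Because text, images, and SVG code would share one embedding space, the same ranking function could serve every query and candidate modality.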


Related papers list