Fugu-MT Paper Translation (Summary): PolyFrame at MWE-2026 AdMIRe 2: When Words Are Not Enough: Multimodal Idiom Disambiguation

Paper summary: PolyFrame at MWE-2026 AdMIRe 2: When Words Are Not Enough: Multimodal Idiom Disambiguation

arxiv url: http://arxiv.org/abs/2602.18652v1
Date: Fri, 20 Feb 2026 23:07:55 GMT
Status: Translation complete
Last updated in system: 2026-02-24 17:42:02.224656
Title: PolyFrame at MWE-2026 AdMIRe 2: When Words Are Not Enough: Multimodal Idiom Disambiguation
Authors: Nina Hosseini-Kivanani
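Abstract summary: PolyFrame is a unified pipeline for both image+text ranking (Subtask A) and text-only caption ranking (Subtask B). All model variants keep the CLIP-style vision-language encoders and the multilingual BGE M3 encoder frozen and train only lightweight modules. On the multilingual blind test, the systems achieved average Top-1/NDCG scores of 0.35/0.73 for Subtask A and 0.32/0.71 for Subtask B.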
Reference score (independently computed attention score): 0.533024001730262
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal models struggle with idiomatic expressions due to their non-compositional meanings, a challenge amplified in multilingual settings. We introduced PolyFrame, our system for the MWE-2026 AdMIRe2 shared task on multimodal idiom disambiguation, featuring a unified pipeline for both image+text ranking (Subtask A) and text-only caption ranking (Subtask B). All model variants retain frozen CLIP-style vision-language encoders and the multilingual BGE M3 encoder, training only lightweight modules: a logistic regression and LLM-based sentence-type predictor, idiom synonym substitution, distractor-aware scoring, and Borda rank fusion. Starting from a CLIP baseline (26.7% Top-1 on English dev, 6.7% on English test), adding idiom-aware paraphrasing and explicit sentence-type classification increased performance to 60.0% Top-1 on English and 60.0% Top-1 (0.822 NDCG@5) in zero-shot transfer to Portuguese. On the multilingual blind test, our systems achieved average Top-1/NDCG scores of 0.35/0.73 for Subtask A and 0.32/0.71 for Subtask B across 15 languages. Ablation results highlight idiom-aware rewriting as the main contributor to performance, while sentence-type prediction and multimodal fusion enhance robustness. These findings suggest that effective idiom disambiguation is feasible without fine-tuning large multimodal encoders.
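
Note: The abstract names Borda rank fusion as one of the lightweight modules but does not say how it is applied. The following is a minimal sketch of a generic Borda count over several best-first rankings, not the authors' implementation. The function name `borda_fuse`, the three scorer names, and the candidate ids are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def borda_fuse(rankings):
    """Fuse several rankings of the same candidates with a Borda count.

    Each ranking is a list of candidate ids ordered best-first. A candidate
    at position p in a ranking of n items receives n - 1 - p points.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - 1 - position
    # Sort by total points (descending); break ties by candidate id
    # so the result is deterministic.
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical rankings of five candidate images from three scorers
# (assumed names; the abstract does not list the fused signals).
clip_rank = ["img3", "img1", "img2", "img5", "img4"]
text_rank = ["img1", "img3", "img5", "img2", "img4"]
para_rank = ["img1", "img2", "img3", "img4", "img5"]

fused = borda_fuse([clip_rank, text_rank, para_rank])
print(fused)
```

Because a Borda count uses only rank positions, it can combine scorers whose raw similarity scores are on different scales, which is one plausible reason to prefer it when fusing frozen encoders.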

Related papers