Fugu-MT Paper Translation (Abstract): MOON: Generative MLLM-based Multimodal Representation Learning for E-commerce Product Understanding

Paper overview: MOON: Generative MLLM-based Multimodal Representation Learning for E-commerce Product Understanding

arxiv url: http://arxiv.org/abs/2508.11999v1
Date: Sat, 16 Aug 2025 09:59:25 GMT
Status: Translation complete
System update date: 2025-08-19 14:49:10.507393
Title: MOON: Generative MLLM-based Multimodal Representation Learning for E-commerce Product Understanding
Title (reference translation): MOON: MLLM-based Multimodal Representation Learning for E-commerce Product Understanding
Authors: Daoze Zhang, Zhanheng Nie, Jianyu Liu, Chenghan Fu, Wanxian Guan, Yuan Gao, Jun Song, Pengjie Wang, Jian Xu, Bo Zheng
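Abstract summary: Generative multimodal large language models hold significant potential for improving product representation learning. We propose MOON, the first generative MLLM-based model for product representation learning. The method employs a Mixture-of-Experts (MoE) module for targeted modeling of multimodal and aspect-specific product content.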
Reference score (independently computed attention score): 19.89836326556511
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the rapid advancement of e-commerce, exploring general representations rather than task-specific ones has attracted increasing research attention. For product understanding, although existing discriminative dual-flow architectures drive progress in this field, they inherently struggle to model the many-to-one alignment between multiple images and texts of products. Therefore, we argue that generative Multimodal Large Language Models (MLLMs) hold significant potential for improving product representation learning. Nevertheless, achieving this goal still remains non-trivial due to several key challenges: the lack of multimodal and aspect-aware modeling modules in typical LLMs; the common presence of background noise in product images; and the absence of a standard benchmark for evaluation. To address these issues, we propose the first generative MLLM-based model named MOON for product representation learning. Our method (1) employs a guided Mixture-of-Experts (MoE) module for targeted modeling of multimodal and aspect-specific product content; (2) effectively detects core semantic regions in product images to mitigate the distraction and interference caused by background noise; and (3) introduces the specialized negative sampling strategy to increase the difficulty and diversity of negative samples. In addition, we release a large-scale multimodal benchmark MBE for various product understanding tasks. Experimentally, our model demonstrates competitive zero-shot performance on both our benchmark and the public dataset, showcasing strong generalization across various downstream tasks, including cross-modal retrieval, product classification, and attribute prediction. Furthermore, the case study and visualization illustrate the effectiveness of MOON for product understanding.
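Abstract (reference translation): With the rapid advancement of e-commerce, exploring general representations rather than task-specific ones has attracted increasing research attention. For product understanding, existing discriminative dual-flow architectures have driven progress in this field, but they inherently struggle to model the many-to-one alignment between a product's multiple images and texts. Generative multimodal large language models (MLLMs) therefore hold significant potential for improving product representation learning. Achieving this goal nevertheless remains non-trivial because of several key challenges: typical LLMs lack multimodal and aspect-aware modeling modules, product images commonly contain background noise, and there is no standard benchmark for evaluation. To address these issues, we propose MOON, the first generative MLLM-based model for product representation learning. The method employs a guided Mixture-of-Experts (MoE) module for targeted modeling of multimodal and aspect-specific product content, detects core semantic regions in product images to mitigate distraction and interference from background noise, and introduces a specialized negative sampling strategy to increase the difficulty and diversity of negative samples. In addition, we release MBE, a large-scale multimodal benchmark for various product understanding tasks. Experiments show competitive zero-shot performance on both our benchmark and the public dataset, with strong generalization across downstream tasks including cross-modal retrieval, product classification, and attribute prediction. A case study and visualization further illustrate the effectiveness of MOON for product understanding.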

List of related papers