Fugu-MT Paper Translation (Summary): Large Language Models Meet Extreme Multi-label Classification: Scaling and Multi-modal Framework

Paper summary: Large Language Models Meet Extreme Multi-label Classification: Scaling and Multi-modal Framework

arxiv url: http://arxiv.org/abs/2511.13189v1
Date: Mon, 17 Nov 2025 09:52:53 GMT
Status: Translation complete
Last updated in system: 2025-11-18 14:36:25.108339
Title: Large Language Models Meet Extreme Multi-label Classification: Scaling and Multi-modal Framework
Title (reference translation): Large Language Models Capable of Extreme Multi-label Classification: Scaling and a Multi-modal Framework
Authors: Diego Ortego, Marlon Rodríguez, Mario Almagro, Kunal Dahiya, David Jiménez, Juan C. SanMiguel
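Abstract summary: Foundation models have revolutionized artificial intelligence across many domains, yet their transformative potential remains largely untapped in Extreme Multi-label Classification (XMC). This paper addresses how to effectively harness larger decoder-only models and how to exploit visual information while preserving computational efficiency. It also extends existing text-only datasets with visual metadata and makes them available for future benchmarking.
Reference score (independently computed attention score): 7.629925808881079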
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Foundation models have revolutionized artificial intelligence across numerous domains, yet their transformative potential remains largely untapped in Extreme Multi-label Classification (XMC). Queries in XMC are associated with relevant labels from extremely large label spaces, where it is critical to strike a balance between efficiency and performance. Therefore, many recent approaches efficiently pose XMC as a maximum inner product search between embeddings learned from small encoder-only transformer architectures. In this paper, we address two important aspects in XMC: how to effectively harness larger decoder-only models, and how to exploit visual information while maintaining computational efficiency. We demonstrate that both play a critical role in XMC separately and can be combined for improved performance. We show that a few billion-size decoder can deliver substantial improvements while keeping computational overhead manageable. Furthermore, our Vision-enhanced eXtreme Multi-label Learning framework (ViXML) efficiently integrates foundation vision models by pooling a single embedding per image. This limits computational growth while unlocking multi-modal capabilities. Remarkably, ViXML with small encoders outperforms text-only decoder in most cases, showing that an image is worth billions of parameters. Finally, we present an extension of existing text-only datasets to exploit visual metadata and make them available for future benchmarking. Comprehensive experiments across four public text-only datasets and their corresponding image-enhanced versions validate our proposals' effectiveness, surpassing previous state-of-the-art by up to +8.21% in P@1 on the largest dataset. ViXML's code is available at https://github.com/DiegoOrtego/vixml.
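Abstract (reference translation): Foundation models have revolutionized artificial intelligence across many domains, yet their transformative potential has gone largely unused in Extreme Multi-label Classification (XMC). Queries in XMC are associated with relevant labels from a very large label space, so balancing efficiency and performance is important. Many recent approaches therefore efficiently frame XMC as a maximum inner product search between embeddings learned from small encoder-only transformer architectures. This paper addresses two important aspects of XMC: how to effectively harness larger decoder-only models, and how to exploit visual information while maintaining computational efficiency. We show that each plays an important role in XMC on its own and that the two can be combined for better performance. We show that a decoder with a few billion parameters can deliver substantial improvements while keeping computational overhead manageable. In addition, the Vision-enhanced eXtreme Multi-label Learning framework (ViXML) efficiently integrates foundation vision models by pooling a single embedding per image. This limits the growth in computation while unlocking multi-modal capabilities. Notably, ViXML with small encoders outperforms text-only decoders in most cases, showing that an image is worth billions of parameters. Finally, we extend existing text-only datasets to exploit visual metadata and make them available for future benchmarking. Comprehensive experiments across four public text-only datasets and their corresponding image-enhanced versions validate the effectiveness of our proposals. ViXML's code is available at https://github.com/DiegoOrtego/vixml.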
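
The abstract frames XMC as a maximum inner product search between query and label embeddings and reports results in P@1. The following is a minimal sketch of that framing, not ViXML's implementation: it uses exact brute-force search in pure Python, and the toy embeddings, the relevant-label sets, and the function names `top_k_labels` and `precision_at_k` are all made up for illustration. Systems at real label-space sizes would presumably use learned embeddings and an approximate search index instead.

```python
from typing import List, Sequence, Set


def inner_product(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_labels(query: Sequence[float],
                 label_embeddings: Sequence[Sequence[float]],
                 k: int) -> List[int]:
    """Exact (brute-force) maximum inner product search over all labels."""
    scores = [inner_product(query, emb) for emb in label_embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def precision_at_k(predicted: Sequence[Sequence[int]],
                   relevant: Sequence[Set[int]],
                   k: int) -> float:
    """Mean fraction of the top-k predicted labels that are relevant."""
    hits = [len(set(p[:k]) & r) / k for p, r in zip(predicted, relevant)]
    return sum(hits) / len(hits)


# Toy data (hypothetical): 4 labels and 2 queries in a 3-dimensional space.
label_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.7, 0.7, 0.0],
]
queries = [
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.8],
]
relevant = [{0, 3}, {1}]  # ground-truth label ids per query (made up)

predictions = [top_k_labels(q, label_embeddings, k=2) for q in queries]
p_at_1 = precision_at_k(predictions, relevant, k=1)
print(predictions, p_at_1)
```

In this toy case the first query's top-ranked label is relevant and the second query's is not, so P@1 is 0.5. The cost of the brute-force scan grows linearly with the number of labels, which is why the balance between efficiency and performance matters in large label spaces.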

Related papers