Fugu-MT Paper Translation (Abstract): Text2Token: Unsupervised Text Representation Learning with Token Target Prediction

Paper overview: Text2Token: Unsupervised Text Representation Learning with Token Target Prediction

arxiv url: http://arxiv.org/abs/2510.10224v1
Date: Sat, 11 Oct 2025 14:00:45 GMT
Status: Translation complete
Last updated in system: 2025-10-14 18:06:29.849048
Title: Text2Token: Unsupervised Text Representation Learning with Token Target Prediction
Title (reference translation): Text2Token: Unsupervised text representation learning via token target prediction
Authors: Ruize An, Richong Zhang, Zhijie Nie, Zhanyu Wu, Yanzhao Zhang, Dingkun Long,
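Abstract summary: Unsupervised text representation learning (TRL) is useful for improving search and recommendation with the web's unlabeled text. A recent empirical study found that high-quality representations align with the key tokens of the input text. The paper develops Text2Token, an unsupervised generative framework for TRL.
Reference score (independently computed attention score): 33.981873901056765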
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Unsupervised text representation learning (TRL) is a fundamental task in natural language processing, which is beneficial for improving search and recommendations with the web's unlabeled texts. A recent empirical study finds that the high-quality representation aligns with the key token of the input text, uncovering the potential connection between representation space and vocabulary space. Inspired by the findings, we revisit the generative tasks and develop an unsupervised generative framework for TRL, Text2Token. The framework is based on the token target prediction task, utilizing carefully constructed target token distribution as supervisory signals. To construct the high-quality target token distribution, we analyze the token-alignment properties with advanced embedders and identify two essential categories of key tokens: (1) the meaningful tokens in the text and (2) semantically derived tokens beyond the text. Based on these insights, we propose two methods -- data-driven and model-derived -- to construct synthetic token targets from data or the LLM backbone. Experiments on the MTEB v2 benchmark demonstrate that Text2Token achieves performance competitive with the state-of-the-art embedder with unsupervised contrastive learning, LLM2Vec. Our analysis further shows that vocabulary and representation spaces optimize together and toward the optimum solution during training, providing new ideas and insights for future work.
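Abstract (reference translation): Unsupervised text representation learning (TRL) is a fundamental task in natural language processing and is useful for improving search and recommendation with the web's unlabeled text. A recent empirical study shows that high-quality representations align with the key tokens of the input text, revealing a potential connection between the representation space and the vocabulary space. Motivated by this finding, we revisit generative tasks and develop Text2Token, an unsupervised generative framework for TRL. The framework is built on the token target prediction task and uses a carefully constructed target token distribution as the supervisory signal. To construct a high-quality target token distribution, we analyze token-alignment properties using advanced embedders and identify two essential categories of key tokens: (1) meaningful tokens in the text and (2) semantically derived tokens beyond the text. Based on these findings, we propose two methods, data-driven and model-derived, for constructing synthetic token targets from data or from the LLM backbone. Experiments on the MTEB v2 benchmark show that Text2Token is competitive with LLM2Vec, the state-of-the-art embedder trained with unsupervised contrastive learning. Our analysis further shows that the vocabulary space and the representation space are optimized together, toward the optimal solution, during training, offering new ideas and insights for future work.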
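
As a rough illustration of what a token target prediction objective could look like, the sketch below builds a target token distribution from a text's own token ids and scores a text embedding against it with cross-entropy. It is a minimal sketch under stated assumptions, not the authors' implementation: the count-based target, the stop-token filter, the function names `target_distribution` and `token_prediction_loss`, and the random `output_matrix` standing in for an LLM output projection are all hypothetical choices made for this example.

```python
import numpy as np


def target_distribution(token_ids, vocab_size, stop_ids=frozenset()):
    """Hypothetical data-driven target: normalized counts of the text's tokens.

    Tokens listed in stop_ids are ignored (an assumption standing in for
    "meaningful tokens in the text").
    """
    counts = np.zeros(vocab_size)
    for t in token_ids:
        if t not in stop_ids:
            counts[t] += 1.0
    total = counts.sum()
    if total == 0:
        # No usable tokens: fall back to a uniform target.
        return np.full(vocab_size, 1.0 / vocab_size)
    return counts / total


def token_prediction_loss(embedding, output_matrix, target):
    """Cross-entropy between the target and softmax(output_matrix @ embedding)."""
    logits = output_matrix @ embedding
    logits = logits - logits.max()  # shift for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-(target * log_probs).sum())


# Toy usage with made-up sizes and random weights (all values are assumptions).
rng = np.random.default_rng(0)
vocab_size, dim = 8, 4
W = rng.normal(size=(vocab_size, dim))   # stand-in for an output projection
h = rng.normal(size=dim)                 # stand-in for a text embedding
q = target_distribution([2, 5, 5, 7], vocab_size, stop_ids={7})
loss = token_prediction_loss(h, W, q)
```

Minimizing such a loss with respect to the embedding would push the embedding's projection onto the vocabulary toward the chosen key tokens, which is the general idea the abstract describes. How the paper actually constructs its data-driven and model-derived targets is not specified here.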

Related papers