Fugu-MT paper translation (abstract): Mira-Embeddings-V1: Domain-Adapted Semantic Reranking for Recruitment via LLM-Synthesized Data

Paper overview: Mira-Embeddings-V1: Domain-Adapted Semantic Reranking for Recruitment via LLM-Synthesized Data

arxiv url: http://arxiv.org/abs/2604.17738v1
Date: Mon, 20 Apr 2026 02:51:12 GMT
Status: translation complete
Last updated in system: 2026-04-21 21:52:52.667577
Title: Mira-Embeddings-V1: Domain-Adapted Semantic Reranking for Recruitment via LLM-Synthesized Data
Title (reference translation): Mira-Embeddings-V1: Domain-Adapted Semantic Reranking for Recruitment via LLM-Synthesized Data
Authors: Zhaohua Liang, Zhilin Wang, Renjie Cao, Yining Zhang
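Abstract summary: We present mira-embeddings-v1, a semantic reranking system for the recruitment domain. Starting from real JDs, we build a five-stage prompt pipeline to generate diverse positive and hard negative samples. We then apply a two-round LoRA adaptation: JD--JD contrastive training followed by JD--CV triplet alignment.
Reference score (independently computed attention score): 12.621394200451613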
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Candidate sourcing for recruiters is best viewed as a two-stage retrieval and reranking pipeline with recall as the primary objective under a limited review budget. An upstream production retriever first returns a candidate shortlist for each job description (JD), and our goal is to rerank that shortlist so that qualified candidates appear as high as possible. We present mira-embeddings-v1, a semantic reranking system for the recruitment domain that reshapes the embedding space with LLM-synthesized training data and corrects boundary confusions with a lightweight reranking head. Starting from real JDs, we build a five-stage prompt pipeline to generate diverse positive and hard negative samples that sculpt the semantic space from multiple angles. We then apply a two-round LoRA adaptation: JD--JD contrastive training followed by JD--CV triplet alignment on a heterogeneous text dataset. Importantly, these gains require no large-scale manually labeled industrial training pairs: a modest set of real JDs is expanded into supervision through LLM synthesis. Finally, a BoundaryHead MLP reranks the Top-K results to distinguish between roles that share the same title but differ in scope. On a local pool of 300 real JDs with candidates from an upstream production retriever, mira-embeddings-v1 improves Recall@50 from 68.89% (baseline) to 77.55% while lifting Precision@10 from 35.77% to 39.62%. On a supportive global pool over 44,138 candidates judged by a Qwen3-32B rubric, it achieves Recall@200 of 0.7047 versus 0.5969 for the baseline. These results show that LLM-synthesized supervision with boundary-aware reranking yields robust gains without a heavy cross-encoder.
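The abstract reports results as Recall@K and Precision@K. As a reading aid, the following is a minimal sketch of the standard definitions of these two metrics for a single JD. The function names, candidate IDs, and toy ranking are hypothetical, and the paper's exact evaluation protocol (for example, how relevance labels are assigned or how scores are averaged across JDs) is not stated in the abstract.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant candidates that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for cand in ranked[:k] if cand in relevant)
    return hits / len(relevant)


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots that are occupied by relevant candidates."""
    if k <= 0:
        return 0.0
    hits = sum(1 for cand in ranked[:k] if cand in relevant)
    return hits / k


# Hypothetical toy data: a reranked shortlist and the set of qualified candidates.
ranked = ["c3", "c1", "c7", "c2", "c9"]
relevant = {"c1", "c2", "c4"}

print(recall_at_k(ranked, relevant, 3))
print(precision_at_k(ranked, relevant, 5))
```

Under these definitions, a better reranker can raise Recall@K without changing the shortlist itself, simply by moving qualified candidates above the cutoff K.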
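Abstract (reference translation): Candidate sourcing for recruiters is best framed as a two-stage retrieval-and-reranking pipeline in which recall is the primary objective under a limited review budget. An upstream production retriever first returns a shortlist of candidates for each job description (JD). We present mira-embeddings-v1, a semantic reranking system for the recruitment domain that reshapes the embedding space with LLM-synthesized training data and corrects boundary confusions with a lightweight reranking head. Starting from real JDs, we build a five-stage prompt pipeline that generates diverse positive and hard negative samples, sculpting the semantic space from multiple angles. We then apply a two-round LoRA adaptation: JD--JD contrastive training followed by JD--CV triplet alignment. Importantly, these gains do not require large-scale manually labeled industrial training pairs. Finally, a BoundaryHead MLP reranks the Top-K results to distinguish roles that share a title but differ in scope. On a local pool of 300 real JDs with candidates from an upstream production retriever, mira-embeddings-v1 improves Recall@50 from 68.89% (baseline) to 77.55% and raises Precision@10 from 35.77% to 39.62%. On a supportive global pool of 44,138 candidates judged by a Qwen3-32B rubric, it achieves a Recall@200 of 0.7047 versus 0.5969 for the baseline. These results show that LLM-synthesized supervision combined with boundary-aware reranking yields robust gains without a heavy cross-encoder.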
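The abstract mentions JD--CV triplet alignment but does not specify the loss. As an illustration only, the sketch below shows a generic margin-based triplet loss over cosine distance, one common way to align an anchor embedding (a JD) with a positive (a matching CV) and away from a hard negative. The distance function, the margin value of 0.2, and the toy vectors are all assumptions and are not taken from the paper.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 0.2,  # assumed value, not from the paper
) -> float:
    """Zero once the negative is farther from the anchor than the positive by at least `margin`."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2-D embeddings: a JD, a matching CV, and a hard negative CV.
jd = [1.0, 0.0]
good_cv = [1.0, 0.0]
bad_cv = [0.0, 1.0]

print(triplet_loss(jd, good_cv, bad_cv))  # well separated, so no penalty
print(triplet_loss(jd, bad_cv, good_cv))  # roles swapped, so a large penalty
```

In an actual training setup the embeddings would come from the LoRA-adapted encoder and the loss would be computed on batched tensors; this plain-Python version only shows the shape of the objective.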
