Fugu-MT Paper Translation (Abstract): CMedTEB & CARE: Benchmarking and Enabling Efficient Chinese Medical Retrieval via Asymmetric Encoders

Paper overview: CMedTEB & CARE: Benchmarking and Enabling Efficient Chinese Medical Retrieval via Asymmetric Encoders

arxiv url: http://arxiv.org/abs/2604.10937v1
Date: Mon, 13 Apr 2026 03:14:31 GMT
Status: Translation complete
Last updated in system: 2026-04-14 20:13:16.288214
Title: CMedTEB & CARE: Benchmarking and Enabling Efficient Chinese Medical Retrieval via Asymmetric Encoders
Title (reference translation): CMedTEB & CARE: Benchmarking and Evaluating Efficient Chinese Medical Retrieval with Asymmetric Encoders
Authors: Angqing Jiang, Jianlyu Chen, Zhe Fang, Yongcan Wang, Xinpeng Li, Keyu Ding, Defu Lian
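Abstract summary: This paper introduces the Chinese Medical Text Embedding Benchmark (CMedTEB), a benchmark spanning three kinds of practical embedding tasks. Distinct from purely automated datasets, CMedTEB is curated via a rigorous multi-LLM voting pipeline validated by clinical experts. The authors propose the Chinese Medical Asymmetric REtriever (CARE), an asymmetric architecture that pairs a lightweight BERT-style encoder for online query encoding with a powerful LLM-based encoder for offline document encoding.
Reference score (independently calculated attention level): 38.724117073180444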
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Effective medical text retrieval requires both high accuracy and low latency. While LLM-based embedding models possess powerful retrieval capabilities, their prohibitive latency and high computational cost limit their application in real-time scenarios. Furthermore, the lack of comprehensive and high-fidelity benchmarks hinders progress in Chinese medical text retrieval. In this work, we introduce the Chinese Medical Text Embedding Benchmark (CMedTEB), a benchmark spanning three kinds of practical embedding tasks: retrieval, reranking, and semantic textual similarity (STS). Distinct from purely automated datasets, CMedTEB is curated via a rigorous multi-LLM voting pipeline validated by clinical experts, ensuring gold-standard label quality while effectively mitigating annotation noise. On this foundation, we propose the Chinese Medical Asymmetric REtriever (CARE), an asymmetric architecture that pairs a lightweight BERT-style encoder for online query encoding with a powerful LLM-based encoder for offline document encoding. However, optimizing such an asymmetric retriever with two structurally different encoders presents distinctive challenges. To address this, we introduce a novel two-stage training strategy that progressively bridges the query and document representations. Extensive experiments demonstrate that CARE surpasses state-of-the-art symmetric models on CMedTEB, achieving superior retrieval performance without increasing inference latency.
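Abstract (reference translation): Effective medical text retrieval requires high accuracy and low latency. LLM-based embedding models offer strong retrieval capabilities, but their prohibitive latency and high computational cost limit their use in real-time scenarios. In addition, the lack of comprehensive, high-fidelity benchmarks hinders progress in Chinese medical text retrieval. This study introduces the Chinese Medical Text Embedding Benchmark (CMedTEB), a benchmark spanning three kinds of practical embedding tasks: retrieval, reranking, and semantic textual similarity (STS). Unlike purely automated datasets, CMedTEB is curated through a rigorous multi-LLM voting pipeline validated by clinical experts, which secures gold-standard label quality while effectively mitigating annotation noise. This study also proposes the Chinese Medical Asymmetric REtriever (CARE), an asymmetric architecture that combines a lightweight BERT-style encoder for online query encoding with a powerful LLM-based encoder for offline document encoding. Optimizing such an asymmetric retriever with two structurally different encoders, however, poses distinctive challenges. To address them, the study proposes a new two-stage training strategy that progressively bridges the query and document representations. Extensive experiments show that CARE surpasses state-of-the-art symmetric models on CMedTEB and improves retrieval performance without increasing inference latency.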
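To make the asymmetric setup in the abstract concrete, the following is a minimal sketch of the general pattern: documents are embedded offline by a heavier encoder, queries are embedded online by a lighter one, and both land in a shared vector space scored by dot product. This is not the CARE implementation. The hashed-bigram featurizer, the random projection weights, and the names `encode_document`, `encode_query`, `build_index`, and `search` are all hypothetical stand-ins chosen for illustration. Because the weights are untrained, the resulting ranking carries no meaning; aligning the two spaces is the problem the paper's two-stage training is said to address.

```python
import zlib
import numpy as np

DIM = 32        # shared embedding dimension (assumed for illustration)
BUCKETS = 512   # size of the hashed character-bigram feature space (assumed)

def featurize(text):
    """Hash character bigrams into a fixed-size count vector."""
    v = np.zeros(BUCKETS)
    for i in range(len(text) - 1):
        v[zlib.crc32(text[i:i + 2].encode("utf-8")) % BUCKETS] += 1.0
    return v

def l2_normalize(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-12)

# Random, untrained weights: placeholders for real encoder parameters.
rng = np.random.default_rng(0)
W_doc_hidden = rng.normal(size=(BUCKETS, 256))
W_doc_out = rng.normal(size=(256, DIM))
W_query = rng.normal(size=(BUCKETS, DIM))

def encode_document(text):
    """Stand-in for the heavy document encoder (run offline)."""
    hidden = np.tanh(featurize(text) @ W_doc_hidden)
    return l2_normalize(hidden @ W_doc_out)

def encode_query(text):
    """Stand-in for the lightweight query encoder (run online)."""
    return l2_normalize(featurize(text) @ W_query)

def build_index(docs):
    """Offline step: embed every document once and stack the vectors."""
    return np.stack([encode_document(d) for d in docs])

def search(query, index, k=2):
    """Online step: embed the query, then rank documents by dot product."""
    scores = index @ encode_query(query)
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]

docs = ["高血压的常见治疗方法", "糖尿病患者的饮食建议", "感冒发烧如何用药"]
index = build_index(docs)            # computed once, ahead of query time
hits = search("高血压怎么治疗", index)  # only the light encoder runs here
```

In this pattern the per-query cost depends only on the query encoder and the similarity search, which is why a heavier document encoder need not add online latency.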


List of related papers