Fugu-MT Paper Translation (Abstract): Can LLM-Generated Text Empower Surgical Vision-Language Pre-training?

Paper overview: Can LLM-Generated Text Empower Surgical Vision-Language Pre-training?

arxiv url: http://arxiv.org/abs/2604.18134v1
Date: Mon, 20 Apr 2026 12:01:51 GMT
Status: Translation complete
System update date: 2026-04-21 21:52:52.844123
Title: Can LLM-Generated Text Empower Surgical Vision-Language Pre-training?
Title (reference translation): Is Surgical Vision-Language Pre-training with LLMs Possible?
Authors: Chengan Che, Chao Wang, Jiayuan Huang, Xinyue Chen, Luis C. Garcia-Peraza-Herrera
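Abstract summary: LIME is a large-scale multi-modal dataset derived from open-access surgical videos using human-free, Large Language Model (LLM)-generated narratives. SurgLIME is a parameter-efficient vision-language pre-training framework designed to learn reliable cross-modal alignments.
Reference score (independently computed attention score): 10.452511741676437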
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advancements in self-supervised learning have led to powerful surgical vision encoders capable of spatiotemporal understanding. However, extending these visual foundations to multi-modal reasoning tasks is severely bottlenecked by the prohibitive cost of expert textual annotations. To overcome this scalability limitation, we introduce LIME, a large-scale multi-modal dataset derived from open-access surgical videos using human-free, Large Language Model (LLM)-generated narratives. While LIME offers immense scalability, unverified generated texts may contain errors, including hallucinations, that could potentially lead to catastrophically degraded pre-trained medical priors in standard contrastive pipelines. To mitigate this, we propose SurgLIME, a parameter-efficient Vision-Language Pre-training (VLP) framework designed to learn reliable cross-modal alignments using noisy narratives. SurgLIME preserves foundational medical priors using a LoRA-adapted dual-encoder architecture and introduces an automated confidence estimation mechanism that dynamically down-weights uncertain text during contrastive alignment. Evaluations on the AutoLaparo and Cholec80 benchmarks show that SurgLIME achieves competitive zero-shot cross-modal alignment while preserving the robust linear probing performance of the visual foundation model. Dataset, code, and models are publicly available at https://github.com/visurg-ai/SurgLIME.
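Abstract (reference translation): Recent advances in self-supervised learning have produced powerful surgical vision encoders capable of spatiotemporal understanding. However, extending these visual foundations to multi-modal reasoning tasks is severely bottlenecked by the prohibitive cost of expert text annotations. To overcome this scalability limitation, we introduce LIME, a large-scale multi-modal dataset derived from open-access surgical videos using human-free, Large Language Model (LLM)-generated narratives. LIME offers immense scalability, but unverified generated text may contain errors, including hallucinations. To mitigate this, we propose SurgLIME, a parameter-efficient Vision-Language Pre-training (VLP) framework. SurgLIME preserves foundational medical priors using a LoRA-adapted dual-encoder architecture and introduces an automated confidence estimation mechanism that dynamically down-weights uncertain text during contrastive alignment. Evaluations on the AutoLaparo and Cholec80 benchmarks show that SurgLIME achieves competitive zero-shot cross-modal alignment while preserving the robust linear probing performance of the visual foundation model. The dataset, code, and models are publicly available at https://github.com/visurg-ai/SurgLIME.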
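
The abstract mentions down-weighting uncertain text during contrastive alignment but does not say how the confidence scores are computed or how they enter the loss. The following is a minimal sketch of one plausible reading, not the authors' implementation: a symmetric InfoNCE-style loss in which each video-text pair is scaled by a per-pair confidence in [0, 1]. The function names, the `confidence` argument, and the temperature value are all assumptions made for illustration; the confidence scores are simply taken as given.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def weighted_contrastive_loss(video_emb, text_emb, confidence, temperature=0.07):
    """Hypothetical confidence-weighted symmetric contrastive loss.

    video_emb[i] and text_emb[i] form a matched pair; confidence[i] is an
    assumed per-pair reliability score. Low-confidence pairs contribute
    less to the averaged loss.
    """
    v = [normalize(x) for x in video_emb]
    t = [normalize(x) for x in text_emb]
    n = len(v)
    # Cosine-similarity logits: rows index videos, columns index texts.
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-video view
    total_weight = sum(confidence)
    if total_weight == 0:
        return 0.0
    loss = 0.0
    for i in range(n):
        video_to_text = -log_softmax(logits[i])[i]
        text_to_video = -log_softmax(logits_t[i])[i]
        loss += confidence[i] * 0.5 * (video_to_text + text_to_video)
    # Normalize by the total weight so the scale stays comparable.
    return loss / total_weight


# Toy batch: the third narrative contradicts its video (a stand-in for a
# hallucinated caption).
videos = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
uniform = weighted_contrastive_loss(videos, texts, [1.0, 1.0, 1.0])
down_weighted = weighted_contrastive_loss(videos, texts, [1.0, 1.0, 0.0])
```

In this toy batch, giving the contradictory pair zero confidence removes its large loss term, so `down_weighted` comes out smaller than `uniform`. Dividing by the total weight rather than the batch size is a design choice of this sketch; it keeps the loss scale stable when many pairs are down-weighted.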

Related papers