Fugu-MT paper translation (abstract): Fine-tuning a vision-language model for fracture-surface morphology recognition

Paper overview: Fine-tuning a vision-language model for fracture-surface morphology recognition

arxiv url: http://arxiv.org/abs/2605.07145v1
Date: Fri, 08 May 2026 02:26:36 GMT
Status: Translation complete
Last updated in system: 2026-05-11 19:43:38.748764
Title: Fine-tuning a vision-language model for fracture-surface morphology recognition
Title (reference translation): A vision-language model for fracture-surface morphology recognition
Authors: Quanliang Liu, Jungtaek Kim, Kangwook Lee, Hyunseok Oh
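Abstract summary: An open-source vision-language model (VLM) was fine-tuned for fracture-surface image analysis using a curated dataset of 13,168 images. The resulting specialist model outperforms flagship proprietary multimodal models on a benchmark of 100 manually annotated images. The paper also discusses integrated use of the fine-tuned model with proprietary models to combine fracture-specific visual accuracy with broader multimodal reasoning for autonomous fractography.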
Reference score (independently computed attention score): 20.872357530075153
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Vision-language models (VLMs) have shown strong potential for scientific image understanding, but general-purpose models often lack the domain-specific visual knowledge required for reliable materials characterization. In this work, we fine-tuned an open-source VLM (Qwen3-VL-32B-Instruct) for fracture-surface image analysis using a curated dataset of 13,168 open-source, literature-mined fracture-surface images. Morphology annotations were generated by GPT-5.2-Reasoning (high) from both the images and relevant excerpts of their source papers, and the dataset was further enriched with targeted manual collection and rotation-based augmentation. The resulting specialist model outperforms flagship proprietary multimodal models on a benchmark of 100 manually annotated images. It achieves a precision of 0.92, compared to 0.35 for the base Qwen3-VL-32B-Instruct, 0.58 for GPT-5.5-Reasoning (high), and 0.78 for Gemini 3.1 Pro-Reasoning (high). Dataset ablations show that manual collection of rare-feature images and augmentation via image rotation are both beneficial to improve recognition of less common fracture morphology features. We further discuss integrated use of the fine-tuned model with proprietary models to combine fracture-specific visual accuracy with broader multimodal reasoning for autonomous fractography. Although focused on fracture-surface images, this work demonstrates how VLMs can be adapted through targeted collection and fine-tuning on novel feature images to recognize those features and support downstream decision-making in autonomous microscopy workflows.
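Abstract (reference translation): Vision-language models (VLMs) have shown strong potential for scientific image understanding, but general-purpose models often lack the domain-specific visual knowledge needed for reliable materials characterization. In this study, an open-source VLM (Qwen3-VL-32B-Instruct) was fine-tuned for fracture-surface image analysis on 13,168 open-source fracture-surface images mined from the literature. Morphology annotations were generated by GPT-5.2-Reasoning (high) from both the images and relevant excerpts of their source papers, and the dataset was further enriched through targeted manual collection and rotation-based augmentation. The resulting specialist model outperforms flagship proprietary multimodal models on a benchmark of 100 manually annotated images. Its precision is 0.92, compared with 0.35 for the base Qwen3-VL-32B-Instruct, 0.58 for GPT-5.5-Reasoning (high), and 0.78 for Gemini 3.1 Pro-Reasoning (high). Dataset ablations show that manual collection of rare-feature images and augmentation by image rotation both help improve recognition of less common fracture morphology features. The paper further discusses integrated use of the fine-tuned model with proprietary models to combine fracture-specific visual accuracy with broader multimodal reasoning for autonomous fractography. Although the focus is on fracture-surface images, the work shows how VLMs can be adapted, through targeted collection of and fine-tuning on novel feature images, to recognize those features and support downstream decision-making in autonomous microscopy workflows.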
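
The rotation-based augmentation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which angles or tooling the authors used, so everything here is an assumption for illustration: the angle set (multiples of 90 degrees), the nested-list image representation, and the hypothetical helper names `rotate_90_clockwise` and `rotation_augment`. It also assumes a morphology label stays valid when the image is rotated.

```python
from typing import List

# Illustrative stand-in for a grayscale image: rows of pixel values.
Image = List[List[int]]


def rotate_90_clockwise(image: Image) -> Image:
    """Rotate a 2D image by 90 degrees clockwise."""
    # Reverse the row order, then transpose:
    # the bottom row of the input becomes the first column of the output.
    return [list(row) for row in zip(*image[::-1])]


def rotation_augment(image: Image) -> List[Image]:
    """Return the image at 0, 90, 180 and 270 degrees (assumed angle set)."""
    variants = [image]
    for _ in range(3):
        # Each new variant is the previous one turned a further 90 degrees.
        variants.append(rotate_90_clockwise(variants[-1]))
    return variants


if __name__ == "__main__":
    sample = [[1, 2],
              [3, 4]]
    for angle, variant in zip((0, 90, 180, 270), rotation_augment(sample)):
        print(angle, variant)
```

Under these assumptions, each source image yields four training samples that share one annotation, which is one simple way to increase the number of examples of a rare feature.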

Related papers