Fugu-MT Paper Translation (Abstract): ChemVLR: Prioritizing Reasoning in Perception for Chemical Vision-Language Understanding

Paper abstract: ChemVLR: Prioritizing Reasoning in Perception for Chemical Vision-Language Understanding

arxiv url: http://arxiv.org/abs/2604.06685v1
Date: Wed, 08 Apr 2026 05:01:59 GMT
Status: Translation complete
Last updated in system: 2026-04-09 17:30:51.346079
Title: ChemVLR: Prioritizing Reasoning in Perception for Chemical Vision-Language Understanding
Title (reference translation): ChemVLR: Prioritizing Reasoning in Chemical Vision Understanding
Authors: Xuanle Zhao, Xinyuan Cai, Xiang Cheng, Xiuyi Chen, Bo Xu
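Abstract summary: This paper introduces ChemVLR, a chemical VLM designed to prioritize reasoning within the perception process. Unlike conventional chemical VLMs, ChemVLR analyzes visual inputs in a fine-grained manner. ChemVLR generates explicit and interpretable reasoning paths for complex visual chemical problems.
Reference score (independently computed attention score): 18.366771283768344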
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While Vision-Language Models (VLMs) have demonstrated significant potential in chemical visual understanding, current models are predominantly optimized for direct visual question-answering tasks. This paradigm often results in "black-box" systems that fail to utilize the inherent capability of Large Language Models (LLMs) to infer underlying reaction mechanisms. In this work, we introduce ChemVLR, a chemical VLM designed to prioritize reasoning within the perception process. Unlike conventional chemical VLMs, ChemVLR analyzes visual inputs in a fine-grained manner by explicitly identifying granular chemical descriptors, such as functional groups, prior to generating answers. This approach ensures the production of explicit and interpretable reasoning paths for complex visual chemical problems. To facilitate this methodology, we implement a cross-modality reverse-engineering strategy, combined with a rigorous filtering pipeline, to curate a large-scale reasoning-and-captioning dataset comprising 760k high-quality samples across molecular and reaction tasks. Furthermore, we adopt a three-stage training framework that systematically builds model perception and reasoning capacity. Experiments demonstrate that ChemVLR achieves state-of-the-art (SOTA) performance, surpassing both leading proprietary models and domain-specific open-source baselines. We also provide comprehensive ablation studies to validate our training strategy and data generation designs. Code and model weights will be available at https://github.com/xxlllz/ChemVLR.
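Abstract (reference translation): Vision-Language Models (VLMs) have shown great potential in chemical visual understanding, but current models are mainly optimized for direct visual question-answering tasks. This paradigm often yields "black-box" systems that fail to use the inherent ability of Large Language Models (LLMs) to infer the underlying reaction mechanisms. This paper introduces ChemVLR, a chemical VLM designed to prioritize reasoning within the perception process. Unlike conventional chemical VLMs, ChemVLR analyzes visual inputs in a fine-grained manner by explicitly identifying granular chemical descriptors, such as functional groups, before generating an answer. This approach ensures explicit and interpretable reasoning paths for complex visual chemical problems. To realize this method, the authors implement a cross-modality reverse-engineering strategy, combined with a rigorous filtering pipeline, to curate a large-scale reasoning-and-captioning dataset of 760k high-quality samples across molecular and reaction tasks. They also adopt a three-stage training framework that systematically builds the model's perception and reasoning capacity. Experiments show that ChemVLR achieves state-of-the-art (SOTA) performance, surpassing both leading proprietary models and domain-specific open-source baselines. Comprehensive ablation studies are also provided to validate the training strategy and data generation designs. Code and model weights will be available at https://github.com/xxlllz/ChemVLR.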
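
The abstract describes two ideas that a small sketch may make more concrete: identifying chemical descriptors (such as functional groups) before producing an answer, and filtering generated samples before they enter the dataset. The Python below is only an illustration under stated assumptions. The `ReasoningSample` layout, the `format_target` output format, and the `keep_sample` rule are all hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass, field


@dataclass
class ReasoningSample:
    """Hypothetical record layout; the paper's actual schema is not given here."""
    image_path: str
    question: str
    descriptors: list[str] = field(default_factory=list)  # e.g. functional groups
    reasoning: str = ""
    answer: str = ""


def format_target(sample: ReasoningSample) -> str:
    """Build a text target that lists descriptors first and the answer last."""
    lines = [
        "Descriptors: " + ", ".join(sample.descriptors),
        "Reasoning: " + sample.reasoning,
        "Answer: " + sample.answer,
    ]
    return "\n".join(lines)


def keep_sample(sample: ReasoningSample) -> bool:
    """Toy filter (an assumption, not the paper's pipeline).

    Keeps a sample only if it has at least one descriptor, a non-empty
    answer, and a reasoning text that mentions every listed descriptor.
    """
    if not sample.descriptors or not sample.answer.strip():
        return False
    text = sample.reasoning.lower()
    return all(d.lower() in text for d in sample.descriptors)
```

A sample whose reasoning never mentions its own descriptors would be dropped by `keep_sample`, which loosely mirrors the idea that the answer should be grounded in what was perceived. The real filtering criteria are described only as "rigorous" in the abstract.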
