Fugu-MT Paper Translation (Abstract): CodePercept: Code-Grounded Visual STEM Perception for MLLMs

Paper overview: CodePercept: Code-Grounded Visual STEM Perception for MLLMs

arxiv url: http://arxiv.org/abs/2603.10757v1
Date: Wed, 11 Mar 2026 13:32:58 GMT
Status: Translation complete
Last updated in system: 2026-03-12 16:22:32.965768
Title: CodePercept: Code-Grounded Visual STEM Perception for MLLMs
Authors: Tongkun Guan, Zhibo Yang, Jianqiang Wan, Mingkun Yang, Zhengtao Guo, Zijian Hu, Ruilin Luo, Ruize Chen, Songtao Jiang, Peng Wang, Wei Shen, Junyang Lin, Xiaokang Yang
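Abstract summary: This work focuses on systematically enhancing the perception capabilities of MLLMs by establishing code as a powerful perceptual medium. Specifically, the authors construct ICC-1M, a large-scale dataset comprising 1M Image-Caption-Code triplets. They further introduce STEM2Code-Eval, a novel benchmark that directly evaluates visual perception in STEM domains.
Reference score (site-computed attention score): 53.60065070334941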
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: When MLLMs fail at Science, Technology, Engineering, and Mathematics (STEM) visual reasoning, a fundamental question arises: is it due to perceptual deficiencies or reasoning limitations? Through systematic scaling analysis that independently scales perception and reasoning components, we uncover a critical insight: scaling perception consistently outperforms scaling reasoning. This reveals perception as the true lever limiting current STEM visual reasoning. Motivated by this insight, our work focuses on systematically enhancing the perception capabilities of MLLMs by establishing code as a powerful perceptual medium--executable code provides precise semantics that naturally align with the structured nature of STEM visuals. Specifically, we construct ICC-1M, a large-scale dataset comprising 1M Image-Caption-Code triplets that materializes this code-as-perception paradigm through two complementary approaches: (1) Code-Grounded Caption Generation treats executable code as ground truth for image captions, eliminating the hallucinations inherent in existing knowledge distillation methods; (2) STEM Image-to-Code Translation prompts models to generate reconstruction code, mitigating the ambiguity of natural language for perception enhancement. To validate this paradigm, we further introduce STEM2Code-Eval, a novel benchmark that directly evaluates visual perception in STEM domains. Unlike existing work relying on problem-solving accuracy as a proxy that only measures problem-relevant understanding, our benchmark requires comprehensive visual comprehension through executable code generation for image reconstruction, providing deterministic and verifiable assessment. Code is available at https://github.com/TongkunGuan/Qwen-CodePercept.

Related papers