Fugu-MT Paper Translation (Abstract): CiQi-Agent: Aligning Vision, Tools and Aesthetics in Multimodal Agent for Cultural Reasoning on Chinese Porcelains

Paper overview: CiQi-Agent: Aligning Vision, Tools and Aesthetics in Multimodal Agent for Cultural Reasoning on Chinese Porcelains

arxiv url: http://arxiv.org/abs/2603.28474v1
Date: Mon, 30 Mar 2026 14:13:47 GMT
Status: Translation complete
Last updated in system: 2026-03-31 23:18:45.436833
Title: CiQi-Agent: Aligning Vision, Tools and Aesthetics in Multimodal Agent for Cultural Reasoning on Chinese Porcelains
Title (reference translation): CiQi-Agent: Vision, Tools, and Aesthetics in a Multimodal Agent for Chinese Porcelain
Authors: Wenhan Wang, Zhixiang Zhou, Zhongtian Ma, Yanzhu Chen, Ziyu Lin, Hao Sheng, Pengfei Liu, Honglin Ma, Wenqi Shao, Qiaosheng Zhang, Yu Qiao
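Abstract summary: We introduce CiQi-Agent, a domain-specific Porcelain Connoisseurship Agent for intelligent analysis of antique Chinese porcelain. CiQi-Agent supports multi-image porcelain inputs and enables vision tool invocation and multimodal retrieval-augmented generation. We construct CiQi-VQA, a large-scale, expert-annotated dataset comprising 29,596 porcelain specimens, 51,553 images, and 557,940 visual question-answering pairs.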
Reference score (independently computed attention score): 57.25643971175172
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The connoisseurship of antique Chinese porcelain demands extensive historical expertise, material understanding, and aesthetic sensitivity, making it difficult for non-specialists to engage. To democratize cultural-heritage understanding and assist expert connoisseurship, we introduce CiQi-Agent, a domain-specific Porcelain Connoisseurship Agent for intelligent analysis of antique Chinese porcelain. CiQi-Agent supports multi-image porcelain inputs and enables vision tool invocation and multimodal retrieval-augmented generation, performing fine-grained connoisseurship analysis across six attributes: dynasty, reign period, kiln site, glaze color, decorative motif, and vessel shape. Beyond attribute classification, it captures subtle visual details, retrieves relevant domain knowledge, and integrates visual and textual evidence to produce coherent, explainable connoisseurship descriptions. To achieve this capability, we construct a large-scale, expert-annotated dataset CiQi-VQA, comprising 29,596 porcelain specimens, 51,553 images, and 557,940 visual question-answering pairs, and further establish a comprehensive benchmark CiQi-Bench aligned with the previously mentioned six attributes. CiQi-Agent is trained through supervised fine-tuning, reinforcement learning, and a tool-augmented reasoning framework that integrates two categories of tools: a vision tool and multimodal retrieval tools. Experimental results show that CiQi-Agent (7B) outperforms all competitive open- and closed-source models across all six attributes on CiQi-Bench, achieving on average 12.2% higher accuracy than GPT-5. The model and dataset have been released and are publicly available at https://huggingface.co/datasets/SII-Monument-Valley/CiQi-VQA.
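Abstract (reference translation): Connoisseurship of antique Chinese porcelain requires extensive historical expertise, an understanding of materials, and aesthetic sensitivity, which makes it hard for non-specialists to take part. To democratize the understanding of cultural heritage and to support expert connoisseurship, we introduce CiQi-Agent, a domain-specific Porcelain Connoisseurship Agent for the intelligent analysis of antique Chinese porcelain. CiQi-Agent supports multi-image porcelain inputs, enables vision tool invocation and multimodal retrieval-augmented generation, and performs fine-grained connoisseurship analysis across six attributes: dynasty, reign period, kiln site, glaze color, decorative motif, and vessel shape. Beyond attribute classification, it captures subtle visual details, retrieves relevant domain knowledge, and integrates visual and textual evidence to generate coherent, explainable connoisseurship descriptions. To realize this capability, we construct CiQi-VQA, a large-scale, expert-annotated dataset of 29,596 porcelain specimens, 51,553 images, and 557,940 visual question-answering pairs, and we also build CiQi-Bench, a comprehensive benchmark aligned with the six attributes above. CiQi-Agent is trained through supervised fine-tuning, reinforcement learning, and a tool-augmented reasoning framework that integrates two categories of tools: a vision tool and multimodal retrieval tools. Experiments show that CiQi-Agent (7B) outperforms all competing open-source and closed-source models on all six attributes of CiQi-Bench, with accuracy on average 12.2% higher than GPT-5. The model and dataset have been released and are publicly available at https://huggingface.co/datasets/SII-Monument-Valley/CiQi-VQA.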
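The abstract reports an accuracy gain averaged over six attributes but does not say how that average is computed. The following is a minimal sketch, assuming an unweighted mean of per-attribute accuracies. The record format, the function names, and the toy values are all hypothetical and are not taken from CiQi-Bench or the authors' evaluation code.

```python
from collections import defaultdict

# The six attributes named in the abstract.
ATTRIBUTES = [
    "dynasty", "reign period", "kiln site",
    "glaze color", "decorative motif", "vessel shape",
]


def per_attribute_accuracy(records):
    """Compute accuracy per attribute.

    `records` is an iterable of (attribute, predicted, gold) tuples.
    This format is an assumption made for illustration only.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for attribute, predicted, gold in records:
        if attribute not in ATTRIBUTES:
            raise ValueError(f"unknown attribute: {attribute}")
        total[attribute] += 1
        if predicted == gold:
            correct[attribute] += 1
    return {a: correct[a] / total[a] for a in total}


def macro_average(accuracies):
    """Unweighted mean over attributes (assumed averaging scheme)."""
    return sum(accuracies.values()) / len(accuracies)


# Toy records invented purely for illustration; not real benchmark data.
toy_records = [
    ("dynasty", "Ming", "Ming"),
    ("dynasty", "Qing", "Ming"),
    ("glaze color", "celadon", "celadon"),
    ("glaze color", "celadon", "celadon"),
]

accuracies = per_attribute_accuracy(toy_records)
print(accuracies)
print(macro_average(accuracies))
```

Under this assumption every attribute counts equally, regardless of how many questions it has. A question-weighted average would generally give a different number, so the reported 12.2% should be read against the paper's own definition.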


Related papers