Fugu-MT Paper Translation (Summary): A Cross-Model VLM-Judge Protocol for Single-Image 3D Mesh Quality (and Why Cheap Proxies Fall Short)

Paper summary: A Cross-Model VLM-Judge Protocol for Single-Image 3D Mesh Quality (and Why Cheap Proxies Fall Short)

arxiv url: http://arxiv.org/abs/2606.18451v1
Date: Tue, 16 Jun 2026 20:00:12 GMT
Status: Translation complete
System update date: 2026-06-18 17:16:50.879341
Title: A Cross-Model VLM-Judge Protocol for Single-Image 3D Mesh Quality (and Why Cheap Proxies Fall Short)
Authors: Ali Asaria, Tony Salomone, Deep Gandhi
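Abstract summary: Single-image-to-3D generators are improving quickly. There is no agreed, human-free way to tell whether one generated mesh is better than another. We propose and validate a reproducible VLM-judge evaluation protocol.
Reference score (independently computed attention score): 0.08599681538174887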
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Single-image-to-3D generators are improving quickly, but there is no agreed, human-free way to tell whether one generated mesh is better than another. Practitioners commonly rely on cheap automatic proxies (render-space CLIP similarity and mesh geometry-validity statistics), yet how well these track perceived quality is unestablished. We make two contributions. First, we propose and validate a reproducible VLM-judge evaluation protocol: a fixed 24-view headless render rig, two independent vision-language judge families, and a mandatory position-bias correction that queries both presentation orders and keeps only order-consistent verdicts. The two judge families agree substantially with each other (Cohen's kappa = 0.66), well above the chance-agreement floor. Second, using this protocol as the reference, we show the cheap proxies do not substitute for it. Geometry validity is only a weak signal on average (because, as we show, it is bimodal) and stays below our pre-registered target, while render-CLIP is at chance. A learned Bradley-Terry head collapses onto a single manifoldness statistic (giving render-CLIP a negative weight) and matches geometry-only exactly, so learning the feature weights buys nothing. The proxy is also bimodal: it is significantly above chance on contrasts with visible geometric defects but at chance on ambiguous contrasts, consistent with geometry validity tracking the judge only when the defect is visually salient. We therefore recommend the VLM-judge protocol as a reliable, reproducible evaluator under the conditions tested (two feed-forward generators on Google Scanned Objects, with a face-drop degradation regime) and advise against geometry/CLIP proxies as optimization targets.
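The two mechanical pieces of the protocol named in the abstract, the position-bias correction and the inter-judge agreement statistic, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The judge interface (a callable that receives two items in presentation order and answers "first" or "second") and the names `order_consistent_verdict` and `cohens_kappa` are assumptions made for illustration, and the sketch leaves out the 24-view rendering and the actual VLM queries.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence

# Hypothetical judge interface (an assumption, not from the paper):
# takes (first_item, second_item) and returns "first" or "second".
Judge = Callable[[object, object], str]


def order_consistent_verdict(judge: Judge, mesh_a: object, mesh_b: object) -> Optional[str]:
    """Query both presentation orders and keep the verdict only if they agree.

    Returns "A" or "B" when the judge picks the same mesh in both orders,
    and None when the two answers conflict (the pair is discarded).
    """
    answer_ab = judge(mesh_a, mesh_b)  # A shown first
    answer_ba = judge(mesh_b, mesh_a)  # B shown first
    winner_ab = "A" if answer_ab == "first" else "B"
    winner_ba = "B" if answer_ba == "first" else "A"
    return winner_ab if winner_ab == winner_ba else None


def cohens_kappa(labels_1: Sequence[Hashable], labels_2: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters on the same items."""
    if len(labels_1) != len(labels_2) or len(labels_1) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_1)
    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    counts_1, counts_2 = Counter(labels_1), Counter(labels_2)
    expected = sum(counts_1[k] * counts_2[k] for k in counts_1.keys() | counts_2.keys()) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single identical label")
    return (observed - expected) / (1.0 - expected)


# Toy judges: one always prefers whatever is shown first (pure position bias),
# the other prefers the item with the larger numeric "quality" value.
position_biased: Judge = lambda first, second: "first"
quality_based: Judge = lambda first, second: "first" if first >= second else "second"

print(order_consistent_verdict(position_biased, 0.9, 0.2))  # None (discarded)
print(order_consistent_verdict(quality_based, 0.9, 0.2))    # A
print(cohens_kappa(["A", "A", "B", "B"], ["A", "A", "B", "A"]))  # 0.5
```

In this toy setup the purely position-biased judge never yields a kept verdict, which is the behavior the order-consistency filter is meant to enforce. Kappa would then be computed over the verdicts that both judge families retain.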
