Fugu-MT Paper Translation (Abstract): Vision Language Models: A Survey of 26K Papers

Paper summary: Vision Language Models: A Survey of 26K Papers

arxiv url: http://arxiv.org/abs/2510.09586v1
Date: Fri, 10 Oct 2025 17:43:17 GMT
Status: Translation complete
Last updated in system: 2025-10-14 00:38:49.505629
Title: Vision Language Models: A Survey of 26K Papers
Title (reference translation): Vision Language Models: A Survey of 26K Papers
Authors: Fengming Lin
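Abstract summary: We present a transparent, reproducible measurement of research trends across 26,104 papers accepted at CVPR, ICLR, and NeurIPS from 2023 to 2025. Titles and abstracts are normalized, phrase-protected, and matched against a hand-crafted lexicon to assign up to 35 topical labels. The analysis quantifies three macro shifts: (1) a sharp rise of multimodal vision-language-LLM work that reframes classic perception as instruction following and multi-step reasoning; (2) steady expansion of generative methods, with diffusion research centered on controllability, distillation, and speed; and (3) resilient 3D and video activity.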
Reference score (independently computed attention measure): 0.20305676256390928
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present a transparent, reproducible measurement of research trends across 26,104 accepted papers from CVPR, ICLR, and NeurIPS spanning 2023-2025. Titles and abstracts are normalized, phrase-protected, and matched against a hand-crafted lexicon to assign up to 35 topical labels and mine fine-grained cues about tasks, architectures, training regimes, objectives, datasets, and co-mentioned modalities. The analysis quantifies three macro shifts: (1) a sharp rise of multimodal vision-language-LLM work, which increasingly reframes classic perception as instruction following and multi-step reasoning; (2) steady expansion of generative methods, with diffusion research consolidating around controllability, distillation, and speed; and (3) resilient 3D and video activity, with composition moving from NeRFs to Gaussian splatting and a growing emphasis on human- and agent-centric understanding. Within VLMs, parameter-efficient adaptation like prompting/adapters/LoRA and lightweight vision-language bridges dominate; training practice shifts from building encoders from scratch to instruction tuning and finetuning strong backbones; contrastive objectives recede relative to cross-entropy/ranking and distillation. Cross-venue comparisons show CVPR has a stronger 3D footprint and ICLR the highest VLM share, while reliability themes such as efficiency or robustness diffuse across areas. We release the lexicon and methodology to enable auditing and extension. Limitations include lexicon recall and abstract-only scope, but the longitudinal signals are consistent across venues and years.
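Abstract (reference translation): We present a transparent, reproducible measurement of research trends across 26,104 papers accepted at CVPR, ICLR, and NeurIPS from 2023 to 2025. Titles and abstracts are normalized, phrase-protected, and matched against a hand-crafted lexicon to assign up to 35 topical labels and to extract fine-grained cues about tasks, architectures, training regimes, objectives, datasets, and co-mentioned modalities. The analysis quantifies three macro shifts: (1) a sharp rise of multimodal vision-language-LLM work that increasingly reframes classic perception as instruction following and multi-step reasoning; (2) steady expansion of generative methods, with diffusion research consolidating around controllability, distillation, and speed; and (3) resilient 3D and video activity, with composition moving from NeRFs to Gaussian splatting and a growing emphasis on human- and agent-centric understanding. Within VLMs, parameter-efficient adaptation such as prompting/adapters/LoRA and lightweight vision-language bridges dominate, and training practice shifts from building encoders from scratch to instruction tuning and finetuning strong backbones. CVPR has a stronger 3D footprint and ICLR the highest VLM share, while reliability themes such as efficiency and robustness diffuse across areas. The lexicon and methodology are released to enable auditing and extension. Limitations include lexicon recall and abstract-only scope, but the longitudinal signals are consistent across venues and years.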
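
The labeling step described in the abstract (normalize titles and abstracts, protect multi-word phrases, match against a hand-crafted lexicon, assign up to 35 labels) could be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' released code: the lexicon entries, the function names (`normalize`, `protect_phrases`, `assign_labels`, `label_shares`), and the exact normalization rules are all hypothetical. Only the cap of 35 labels comes from the abstract.

```python
import re
from collections import Counter

# Hypothetical lexicon: label -> trigger phrases. The paper's released
# lexicon is not reproduced here; these entries are illustrative only.
LEXICON = {
    "vlm": ["vision language", "multimodal llm", "instruction tuning"],
    "diffusion": ["diffusion model", "diffusion"],
    "3d": ["gaussian splatting", "nerf", "3d"],
}

MAX_LABELS = 35  # upper bound on labels per paper, as stated in the abstract


def normalize(text):
    """Lowercase, turn hyphens and slashes into spaces, drop other punctuation."""
    text = text.lower()
    text = re.sub(r"[-/]", " ", text)
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def protect_phrases(text, phrases):
    """Join multi-word phrases with underscores so each matches as one token."""
    for phrase in sorted(phrases, key=len, reverse=True):
        if " " in phrase:
            pattern = rf"\b{re.escape(phrase)}\b"
            text = re.sub(pattern, phrase.replace(" ", "_"), text)
    return text


def assign_labels(title, abstract):
    """Return the lexicon labels whose phrases occur in the title or abstract."""
    all_phrases = [p for phrases in LEXICON.values() for p in phrases]
    text = protect_phrases(normalize(f"{title} {abstract}"), all_phrases)
    tokens = set(text.split())
    labels = [
        label
        for label, phrases in LEXICON.items()
        if any(p.replace(" ", "_") in tokens for p in phrases)
    ]
    return labels[:MAX_LABELS]


def label_shares(papers):
    """Fraction of papers carrying each label; papers is a list of (title, abstract)."""
    counts = Counter(
        label for title, abstract in papers for label in assign_labels(title, abstract)
    )
    return {label: n / len(papers) for label, n in counts.items()}
```

Computing `label_shares` separately per venue and per year would give the kind of longitudinal and cross-venue comparison the abstract reports, although the actual aggregation used in the paper is not specified here.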

Related papers