BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models
- URL: http://arxiv.org/abs/2512.10932v1
- Date: Thu, 11 Dec 2025 18:57:05 GMT
- Title: BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models
- Authors: Shengao Wang, Wenqi Wang, Zecheng Wang, Max Whitton, Michael Wakeham, Arjun Chandra, Joey Huang, Pengyue Zhu, Helen Chen, David Li, Jeffrey Li, Shawn Li, Andrew Zagula, Amy Zhao, Andrew Zhu, Sayaka Nakamura, Yuki Yamamoto, Jerry Jun Yokono, Aaron Mueller, Bryan A. Plummer, Kate Saenko, Venkatesh Saligrama, Boqing Gong
- Abstract summary: We introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision-language modeling. The pretraining set maximizes coverage while minimizing curation of a longitudinal, infant-centric audiovisual corpus. DevCV Toolbox adapts all vision-related measures of the recently released NIH Baby Toolbox into a benchmark suite of ten multimodal tasks.
- Score: 69.84938298826121
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Early childhood developmental trajectories set a natural goal for sample-efficient pretraining of vision foundation models. We introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision-language modeling that extensively improves upon BabyVLM-V1 through a longitudinal, multifaceted pretraining set, a versatile model, and, most importantly, DevCV Toolbox for cognitive evaluation. The pretraining set maximizes coverage of a longitudinal, infant-centric audiovisual corpus while minimizing curation, yielding video-utterance, image-utterance, and multi-turn conversational data that mirror infant experiences. DevCV Toolbox adapts all vision-related measures of the recently released NIH Baby Toolbox into a benchmark suite of ten multimodal tasks, covering spatial reasoning, memory, and vocabulary understanding aligned with young children's capabilities. Experimental results show that a compact model pretrained from scratch can achieve competitive performance on DevCV Toolbox, outperforming GPT-4o on some tasks. We hope the principled, unified BabyVLM-V2 framework will accelerate research in developmentally plausible pretraining of vision foundation models.
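The abstract names three pretraining data formats (video-utterance, image-utterance, and multi-turn conversational data). As a rough, hypothetical illustration only, the Python sketch below shows one way such records could be organized; every class and field name here is an assumption made for illustration, not BabyVLM-V2's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record types for the three data formats named in the abstract.
# All names are illustrative assumptions, not BabyVLM-V2's actual format.

@dataclass
class VideoUtterance:
    video_path: str          # egocentric infant-view clip
    utterance: str           # caregiver speech transcribed for the clip
    start_s: float = 0.0     # clip start time within the recording
    end_s: float = 0.0       # clip end time within the recording

@dataclass
class ImageUtterance:
    image_path: str          # single frame sampled from the corpus
    utterance: str           # utterance temporally aligned with the frame

@dataclass
class ConversationTurn:
    speaker: str             # e.g. "child" or "caregiver"
    text: str

@dataclass
class MultiTurnConversation:
    image_path: str          # shared visual context for the exchange
    turns: List[ConversationTurn] = field(default_factory=list)

if __name__ == "__main__":
    # One example record per format, with made-up paths and text.
    print(VideoUtterance("clip_0001.mp4", "look at the ball", 12.0, 15.5))
    print(ImageUtterance("frame_0042.jpg", "that's a dog"))
    print(MultiTurnConversation("frame_0042.jpg",
                                [ConversationTurn("caregiver", "what is this?"),
                                 ConversationTurn("child", "doggy")]))
```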
Related papers
- A Training-Free Guess What Vision Language Model from Snippets to Open-Vocabulary Object Detection [16.166979262501425]
Open-Vocabulary Object Detection (OVOD) aims to develop the capability to detect anything. In this paper, a training-free Guess What Vision Language Model is proposed to form a universal understanding paradigm. Our proposed GW-VLM can achieve superior OVOD performance compared to state-of-the-art methods without any training step.
arXiv Detail & Related papers (2026-01-17T05:14:42Z) - VITAL: Vision-Encoder-centered Pre-training for LMMs in Visual Quality Assessment [88.83260031198023]
We propose a vision-encoder-centered generative pre-training pipeline and develop the VITAL-Series LMMs. We construct over 4.5M vision-language (VL) pairs, the largest VQualA training dataset to date. We employ a multi-task training workflow that simultaneously enhances the model's quantitative scoring precision.
arXiv Detail & Related papers (2025-11-22T07:55:21Z) - Rethinking Visual Intelligence: Insights from Video Pretraining [75.32388528274224]
Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems. We investigate Video Diffusion Models (VDMs) as a promising direction for bridging the gap.
arXiv Detail & Related papers (2025-10-28T14:12:11Z) - Kwai Keye-VL Technical Report [80.53170317017147]
We introduce Kwai Keye-VL, a multimodal foundation model for short-video understanding. The development of Keye-VL rests on two core pillars: a massive, high-quality dataset with a strong emphasis on video, and an innovative training recipe. To validate our approach, we conduct extensive evaluations, showing that Keye-VL achieves state-of-the-art results on public video benchmarks and remains highly competitive on general image-based tasks.
arXiv Detail & Related papers (2025-07-02T17:57:28Z) - BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning [33.64851748019174]
Human infants rapidly develop visual reasoning skills from minimal input. Recent efforts have leveraged infant-inspired datasets like SAYCam. We propose BabyVLM, a novel framework comprising comprehensive in-domain evaluation benchmarks and a synthetic training dataset.
arXiv Detail & Related papers (2025-04-13T04:17:12Z) - Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning [26.14137626882127]
Large Vision-Language Models (LVLMs) typically follow a two-stage training paradigm: pretraining and supervised fine-tuning. Preference optimization, derived from the language domain, has emerged as an effective post-training reinforcement strategy. We propose Vision-R1, a novel vision-guided R1-like reinforcement learning algorithm for LVLMs that rewards models with definitive vision feedback.
arXiv Detail & Related papers (2025-03-23T10:21:14Z) - ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models [55.07988373824348]
We study the visual generalization capabilities of three existing robotic foundation models. Our study shows that the existing models do not exhibit robustness to visual out-of-domain scenarios. We propose a gradual backbone reversal approach founded on model merging.
arXiv Detail & Related papers (2024-09-23T17:47:59Z) - Harnessing Vision-Language Pretrained Models with Temporal-Aware Adaptation for Referring Video Object Segmentation [34.37450315995176]
Current Referring Video Object Segmentation (RVOS) methods typically use vision and language models pretrained independently as backbones.
We propose a temporal-aware prompt-tuning method, which adapts pretrained representations for pixel-level prediction.
Our method performs favorably against state-of-the-art algorithms and exhibits strong generalization abilities.
arXiv Detail & Related papers (2024-05-17T08:14:22Z) - Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions [92.47566804182338]
We investigate if a strong V&L representation model can be learned through unsupervised pre-training without image-caption corpora.
In particular, we propose to conduct "mask-and-predict" pre-training on text-only and image-only corpora.
We find that such a simple approach achieves performance close to that of a model pre-trained with aligned data on four English V&L benchmarks (a minimal sketch of the masking idea appears after this list).
arXiv Detail & Related papers (2020-10-24T08:17:54Z)
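As forward-referenced from the unsupervised pre-training entry above, the following is a minimal, hypothetical PyTorch sketch of single-modality "mask-and-predict" training: random token positions are replaced with a mask id and the model is trained to recover the original tokens. The toy encoder, vocabulary size, and mask rate are assumptions for illustration, not that paper's actual setup.

```python
import torch
import torch.nn as nn

# Illustrative constants (assumptions, not the paper's values).
VOCAB_SIZE, MASK_ID, MASK_RATE = 1000, 0, 0.15

class TinyMaskedLM(nn.Module):
    """Toy BERT-style encoder with a token-prediction head."""
    def __init__(self, dim=64, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(dim, VOCAB_SIZE)

    def forward(self, ids):
        return self.head(self.encoder(self.embed(ids)))

def mask_and_predict_step(model, ids, optimizer):
    mask = torch.rand(ids.shape) < MASK_RATE      # choose positions to hide
    corrupted = ids.masked_fill(mask, MASK_ID)    # replace them with [MASK]
    logits = model(corrupted)
    # Predict the original tokens only at the masked positions.
    loss = nn.functional.cross_entropy(logits[mask], ids[mask])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    model = TinyMaskedLM()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    batch = torch.randint(1, VOCAB_SIZE, (8, 32))  # fake token ids
    print(mask_and_predict_step(model, batch, opt))
```

The same recipe can in principle be applied to an image-only corpus by swapping the token embedding for a patch embedding; the loss is still computed only at masked positions.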