On Data Synthesis and Post-training for Visual Abstract Reasoning
- URL: http://arxiv.org/abs/2504.01324v1
- Date: Wed, 02 Apr 2025 03:18:24 GMT
- Title: On Data Synthesis and Post-training for Visual Abstract Reasoning
- Authors: Ke Zhu, Yu Wang, Jiangjiang Liu, Qunyi Xie, Shanshan Liu, Gang Zhang,
- Abstract summary: We make a common LLaVA-NeXT 7B model capable of perceiving and reasoning about specific AVR problems. This is a great breakthrough, since almost all previous VLMs fail or show nearly random performance on representative AVR benchmarks.
- Score: 15.055924556135857
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper is a pioneering work attempting to address abstract visual reasoning (AVR) problems for large vision-language models (VLMs). We make a common LLaVA-NeXT 7B model capable of perceiving and reasoning about specific AVR problems, surpassing both open-sourced (e.g., Qwen2-VL-72B) and closed-sourced powerful VLMs (e.g., GPT-4o) by a significant margin. This is a great breakthrough, since almost all previous VLMs fail or show nearly random performance on representative AVR benchmarks. Our key success lies in an innovative data synthesis and post-training process that gradually reduces the task difficulty and guides the model to learn step by step. Our 7B model is also shown to behave well on AVR without sacrificing common multimodal comprehension abilities. We hope this paper can serve as an early effort in this area and inspire further research in abstract visual reasoning.
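To make the "data synthesis" idea concrete, below is a minimal, hypothetical sketch of how Raven-style AVR items can be generated programmatically (a toy attribute-progression rule plus distractor choices). It is not the paper's actual synthesis pipeline; the rule, sizes, and file names are assumptions for illustration only.

```python
# Toy, hypothetical generator for Raven-style AVR items: a 2x2 grid whose
# dot count follows a simple progression rule, plus distractor choices.
# This is NOT the paper's synthesis pipeline.
import random
from PIL import Image, ImageDraw

PANEL = 96  # panel side length in pixels (assumed)

def draw_panel(n_dots: int) -> Image.Image:
    """Render a white panel containing n_dots filled circles in a row."""
    img = Image.new("RGB", (PANEL, PANEL), "white")
    d = ImageDraw.Draw(img)
    for i in range(n_dots):
        x = 6 + i * 12
        d.ellipse([x, 44, x + 8, 52], fill="black")
    return img

def make_item(seed: int):
    """One item: three context panels, the correct answer, two distractors.

    Rule: the dot count increases by one across the 2x2 grid, so the
    missing fourth panel is fully determined by the progression.
    """
    rng = random.Random(seed)
    start = rng.randint(1, 2)
    counts = [start + i for i in range(4)]
    context = [draw_panel(c) for c in counts[:3]]   # visible panels
    answer = draw_panel(counts[3])                  # correct completion
    distractors = [draw_panel(counts[3] + d) for d in (1, 2)]
    return context, answer, distractors

context, answer, distractors = make_item(seed=0)
answer.save("answer_panel.png")
```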
Related papers
- CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [89.44024245194315]
We introduce CoT-VLA, a state-of-the-art 7B vision-language-action model (VLA) that incorporates explicit visual chain-of-thought (CoT) reasoning and can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks.
arXiv Detail & Related papers (2025-03-27T22:23:04Z) - LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning [76.82159851648711]
We propose a framework that dynamically improves the embedding model's representation learning for negative pairs.
LLaVE establishes stronger baselines that achieve state-of-the-art (SOTA) performance.
LLaVE can generalize to text-video retrieval tasks in a zero-shot manner and achieve strong performance.
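As a generic illustration of the hardness-weighted contrastive learning named in the LLaVE entry above (not LLaVE's exact objective; the weighting scheme and hyperparameters are assumptions), an InfoNCE-style loss can up-weight harder in-batch negatives so they contribute more to the gradient:

```python
# Generic sketch of a hardness-weighted contrastive (InfoNCE-style) loss.
# Illustrative only; not LLaVE's actual training objective.
import torch
import torch.nn.functional as F

def hardness_weighted_info_nce(q, k, tau=0.07, beta=1.0):
    """q, k: (B, D) L2-normalized embeddings of matched pairs.

    Negatives are the off-diagonal in-batch pairs; each negative is
    re-weighted by a softmax over its similarity to the query, so harder
    (more similar) negatives dominate the denominator.
    """
    sim = q @ k.t() / tau                       # (B, B) similarity logits
    B = sim.size(0)
    pos = sim.diag()                            # positive logits
    diag = torch.eye(B, dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(diag, float("-inf"))  # mask positives out
    weights = torch.softmax(beta * neg, dim=1)  # hardness weights per row
    weighted_neg = torch.logsumexp(neg + weights.log(), dim=1)
    loss = -pos + torch.logaddexp(pos, weighted_neg)
    return loss.mean()

q = F.normalize(torch.randn(8, 256), dim=1)
k = F.normalize(torch.randn(8, 256), dim=1)
print(hardness_weighted_info_nce(q, k).item())
```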
arXiv Detail & Related papers (2025-03-04T10:21:57Z) - HAMSTER: Hierarchical Action Models For Open-World Robot Manipulation [54.03004125910057]
We show that hierarchical vision-language-action models can be more effective in utilizing off-domain data than standard monolithic VLA models.
We show that, with the hierarchical design, the high-level VLM can transfer across significant domain gaps between the off-domain finetuning data and real-robot testing scenarios.
arXiv Detail & Related papers (2025-02-08T07:50:22Z) - CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation [100.25567121604382]
Vision-Language-Action (VLA) models have improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios.
We present a new advanced VLA architecture derived from Vision-Language-Models (VLM)
We show that our model not only significantly surpasses existing VLAs in task performance but also exhibits remarkable adaptation to new robots and generalization to unseen objects and backgrounds.
arXiv Detail & Related papers (2024-11-29T12:06:03Z) - ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models [55.07988373824348]
We study the visual generalization capabilities of three existing robotic foundation models. Our study shows that the existing models do not exhibit robustness to visual out-of-domain scenarios. We propose a gradual backbone reversal approach founded on model merging.
arXiv Detail & Related papers (2024-09-23T17:47:59Z) - TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation [32.406783380729024]
Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor control and instruction comprehension through end-to-end learning processes.
Current VLA models face significant challenges: they are slow during inference and require extensive pre-training on large amounts of robotic data.
We introduce a new family of compact vision-language-action models, called TinyVLA, which offers two key advantages over existing VLA models.
arXiv Detail & Related papers (2024-09-19T07:10:18Z) - A Unified View of Abstract Visual Reasoning Problems [0.0]
We introduce a unified view of AVR tasks, where each instance is rendered as a single image with no a priori assumptions about the number of panels, their location, or role.
The main advantage of the proposed unified view is the ability to develop universal learning models applicable to various tasks.
Experiments conducted on four datasets with Raven's Progressive Matrices and Visual Analogy Problems show that the proposed unified representation of tasks poses a challenge to state-of-the-art Deep Learning (DL) models and, more broadly, contemporary DL image recognition methods.
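A hypothetical sketch of the single-image unified rendering described above (not the paper's renderer; panel sizes and layout are assumptions): panels from any AVR task are tiled onto one canvas, so a generic image model can consume the instance without task-specific structure.

```python
# Hypothetical sketch: render an AVR instance as one image, regardless of
# panel count or layout (not the paper's actual renderer).
import math
from PIL import Image

def render_unified(panels, panel_size=96, pad=4, bg="white"):
    """Tile a variable-length list of PIL panels into a single canvas."""
    n = len(panels)
    cols = math.ceil(math.sqrt(n))
    rows = math.ceil(n / cols)
    W = cols * panel_size + (cols + 1) * pad
    H = rows * panel_size + (rows + 1) * pad
    canvas = Image.new("RGB", (W, H), bg)
    for i, p in enumerate(panels):
        r, c = divmod(i, cols)
        x = pad + c * (panel_size + pad)
        y = pad + r * (panel_size + pad)
        canvas.paste(p.resize((panel_size, panel_size)), (x, y))
    return canvas

# e.g. a 3x3 Raven's matrix context (8 panels) or a visual analogy both
# reduce to one image of the same kind.
panels = [Image.new("RGB", (64, 64), "gray") for _ in range(8)]
render_unified(panels).save("unified_instance.png")
```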
arXiv Detail & Related papers (2024-06-16T20:52:44Z) - OpenVLA: An Open-Source Vision-Language-Action Model [131.74098076670103]
We introduce OpenVLA, an open-source VLA trained on a diverse collection of 970k real-world robot demonstrations.
OpenVLA shows strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate.
We release model checkpoints, fine-tuning notebooks, and our PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.
arXiv Detail & Related papers (2024-06-13T15:46:55Z) - One Self-Configurable Model to Solve Many Abstract Visual Reasoning Problems [0.0]
We propose a unified model for solving Single-Choice Abstract visual Reasoning (SCAR) tasks.
The proposed model relies on SCAR-Aware dynamic Layer (SAL), which adapts its weights to the structure of the problem.
Experiments show that SAL-based models, in general, effectively solve diverse tasks, and their performance is on par with state-of-the-art task-specific baselines.
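The weight-sharing aspect of such a single-choice setup can be illustrated with a generic sketch (this is not the SAL layer itself; shapes and the scoring head are assumptions): a shared scorer is applied to every candidate completion, so the same parameters handle any number of answer panels.

```python
# Generic sketch of a structure-agnostic scorer for single-choice AVR:
# one shared head scores each candidate completion against the pooled
# context, so any number of choices K is supported. Not the SAL layer.
import torch
import torch.nn as nn

class SharedChoiceScorer(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, context_emb, choice_embs):
        """context_emb: (B, D) pooled context panels; choice_embs: (B, K, D)."""
        B, K, D = choice_embs.shape
        ctx = context_emb.unsqueeze(1).expand(B, K, D)
        pair = torch.cat([ctx, choice_embs], dim=-1)  # (B, K, 2D)
        return self.score(pair).squeeze(-1)           # (B, K) choice logits

scorer = SharedChoiceScorer()
logits = scorer(torch.randn(2, 128), torch.randn(2, 5, 128))  # any K works
```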
arXiv Detail & Related papers (2023-12-15T18:15:20Z)