Scaling Capability in Token Space: An Analysis of Large Vision Language Model
- URL: http://arxiv.org/abs/2412.18387v2
- Date: Mon, 30 Dec 2024 11:00:35 GMT
- Title: Scaling Capability in Token Space: An Analysis of Large Vision Language Model
- Authors: Tenghui Li, Guoxu Zhou, Xuyang Zhao, Qibin Zhao,
- Abstract summary: We investigate the relationship between the number of vision tokens and the performance on vision-language models.<n>We also investigate the impact of a fusion mechanism that integrates the user's question with vision tokens.
- Score: 27.59879939490807
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The scaling capability has been widely validated in neural language models with respect to the number of parameters and the size of training data. One important question is that does the scaling capability also exists similarly with respect to the number of vision tokens in large vision language Model? This study fills the gap by investigating the relationship between the number of vision tokens and the performance on vision-language models. Our theoretical analysis and empirical evaluations demonstrate that the model exhibits scalable performance \(S(N_l)\) with respect to the number of vision tokens \(N_l\), characterized by the relationship \(S(N_l) \approx (c/N_l)^{\alpha}\). Furthermore, we also investigate the impact of a fusion mechanism that integrates the user's question with vision tokens. The results reveal two key findings. First, the scaling capability remains intact with the incorporation of the fusion mechanism. Second, the fusion mechanism enhances model performance, particularly when the user's question is task-specific and relevant. The analysis, conducted on fifteen diverse benchmarks spanning a broad range of tasks and domains, validates the effectiveness of the proposed approach.
Related papers
- Causal-Inspired Multitask Learning for Video-Based Human Pose Estimation [18.826857684901118]
We introduce a causal-temporal modeling framework consisting of two stages.
The first stage endows the model with causal-temporal modeling ability by introducing two self-supervision auxiliary tasks.
In the second stage, we argue that not all feature tokens contribute equally to pose estimation.
Our method outperforms state-of-the-art methods on three large-scale benchmark datasets.
arXiv Detail & Related papers (2025-01-24T09:45:16Z) - Core Context Aware Attention for Long Context Language Modeling [50.774702091154204]
We propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-range context modeling.<n>Our CCA-Attention significantly outperforms state-of-the-art models in terms of computational efficiency and long-context modeling ability.
arXiv Detail & Related papers (2024-12-17T01:54:08Z) - Efficient Multi-modal Large Language Models via Visual Token Grouping [55.482198808206284]
High-resolution images and videos pose a barrier to their broader adoption.<n> compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs.<n>We introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments.
arXiv Detail & Related papers (2024-11-26T09:36:02Z) - VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach to reduce vision compute by leveraging redundant vision tokens "skipping layers" rather than decreasing the number of vision tokens.
Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
arXiv Detail & Related papers (2024-08-29T17:21:58Z) - Zero-Shot Embeddings Inform Learning and Forgetting with Vision-Language Encoders [6.7181844004432385]
The Inter-Intra Modal Measure (IIMM) functions as a strong predictor of performance changes with fine-tuning.
Fine-tuning on tasks with higher IIMM scores produces greater in-domain performance gains but also induces more severe out-of-domain performance degradation.
With only a single forward pass of the target data, practitioners can leverage this key insight to evaluate the degree to which a model can be expected to improve following fine-tuning.
arXiv Detail & Related papers (2024-07-22T15:35:09Z) - On Efficient Language and Vision Assistants for Visually-Situated Natural Language Understanding: What Matters in Reading and Reasoning [33.89483627891117]
Recent advancements in language and vision assistants have showcased impressive capabilities but suffer from a lack of transparency.
Open-source models handle general image tasks effectively, but face challenges with the high computational demands of complex visually-situated text understanding.
This study aims to redefine the design of vision-language models by identifying key components and creating efficient models with constrained inference costs.
arXiv Detail & Related papers (2024-06-17T17:57:30Z) - VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z) - Corpus Considerations for Annotator Modeling and Scaling [9.263562546969695]
We show that the commonly used user token model consistently outperforms more complex models.
Our findings shed light on the relationship between corpus statistics and annotator modeling performance.
arXiv Detail & Related papers (2024-04-02T22:27:24Z) - Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z) - SeiT++: Masked Token Modeling Improves Storage-efficient Training [36.95646819348317]
Recent advancements in Deep Neural Network (DNN) models have significantly improved performance across computer vision tasks.
achieving highly generalizable and high-performing vision models requires expansive datasets, resulting in significant storage requirements.
Recent breakthrough by SeiT proposed the use of Vector-Quantized (VQ) feature vectors (i.e., tokens) as network inputs for vision classification.
In this paper, we extend SeiT by integrating Masked Token Modeling (MTM) for self-supervised pre-training.
arXiv Detail & Related papers (2023-12-15T04:11:34Z) - Scaling Vision-Language Models with Sparse Mixture of Experts [128.0882767889029]
We show that mixture-of-experts (MoE) techniques can achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost.
Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute performance when scaling vision-language models.
arXiv Detail & Related papers (2023-03-13T16:00:31Z) - Task Formulation Matters When Learning Continually: A Case Study in
Visual Question Answering [58.82325933356066]
Continual learning aims to train a model incrementally on a sequence of tasks without forgetting previous knowledge.
We present a detailed study of how different settings affect performance for Visual Question Answering.
arXiv Detail & Related papers (2022-09-30T19:12:58Z) - An Empirical Investigation of Commonsense Self-Supervision with
Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z) - Generative Counterfactuals for Neural Networks via Attribute-Informed
Perturbation [51.29486247405601]
We design a framework to generate counterfactuals for raw data instances with the proposed Attribute-Informed Perturbation (AIP)
By utilizing generative models conditioned with different attributes, counterfactuals with desired labels can be obtained effectively and efficiently.
Experimental results on real-world texts and images demonstrate the effectiveness, sample quality as well as efficiency of our designed framework.
arXiv Detail & Related papers (2021-01-18T08:37:13Z) - Bayesian Sparse Factor Analysis with Kernelized Observations [67.60224656603823]
Multi-view problems can be faced with latent variable models.
High-dimensionality and non-linear issues are traditionally handled by kernel methods.
We propose merging both approaches into single model.
arXiv Detail & Related papers (2020-06-01T14:25:38Z) - Feature Importance Estimation with Self-Attention Networks [0.0]
Black-box neural network models are widely used in industry and science, yet are hard to understand and interpret.
Recently, the attention mechanism was introduced, offering insights into the inner workings of neural language models.
This paper explores the use of attention-based neural networks mechanism for estimating feature importance, as means for explaining the models learned from propositional (tabular) data.
arXiv Detail & Related papers (2020-02-11T15:15:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.