OtterHD: A High-Resolution Multi-modality Model
- URL: http://arxiv.org/abs/2311.04219v1
- Date: Tue, 7 Nov 2023 18:59:58 GMT
- Title: OtterHD: A High-Resolution Multi-modality Model
- Authors: Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, Ziwei
Liu
- Abstract summary: OtterHD-8B is an innovative multimodal model engineered to interpret high-resolution visual inputs with granular precision.
Our study highlights the critical role of flexibility and high-resolution input capabilities in large multimodal models.
- Score: 57.16481886807386
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this paper, we present OtterHD-8B, an innovative multimodal model evolved
from Fuyu-8B, specifically engineered to interpret high-resolution visual
inputs with granular precision. Unlike conventional models that are constrained
by fixed-size vision encoders, OtterHD-8B boasts the ability to handle flexible
input dimensions, ensuring its versatility across various inference
requirements. Alongside this model, we introduce MagnifierBench, an evaluation
framework designed to scrutinize models' ability to discern minute details and
spatial relationships of small objects. Our comparative analysis reveals that
while current leading models falter on this benchmark, OtterHD-8B, particularly
when directly processing high-resolution inputs, outperforms its counterparts
by a substantial margin. The findings illuminate the structural variances in
visual information processing among different models and the influence that the
vision encoders' pre-training resolution disparities have on model
effectiveness within such benchmarks. Our study highlights the critical role of
flexibility and high-resolution input capabilities in large multimodal models
and also exemplifies the potential inherent in the Fuyu architecture's
simplicity for handling complex visual data.
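To make the flexible-resolution claim concrete, below is a minimal sketch, in PyTorch, of the Fuyu-style patch embedding that OtterHD-8B builds on: the image is cut into fixed-size patches and each patch is linearly projected directly into the decoder's embedding space, with no separate fixed-resolution vision encoder, so inputs of any size simply yield more or fewer visual tokens. The 30x30 patch size and 4096-dimensional embedding are illustrative assumptions rather than the released configuration, and the class name is hypothetical.
```python
import torch
import torch.nn as nn

class FlexiblePatchEmbedding(nn.Module):
    """Sketch of Fuyu-style patchification: an image of arbitrary H x W is split
    into fixed-size patches, and each patch is linearly projected straight into
    the language model's embedding space. Patch size and embed_dim are
    illustrative assumptions, not the released model configuration."""

    def __init__(self, patch_size: int = 30, embed_dim: int = 4096):
        super().__init__()
        self.patch_size = patch_size
        # One linear layer maps a flattened RGB patch to a token embedding.
        self.proj = nn.Linear(3 * patch_size * patch_size, embed_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (3, H, W); H and W are not tied to a fixed training resolution.
        c, h, w = image.shape
        p = self.patch_size
        # Pad on the right/bottom so H and W are divisible by the patch size.
        pad_h, pad_w = (-h) % p, (-w) % p
        image = nn.functional.pad(image, (0, pad_w, 0, pad_h))
        # Unfold into patches in row-major order: (3, H/p, W/p, p, p).
        patches = image.unfold(1, p, p).unfold(2, p, p)
        patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * p * p)
        # Each patch becomes one "visual token" for the decoder-only LM.
        return self.proj(patches)  # (num_patches, embed_dim)

# Example: a 768x1024 input produces ceil(768/30) * ceil(1024/30) = 26 * 35 patches,
# so higher-resolution inputs simply produce longer visual token sequences.
embedder = FlexiblePatchEmbedding()
tokens = embedder(torch.randn(3, 768, 1024))
print(tokens.shape)  # torch.Size([910, 4096])
```
Because the visual sequence length grows with the input resolution, higher-resolution images cost additional decoder tokens instead of being resized to a fixed grid, which is the property the abstract attributes to the Fuyu architecture's simplicity.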
Related papers
- Challenging the Performance-Interpretability Trade-off: An Evaluation of Interpretable Machine Learning Models [3.3595341706248876]
Generalized additive models (GAMs) offer promising properties for capturing complex, non-linear patterns while remaining fully interpretable.
This study examines the predictive performance of seven different GAMs in comparison to seven commonly used machine learning models based on a collection of twenty benchmark datasets.
arXiv Detail & Related papers (2024-09-22T12:58:52Z)
- DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception [66.88792390480343]
We propose DEEM, a simple but effective approach that utilizes the generative feedback of diffusion models to align the semantic distributions of the image encoder.
DEEM exhibits enhanced robustness and a superior capacity to alleviate model hallucinations while utilizing fewer trainable parameters, less pre-training data, and a smaller base model size.
arXiv Detail & Related papers (2024-05-24T05:46:04Z)
- What matters when building vision-language models? [52.8539131958858]
We develop Idefics2, an efficient foundational vision-language model with 8 billion parameters.
Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks.
We release the model (base, instructed, and chat) along with the datasets created for its training.
arXiv Detail & Related papers (2024-05-03T17:00:00Z)
- Corpus Considerations for Annotator Modeling and Scaling [9.263562546969695]
We show that the commonly used user token model consistently outperforms more complex models.
Our findings shed light on the relationship between corpus statistics and annotator modeling performance.
arXiv Detail & Related papers (2024-04-02T22:27:24Z)
- Visual Analytics for Generative Transformer Models [28.251218916955125]
We present a novel visual analytical framework to support the analysis of transformer-based generative networks.
Our framework is one of the first dedicated to supporting the analysis of transformer-based encoder-decoder models.
arXiv Detail & Related papers (2023-11-21T08:15:01Z)
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
- Scaling Vision-Language Models with Sparse Mixture of Experts [128.0882767889029]
We show that mixture-of-experts (MoE) techniques can achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost.
Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute and performance when scaling vision-language models.
arXiv Detail & Related papers (2023-03-13T16:00:31Z)
- Efficient Scopeformer: Towards Scalable and Rich Feature Extraction for Intracranial Hemorrhage Detection [0.7734726150561088]
"Scopeformer" is a novel multi-CNN-ViT model for intracranial hemorrhage classification in computed tomography (CT) images.
We propose effective feature projection methods to reduce redundancies among CNN-generated features and to control the input size of ViTs.
Experiments with various Scopeformer models show that the model performance is proportional to the number of convolutional blocks employed in the feature extractor.
arXiv Detail & Related papers (2023-02-01T03:51:27Z)
- Dynamic Latent Separation for Deep Learning [67.62190501599176]
A core problem in machine learning is to learn expressive latent variables for model prediction on complex data.
Here, we develop an approach that improves expressiveness, provides partial interpretation, and is not restricted to specific applications.
arXiv Detail & Related papers (2022-10-07T17:56:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences arising from its use.