Related papers: H2OVL-Mississippi Vision Language Models Technical Report

URL: http://arxiv.org/abs/2410.13611v1
Date: Thu, 17 Oct 2024 14:46:34 GMT
Title: H2OVL-Mississippi Vision Language Models Technical Report
Authors: Shaikat Galib, Shanshan Wang, Guanshuo Xu, Pascal Pfeiffer, Ryan Chesler, Mark Landry, Sri Satish Ambati
Abstract summary: We present H2OVL-Mississippi, a pair of small vision-language models trained on 37 million image-text pairs. H2OVL-Mississippi-0.8B is a tiny model with 0.8 billion parameters that specializes in text recognition. We are releasing H2OVL-Mississippi-2B, a 2 billion parameter model for general use cases, exhibiting highly competitive metrics.
Score: 4.070560738863018
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Smaller vision-language models (VLMs) are becoming increasingly important for privacy-focused, on-device applications due to their ability to run efficiently on consumer hardware for processing enterprise commercial documents and images. These models require strong language understanding and visual capabilities to enhance human-machine interaction. To address this need, we present H2OVL-Mississippi, a pair of small VLMs trained on 37 million image-text pairs using 240 hours of compute on 8 x H100 GPUs. H2OVL-Mississippi-0.8B is a tiny model with 0.8 billion parameters that specializes in text recognition, achieving state-of-the-art performance on the Text Recognition portion of OCRBench and surpassing much larger models in this area. Additionally, we are releasing H2OVL-Mississippi-2B, a 2 billion parameter model for general use cases, exhibiting highly competitive metrics across various academic benchmarks. Both models build upon our prior work with H2O-Danube language models, extending their capabilities into the visual domain. We release them under the Apache 2.0 license, making VLMs accessible to everyone, democratizing document AI and visual LLMs.

Related papers

A Pragmatic VLA Foundation Model [66.76609538850478]
We develop LingBot-VLA with around 20,000 hours of real-world data from 9 popular dual-arm robot configurations. Our model achieves clear superiority over competitors, showcasing its strong performance and broad generalizability. To advance the field of robot learning, we provide open access to the code, base model, and benchmark data.
arXiv Detail & Related papers (2026-01-26T17:08:04Z)
A Survey on Efficient Vision-Language Models [0.6597195879147555]
Vision-language models (VLMs) integrate visual and textual information, enabling a wide range of applications such as image captioning and visual question answering. High computational demands pose challenges for real-time applications. This has led to a growing focus on developing efficient vision language models.
arXiv Detail & Related papers (2025-04-13T21:12:24Z)
Improved Alignment of Modalities in Large Vision Language Models [1.4561960744147884]
We propose a training strategy for auto-regressive vision-language models. We propose four training stages for aligning the vision model with the language model. We also devise different attention masks for training transformer-based language models.
arXiv Detail & Related papers (2025-03-25T09:59:46Z)
VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM). VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety. Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z)
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks [60.5257456681402]
We build universal embedding models capable of handling a wide range of downstream tasks. Our contributions are twofold: (1) MMEB (Massive Multimodal Embedding Benchmark), which covers 4 meta-tasks (i.e. classification, visual question answering, multimodal retrieval, and visual grounding) and 36 datasets, including 20 training and 16 evaluation datasets, and (2) VLM2Vec (Vision-Language Model -> Vector), a contrastive training framework that converts any state-of-the-art vision-language model into an embedding model via training on MMEB.
arXiv Detail & Related papers (2024-10-07T16:14:05Z)
Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model [7.082567506213992]
We introduce Xmodel-VLM, a cutting-edge multimodal vision language model. It is designed for efficient deployment on consumer GPU servers.
arXiv Detail & Related papers (2024-05-15T09:47:59Z)
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites [114.22835695929682]
InternVL 1.5 is an open-source multimodal large language model (MLLM). It bridges the capability gap between open-source and proprietary commercial models in multimodal understanding.
arXiv Detail & Related papers (2024-04-25T17:59:19Z)
MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices [73.46317110474064]
MobileVLM is a competent multimodal vision language model (MMVLM) targeted to run on mobile devices. It comprises a set of language models at the scale of 1.4B and 2.7B parameters, trained from scratch, and a multimodal vision model that is pre-trained in the CLIP fashion.
arXiv Detail & Related papers (2023-12-28T08:21:24Z)
TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones [18.954681684239358]
This study introduces TinyGPT-V, a novel open-source MLLM, designed for efficient training and inference across various vision-language tasks. With a language model of 2.8 billion parameters, TinyGPT-V achieves comparable results in VQA and image inference tasks to its larger counterparts.
arXiv Detail & Related papers (2023-12-28T07:11:41Z)
YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters. YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline. The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z)
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks [92.03764152132315]
We design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters. This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks. It has powerful visual capabilities and can be a good alternative to the ViT-22B.
arXiv Detail & Related papers (2023-12-21T18:59:31Z)
V$^2$L: Leveraging Vision and Vision-language Models into Large-scale Product Retrieval [32.28772179053869]
This paper introduces our 1st-place solution in the eBay eProduct Visual Search Challenge (FGVC9). We show that combining the vision models and vision-language models brings particular benefits from their complementarity and is a key factor in our superiority.
arXiv Detail & Related papers (2022-07-26T15:53:55Z)

This list is automatically generated from the titles and abstracts of the papers on this site.