Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence
- URL: http://arxiv.org/abs/2502.09927v1
- Date: Fri, 14 Feb 2025 05:36:32 GMT
- Title: Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence
- Authors: Granite Vision Team, Leonid Karlinsky, Assaf Arbelle, Abraham Daniels, Ahmed Nassar, Amit Alfassi, Bo Wu, Eli Schwartz, Dhiraj Joshi, Jovana Kondic, Nimrod Shabtay, Pengyuan Li, Roei Herzig, Shafiq Abedin, Shaked Perek, Sivan Harary, Udi Barzelay, Adi Raz Goldfarb, Aude Oliva, Ben Wieles, Bishwaranjan Bhattacharjee, Brandon Huang, Christoph Auer, Dan Gutfreund, David Beymer, David Wood, Hilde Kuehne, Jacob Hansen, Joseph Shtok, Ken Wong, Luis Angel Bathen, Mayank Mishra, Maksym Lysak, Michele Dolfi, Mikhail Yurochkin, Nikolaos Livathinos, Nimrod Harel, Ophir Azulai, Oshri Naparstek, Rafael Teixeira de Lima, Rameswar Panda, Sivan Doveh, Shubham Gupta, Subhro Das, Syed Zawad, Yusik Kim, Zexue He, Alexander Brooks, Gabe Goodhart, Anita Govindjee, Derek Leist, Ibrahim Ibrahim, Aya Soffer, David Cox, Kate Soule, Luis Lastras, Nirmit Desai, Shila Ofek-koifman, Sriram Raghavan, Tanveer Syeda-Mahmood, Peter Staar, Tal Drory, Rogerio Feris
- Abstract summary: We introduce Granite Vision, a lightweight large language model with vision capabilities.
Our model is trained on a comprehensive instruction-following dataset.
Granite Vision achieves strong results on standard benchmarks related to visual document understanding.
- Abstract: We introduce Granite Vision, a lightweight large language model with vision capabilities, specifically designed to excel in enterprise use cases, particularly in visual document understanding. Our model is trained on a comprehensive instruction-following dataset, including document-related tasks, such as content extraction from tables, charts, diagrams, sketches, and infographics, as well as general image tasks. The architecture of Granite Vision is centered around visual modality alignment with a decoder-only, 2 billion parameter Granite large language model. Additionally, we introduce a dedicated test-time safety classification approach that leverages a sparse set of attention vectors to identify potentially harmful inputs. Despite its lightweight architecture, Granite Vision achieves strong results on standard benchmarks related to visual document understanding, as well as on the LiveXiv benchmark, which is designed to avoid test set contamination by using a constantly updated corpus of recently published arXiv papers. We are releasing the model under the Apache-2 license, allowing for both research and commercial use, while offering complete visibility into the training data and other relevant details. See https://huggingface.co/ibm-granite/ for model weights.
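For context, the released weights at the Hugging Face link above can be loaded with the transformers library. The snippet below is a minimal sketch rather than an official usage example: the model id, prompt format, and generation settings are assumptions to verify against the model card at https://huggingface.co/ibm-granite/.

```python
# Minimal sketch of loading a Granite Vision checkpoint with Hugging Face transformers.
# The model id and prompt format are assumptions; consult the model card at
# https://huggingface.co/ibm-granite/ for the released checkpoint names and recommended usage.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "ibm-granite/granite-vision-3.1-2b-preview"  # hypothetical id, verify on the Hub

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# A visual document understanding query, e.g. reading a value off a chart screenshot.
image = Image.open("chart.png")
prompt = "<image>\nWhat is the largest value shown in this chart?"

inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

If the released processor ships a chat template, processor.apply_chat_template may be the intended entry point instead of a raw "<image>" prompt; treat the above as an illustration of the generic vision-to-text interface, not the documented API.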
Related papers
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [61.143381152739046]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
- VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks [89.24440488456405]
VisionLLM v2 is an end-to-end generalist multimodal large language model (MLLM).
It unifies visual perception, understanding, and generation within a single framework.
arXiv Detail & Related papers (2024-06-12T16:44:50Z)
- Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring [27.45225442048711]
We introduce a unified high-resolution generalist model, Griffon v2, enabling flexible object referring with visual and textual prompts.
We design a simple and lightweight down-sampling projector to overcome the input-token constraint in Large Language Models.
Experiments demonstrate that Griffon v2 can localize any objects of interest with visual and textual referring, achieve state-of-the-art performance on REC, phrase grounding, and REG tasks, and outperform expert models in object detection and object counting.
arXiv Detail & Related papers (2024-03-14T12:21:37Z)
- Annotation Free Semantic Segmentation with Vision Foundation Models [11.026377387506216]
We generate free annotations for any semantic segmentation dataset using existing foundation models.
We use CLIP to detect objects and SAM to generate high quality object masks.
Our approach can bring language-based semantics to any pretrained vision encoder with minimal training.
arXiv Detail & Related papers (2024-03-14T11:57:58Z)
- InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists [66.85125112199898]
We develop a unified language interface for computer vision tasks that abstracts away task-specific design choices.
Our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models.
arXiv Detail & Related papers (2023-09-30T14:26:43Z)
- Leveraging Contextual Information for Effective Entity Salience Detection [21.30389576465761]
We show that fine-tuning medium-sized language models with a cross-encoder style architecture yields substantial performance gains over feature engineering approaches.
We also show that zero-shot prompting of instruction-tuned language models yields inferior results, indicating the task's uniqueness and complexity.
arXiv Detail & Related papers (2023-09-14T19:04:40Z)
- VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks [81.32968995346775]
VisionLLM is a framework for vision-centric tasks that can be flexibly defined and managed using language instructions.
Our model can achieve over 60% mAP on COCO, on par with detection-specific models.
arXiv Detail & Related papers (2023-05-18T17:59:42Z)