POINTS1.5: Building a Vision-Language Model towards Real World Applications
- URL: http://arxiv.org/abs/2412.08443v1
- Date: Wed, 11 Dec 2024 15:08:25 GMT
- Title: POINTS1.5: Building a Vision-Language Model towards Real World Applications
- Authors: Yuan Liu, Le Tian, Xiao Zhou, Xinyu Gao, Kavio Yu, Yang Yu, Jie Zhou
- Abstract summary: We introduce a new vision-language model, POINTS1.5, designed to excel in various real-world applications.
We replace the original CLIP vision encoder, which had a fixed image resolution, with a NaViT-style vision encoder that supports native dynamic high resolution.
We add bilingual support to POINTS1.5, significantly enhancing its capability in Chinese.
- Score: 33.53291443466257
- Abstract: Vision-language models have made significant strides recently, demonstrating superior performance across a range of tasks, e.g., optical character recognition and complex diagram analysis. Building on this trend, we introduce a new vision-language model, POINTS1.5, designed to excel in various real-world applications. POINTS1.5 is an enhancement of POINTS1.0 and incorporates several key innovations: i) We replace the original CLIP vision encoder, which had a fixed image resolution, with a NaViT-style vision encoder that supports native dynamic high resolution. This allows POINTS1.5 to process images of any resolution without needing to split them into tiles. ii) We add bilingual support to POINTS1.5, significantly enhancing its capability in Chinese. Due to the scarcity of open-source Chinese datasets for vision-language models, we collect numerous images from the Internet and annotate them using a combination of manual and automatic methods. iii) We propose a set of rigorous filtering methods for visual instruction tuning datasets. We comprehensively evaluate all these filtering methods and choose the most effective ones to obtain the final visual instruction tuning set. Thanks to these innovations, POINTS1.5 significantly outperforms POINTS1.0 and demonstrates strong performance across a range of real-world applications. Notably, POINTS1.5-7B is trained on fewer than 4 billion tokens and ranks first on the OpenCompass leaderboard among models with fewer than 10 billion parameters.
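Point i) is the most concrete architectural detail in the abstract. The sketch below is an illustration only, not the paper's implementation: the patch size, embedding width, and function name are assumptions. It shows the general NaViT-style idea of embedding an image of arbitrary resolution as a variable-length sequence of fixed-size patches, so nothing has to be resized to a canonical resolution or split into tiles.

```python
import numpy as np

PATCH = 14          # assumed ViT patch size; POINTS1.5's actual value is not stated here
EMBED_DIM = 1024    # assumed embedding width, for illustration only

# Random matrix standing in for a trained linear patch-embedding layer.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(PATCH * PATCH * 3, EMBED_DIM))

def navit_style_patchify(image: np.ndarray) -> np.ndarray:
    """Embed an RGB image of (almost) any resolution as a variable-length token sequence.

    Unlike a fixed-resolution CLIP encoder, there is no resizing to a canonical size
    and no tiling into sub-images: the sequence length simply grows with the input.
    """
    h, w, c = image.shape
    assert c == 3, "expects an H x W x 3 RGB array"
    # Crop to a multiple of the patch size (a real implementation would pad instead).
    h, w = h - h % PATCH, w - w % PATCH
    image = image[:h, :w]
    # Split into non-overlapping PATCH x PATCH patches and flatten each one.
    patches = image.reshape(h // PATCH, PATCH, w // PATCH, PATCH, 3)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, PATCH * PATCH * 3)
    # Linear patch embedding -> (num_patches, EMBED_DIM) token sequence.
    return patches @ W

print(navit_style_patchify(np.zeros((224, 224, 3))).shape)    # (256, 1024)
print(navit_style_patchify(np.zeros((896, 1344, 3))).shape)   # (6144, 1024)
```

Because the token count scales with resolution, high-resolution documents and diagrams can be fed to the language model directly, without the tiling step that fixed-resolution encoders typically require.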
Related papers
- MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning [74.34171839925114]
We present MM1.5, a new family of multimodal large language models (MLLMs).
Our models range from 1B to 30B parameters, encompassing both dense and mixture-of-experts (MoE) variants.
We provide detailed insights into the training processes and decisions that inform our final designs.
arXiv Detail & Related papers (2024-09-30T17:59:34Z)
- LowCLIP: Adapting the CLIP Model Architecture for Low-Resource Languages in Multimodal Image Retrieval Task [0.0]
This research explores the development of vision-language models for image retrieval in low-resource languages, specifically Azerbaijani.
We integrated the CLIP model architecture and employed several techniques to balance computational efficiency with performance.
Our study found that models like EfficientNet0 and Tiny Swin Transformer perform best on the datasets they were trained on.
arXiv Detail & Related papers (2024-08-25T18:10:16Z)
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new dataset, VEGA, tailored to the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites [114.22835695929682]
InternVL 1.5 is an open-source multimodal large language model (MLLM) that bridges the capability gap between open-source and proprietary commercial models in multimodal understanding.
arXiv Detail & Related papers (2024-04-25T17:59:19Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain of up to 6.03% over the previous state of the art, MaPLe, on novel classes across 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z)
- UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model [108.85584502396182]
We propose UReader, a first exploration of universal OCR-free visually-situated language understanding based on the Multimodal Large Language Model (MLLM).
By leveraging the shallow text recognition ability of the MLLM, we finetune only 1.2% of the parameters (a generic sketch of this parameter-freezing pattern follows this list).
Our single model achieves state-of-the-art OCR-free performance on 8 out of 10 visually-situated language understanding tasks.
arXiv Detail & Related papers (2023-10-08T11:33:09Z)
- SVIT: Scaling up Visual Instruction Tuning [26.794950789335402]
We build a visual instruction tuning dataset of 4.2 million samples, including 1.6M conversation question-answer (QA) pairs, 1.6M complex reasoning QA pairs, 1.0M referring QA pairs, and 106K detailed image descriptions.
Experiments verify that SVIT-v1.5, trained on the proposed dataset, outperforms state-of-the-art Multimodal Large Language Models on popular benchmarks.
arXiv Detail & Related papers (2023-07-09T03:25:14Z)
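The UReader entry above reports updating only 1.2% of its parameters. The sketch below is not UReader's code; it only illustrates, with assumed module names and sizes, the generic parameter-efficient finetuning pattern of freezing a large backbone and training a small residual adapter so that only a tiny fraction of parameters receives gradients.

```python
import torch
import torch.nn as nn

class TinyAdapter(nn.Module):
    """A small bottleneck module trained on top of a frozen backbone (hypothetical)."""
    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))  # residual adapter

# Stand-in for a large pretrained backbone; real models are far bigger.
backbone = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(24)])
adapter = TinyAdapter(1024)

# Freeze the backbone; only the adapter's parameters receive gradients.
for p in backbone.parameters():
    p.requires_grad_(False)

trainable = sum(p.numel() for p in adapter.parameters())
total = trainable + sum(p.numel() for p in backbone.parameters())
print(f"trainable fraction: {trainable / total:.2%}")

optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)
x = torch.randn(8, 1024)
loss = adapter(backbone(x)).pow(2).mean()  # dummy objective for illustration
loss.backward()
optimizer.step()
```

The exact trainable fraction depends on the adapter and backbone sizes chosen above; the point is only that the optimizer sees just the adapter's parameters, so training cost and memory scale with the small module rather than the full model.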