InternLM-XComposer: A Vision-Language Large Model for Advanced
Text-image Comprehension and Composition
- URL: http://arxiv.org/abs/2309.15112v5
- Date: Thu, 14 Dec 2023 17:21:39 GMT
- Title: InternLM-XComposer: A Vision-Language Large Model for Advanced
Text-image Comprehension and Composition
- Authors: Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang,
Zhiyuan Zhao, Haodong Duan, Songyang Zhang, Shuangrui Ding, Wenwei Zhang,
Hang Yan, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng
Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang
- Abstract summary: InternLM-XComposer is a vision-language large model that enables advanced image-text comprehension and composition.
It can effortlessly generate coherent and contextual articles that seamlessly integrate images.
It can intelligently identify the areas in the text where images would enhance the content and automatically insert the most appropriate visual candidates.
- Score: 111.65584066987036
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose InternLM-XComposer, a vision-language large model that enables
advanced image-text comprehension and composition. The innovative nature of our
model is highlighted by three appealing properties: 1) Interleaved Text-Image
Composition: InternLM-XComposer can effortlessly generate coherent and
contextual articles that seamlessly integrate images, providing a more engaging
and immersive reading experience. Simply provide a writing instruction, and our
system will generate the corresponding manuscript. It can intelligently
identify the areas in the text where images would enhance the content and
automatically insert the most appropriate visual candidates. 2) Comprehension
with Rich Multilingual Knowledge: The text-image comprehension is empowered by
training on an extensive multi-modal multilingual database with carefully
crafted strategies, resulting in a deep understanding of visual content. 3)
State-of-the-art Performance: Our model consistently achieves state-of-the-art
results across various mainstream benchmarks for vision-language foundational
models, including MME Benchmark, MMBench, MMBench-CN, Seed-Bench, CCBench
(Chinese Cultural Benchmark), QBench and Tiny LVLM. Owing to the absence of
established metrics for quantitatively assessing text-image composition, we
have devised a robust evaluation procedure that comprises both human and
GPT4-Vision (GPT4-V) assessments to ensure reliability. Notably, our InternLM-XComposer
achieves competitive text-image composition scores compared to public
solutions, including GPT4-V and GPT3.5. Collectively, InternLM-XComposer
seamlessly blends advanced text-image comprehension and composition,
revolutionizing vision-language interaction and offering new insights and
opportunities. The InternLM-XComposer model series is publicly available at
https://github.com/InternLM/InternLM-XComposer.
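As a purely illustrative aside, and not the paper's actual pipeline, the interleaved-composition idea in property 1 (generate an article, find paragraphs where an image would help, and pick the most suitable visual candidate) can be sketched with an off-the-shelf CLIP similarity score. The interleave helper, the openai/clip-vit-base-patch32 checkpoint, and the 0.25 threshold below are assumptions of this sketch only.

```python
# Toy sketch of interleaved text-image composition: walk through the generated
# paragraphs and, after any paragraph that matches one of the candidate images
# well enough, insert the best-matching (still unused) image. Similarity here
# is plain CLIP cosine similarity; the released model's own placement and
# selection mechanism is described in the paper and repository above.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def interleave(paragraphs, image_paths, threshold=0.25):
    """Return a list mixing paragraph strings and PIL images (hypothetical helper)."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    with torch.no_grad():
        text_inputs = processor(text=paragraphs, return_tensors="pt",
                                padding=True, truncation=True)
        image_inputs = processor(images=images, return_tensors="pt")
        t = clip.get_text_features(**text_inputs)
        v = clip.get_image_features(**image_inputs)
    t = t / t.norm(dim=-1, keepdim=True)
    v = v / v.norm(dim=-1, keepdim=True)
    sim = t @ v.T                                  # [n_paragraphs, n_images]

    article, used = [], set()
    for i, paragraph in enumerate(paragraphs):
        article.append(paragraph)
        best = int(sim[i].argmax())
        if best not in used and sim[i, best] > threshold:   # 0.25 is arbitrary
            article.append(images[best])
            used.add(best)
    return article
```

In practice the released model handles text generation, insertion-point identification, and candidate selection end to end; the snippet only makes the placement step concrete.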
Related papers
- ComAlign: Compositional Alignment in Vision-Language Models [2.3250871476216814]
We introduce Compositional Alignment (ComAlign) to discover more exact correspondence of text and image components.
Our methodology emphasizes that the compositional structure extracted from the text modality must also be retained in the image modality.
We train a lightweight network on top of existing visual and language encoders using a small dataset.
arXiv Detail & Related papers (2024-09-12T16:46:41Z)
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA), which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
- StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image Perception, Comprehension, and Beyond [68.0107158115377]
We have crafted an efficient vision-language model, StrucTexTv3, tailored to tackle various intelligent tasks for text-rich images.
We enhance the perception and comprehension abilities of StrucTexTv3 through instruction learning.
Our method achieved SOTA results in text-rich image perception tasks, and significantly improved performance in comprehension tasks.
arXiv Detail & Related papers (2024-05-31T16:55:04Z)
- SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension [62.40482764691584]
We introduce SEED-Bench-2-Plus, a benchmark specifically designed for evaluating text-rich visual comprehension of MLLMs.
Our benchmark comprises 2.3K multiple-choice questions with precise human annotations, spanning three broad categories: Charts, Maps, and Webs.
We conduct a thorough evaluation involving 34 prominent MLLMs and emphasize the current limitations of MLLMs in text-rich visual comprehension.
arXiv Detail & Related papers (2024-04-25T17:39:35Z)
- FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark.
FineMatch focuses on text and image mismatch detection and correction.
We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z)
- InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model [108.42241250772643]
We introduce InternLM-XComposer2, a vision-language model excelling in free-form text-image composition and comprehension.
This model goes beyond conventional vision-language understanding, adeptly crafting interleaved text-image content from diverse inputs.
Experimental results demonstrate the superiority of InternLM-XComposer2 based on InternLM2-7B in producing high-quality long-text multi-modal content.
arXiv Detail & Related papers (2024-01-29T18:59:02Z)
- Ziya-Visual: Bilingual Large Vision-Language Model via Multi-Task Instruction Tuning [27.544311403607786]
We introduce the Ziya-Visual series, a set of bilingual large-scale vision-language models (LVLMs).
Our models adopt the Querying Transformer from BLIP-2 and further explore the benefit of additional optimization schemes.
In addition, we leverage GPT-4's understanding ability in multi-modal scenarios to translate our gathered English image-text datasets into Chinese.
arXiv Detail & Related papers (2023-10-12T09:39:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.