Fugu-MT Paper Translation (Abstract): InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition

Paper overview: InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition

arxiv url: http://arxiv.org/abs/2309.15112v5
Date: Thu, 14 Dec 2023 17:21:39 GMT
Status: Translation complete
Last updated in system: 2023-12-16 04:06:55.522383
Title: InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
Title (reference translation): InternLM-XComposer: A Vision-Language Large Model for Advanced Text-Image Understanding and Composition
Authors: Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Haodong Duan, Songyang Zhang, Shuangrui Ding, Wenwei Zhang, Hang Yan, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang
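Abstract summary: InternLM-XComposer is a vision-language large model that enables advanced image-text comprehension and composition. It can generate coherent, contextual articles that seamlessly integrate images. It intelligently identifies the places in the text where an image would enhance the content and automatically inserts the most appropriate visual candidates.
Reference score (independently computed attention score): 111.65584066987036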
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose InternLM-XComposer, a vision-language large model that enables advanced image-text comprehension and composition. The innovative nature of our model is highlighted by three appealing properties: 1) Interleaved Text-Image Composition: InternLM-XComposer can effortlessly generate coherent and contextual articles that seamlessly integrate images, providing a more engaging and immersive reading experience. Simply provide a writing instruction, and our system will generate the corresponding manuscript. It can intelligently identify the areas in the text where images would enhance the content and automatically insert the most appropriate visual candidates. 2) Comprehension with Rich Multilingual Knowledge: The text-image comprehension is empowered by training on an extensive multi-modal multilingual database with carefully crafted strategies, resulting in a deep understanding of visual content. 3) State-of-the-art Performance: Our model consistently achieves state-of-the-art results across various mainstream benchmarks for vision-language foundational models, including MME Benchmark, MMBench, MMBench-CN, Seed-Bench, CCBench (Chinese Cultural Benchmark), QBench and Tiny LVLM. Owing to the absence of established metrics for quantitatively assessing text-image composition, we have devised a robust evaluation procedure that comprises both human and GPT4-Vision (GPT4-V) to ensure reliability. Notably, our InternLM-XComposer achieves competitive text-image composition scores compared to public solutions, including GPT4-V and GPT3.5. Collectively, InternLM-XComposer seamlessly blends advanced text-image comprehension and composition, revolutionizing vision-language interaction and offering new insights and opportunities. The InternLM-XComposer model series are publicly available at https://github.com/InternLM/InternLM-XComposer.
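Abstract (reference translation): InternLM-XComposer is a vision-language large model that enables advanced image-text comprehension and composition. The innovative nature of the model is highlighted by three appealing properties. 1) Interleaved Text-Image Composition: InternLM-XComposer can generate coherent and contextual articles that seamlessly integrate images and provide a more engaging and immersive reading experience. Given only a writing instruction, the system generates the corresponding manuscript. It intelligently identifies the places in the text where an image would enhance the content and automatically inserts the most appropriate visual candidates. 2) Comprehension with Rich Multilingual Knowledge: text-image comprehension is empowered by training on an extensive multi-modal multilingual database with carefully crafted strategies, resulting in a deep understanding of visual content. 3) State-of-the-art Performance: the model consistently achieves state-of-the-art results across various mainstream benchmarks for vision-language foundational models, including MME Benchmark, MMBench, MMBench-CN, Seed-Bench, CCBench (China Cultural Benchmark), QBench, and Tiny LVLM. Because there are no established metrics for quantitatively assessing text-image composition, the authors devised a robust evaluation procedure that involves both humans and GPT4-Vision (GPT4-V) to ensure reliability. Notably, InternLM-XComposer achieves competitive text-image composition scores compared to public solutions such as GPT4-V and GPT3.5. Collectively, InternLM-XComposer seamlessly blends advanced text-image comprehension and composition, revolutionizing vision-language interaction and offering new insights and opportunities. The InternLM-XComposer model series is publicly available at https://github.com/InternLM/InternLM-XComposer.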

Related papers