Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
- URL: http://arxiv.org/abs/2502.05177v2
- Date: Wed, 19 Feb 2025 00:10:06 GMT
- Title: Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
- Authors: Yunhang Shen, Chaoyou Fu, Shaoqi Dong, Xiong Wang, Yi-Fan Zhang, Peixian Chen, Mengdan Zhang, Haoyu Cao, Ke Li, Xiawu Zheng, Yan Zhang, Yiyi Zhou, Ran He, Caifeng Shan, Rongrong Ji, Xing Sun
- Abstract summary: Long-VITA is a large multi-modal model for long-context visual-language understanding tasks.
It concurrently processes and analyzes image, video, and text modalities over 4K frames or 1M tokens.
Long-VITA is fully reproducible and supports both NPU and GPU platforms for training and testing.
- Score: 111.1291107651131
- Abstract: We introduce Long-VITA, a simple yet effective large multi-modal model for long-context visual-language understanding tasks. It is adept at concurrently processing and analyzing image, video, and text modalities over 4K frames or 1M tokens while delivering advanced performance on short-context multi-modal tasks. We propose an effective multi-modal training schema that starts with large language models and proceeds through vision-language alignment, general knowledge learning, and two sequential stages of long-sequence fine-tuning. We further implement context-parallelism distributed inference and a logits-masked language modeling head to scale Long-VITA to infinitely long inputs of images and texts during model inference. Regarding training data, Long-VITA is built on a mix of 17M samples from public datasets only and demonstrates state-of-the-art performance on various multi-modal benchmarks compared against recent cutting-edge models trained with internal data. Long-VITA is fully reproducible and supports both NPU and GPU platforms for training and testing. By leveraging our inference designs, Long-VITA models achieve a remarkable 2x prefill speedup and 4x context length extension on a single node with 8 GPUs. We hope Long-VITA can serve as a competitive baseline and offer valuable insights for the open-source community in advancing long-context multi-modal understanding.
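To illustrate the logits-masked language modeling head mentioned in the abstract, the sketch below shows the underlying idea in PyTorch: during long-context prefill, only the positions whose next-token logits are actually needed are projected through the output head, so a full [sequence_length x vocabulary] logits tensor is never materialized. This is a minimal sketch under our own assumptions, not the authors' released implementation; the names `masked_lm_head` and `keep_positions` are hypothetical.

```python
# Minimal sketch of a logits-masked LM head: project only the positions whose
# logits are actually needed (e.g., the last prompt token during prefill)
# instead of all sequence positions. Names are illustrative and are not taken
# from the Long-VITA codebase.
import torch
import torch.nn as nn


def masked_lm_head(hidden_states: torch.Tensor,
                   lm_head: nn.Linear,
                   keep_positions: torch.Tensor) -> torch.Tensor:
    """hidden_states: [batch, seq_len, hidden]; keep_positions: [num_keep] indices."""
    # Gather only the hidden states we need logits for; with a 1M-token prompt
    # and a 100K+ vocabulary this avoids a roughly [1M x vocab] logits tensor.
    selected = hidden_states[:, keep_positions, :]   # [batch, num_keep, hidden]
    return lm_head(selected)                         # [batch, num_keep, vocab]


if __name__ == "__main__":
    batch, seq_len, hidden, vocab = 1, 4096, 1024, 32000  # toy sizes
    h = torch.randn(batch, seq_len, hidden)
    head = nn.Linear(hidden, vocab, bias=False)
    # During prefill only the last position is needed to sample the next token.
    logits = masked_lm_head(h, head, torch.tensor([seq_len - 1]))
    print(logits.shape)  # torch.Size([1, 1, 32000])
```

In a context-parallel setting, one would expect only the rank holding the final sequence shard to run this projection, which is presumably how the head composes with the distributed inference design described in the abstract.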
Related papers
- V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding [40.784423313750075]
Vision-Language Models (VLMs) have shown promising capabilities in handling various multimodal tasks, yet they struggle in long-context scenarios.
We propose a novel positional encoding approach that employs variable increments for visual tokens, enabling more efficient management of long multimodal sequences.
We show that the fine-tuned model achieves strong performance on both standard and long-context multimodal tasks.
arXiv Detail & Related papers (2024-12-12T18:59:46Z)
- Long Context Transfer from Language to Vision [74.78422371545716]
Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos.
In this paper, we approach this problem from the perspective of the language model.
By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training.
arXiv Detail & Related papers (2024-06-24T17:58:06Z)
- From Text to Pixel: Advancing Long-Context Understanding in MLLMs [70.78454154014989]
We introduce SEEKER, a multimodal large language model designed to tackle this issue.
SEEKER aims to optimize the compact encoding of long text by compressing the text sequence into the visual pixel space via images.
Our experiments on six long-context multimodal tasks demonstrate that SEEKER can leverage fewer image tokens to convey the same amount of textual information compared with the OCR-based approach.
arXiv Detail & Related papers (2024-05-23T06:17:23Z)
- World Model on Million-Length Video And Language With Blockwise RingAttention [75.82014160713348]
We set new benchmarks in language retrieval and demonstrate new capabilities in long video understanding.
We present an efficient open-source implementation for scalable training on long sequences.
We open-source a family of 7B parameter models capable of processing long text documents and videos exceeding 1M tokens.
arXiv Detail & Related papers (2024-02-13T07:47:36Z)
- Long-range Multimodal Pretraining for Movie Understanding [79.63187251571391]
We introduce Long-range Multimodal Pretraining, a strategy and a model that leverage movie data to train transferable multimodal and cross-modal encoders.
Our key idea is to learn from all modalities in a movie by observing and extracting relationships over a long range.
Our model achieves state-of-the-art on several LVU tasks while being much more data efficient than previous works.
arXiv Detail & Related papers (2023-08-18T18:52:59Z)
- A Survey of Vision-Language Pre-training from the Lens of Multimodal Machine Translation [13.426403221815063]
This paper surveys the landscape of language-and-vision pre-training from the lens of multimodal machine translation.
We summarize the common architectures, pre-training objectives, and datasets from literature and conjecture what further is needed to make progress on multimodal machine translation.
arXiv Detail & Related papers (2023-06-12T15:56:10Z)
- ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training [40.05046655477684]
ERNIE-ViL 2.0 is a multi-view contrastive learning framework that builds intra-modal and inter-modal correlations between diverse views simultaneously.
We construct sequences of object tags as a special textual view to narrow the cross-modal semantic gap on noisy image-text pairs.
ERNIE-ViL 2.0 achieves competitive results on English cross-modal retrieval.
arXiv Detail & Related papers (2022-09-30T07:20:07Z)
- Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z)