World Model on Million-Length Video And Language With Blockwise RingAttention
- URL: http://arxiv.org/abs/2402.08268v4
- Date: Mon, 03 Feb 2025 21:47:31 GMT
- Title: World Model on Million-Length Video And Language With Blockwise RingAttention
- Authors: Hao Liu, Wilson Yan, Matei Zaharia, Pieter Abbeel
- Abstract summary: We set new benchmarks in language retrieval and demonstrate new capabilities in long video understanding.
We present an efficient open-source implementation for scalable training on long sequences.
We open-source a family of 7B parameter models capable of processing long text documents and videos exceeding 1M tokens.
- Score: 75.82014160713348
- Abstract: Enabling long-context understanding remains a key challenge in scaling existing sequence models -- a crucial component in developing generally intelligent models that can process and operate over long temporal horizons that potentially consist of millions of tokens. In this paper, we aim to address these challenges by providing a comprehensive exploration of the full development process for producing 1M context language models and video-language models, setting new benchmarks in language retrieval and new capabilities in long video understanding. We detail our long context data curation process, progressive context extension from 4K to 1M tokens, and present an efficient open-source implementation for scalable training on long sequences. Additionally, we open-source a family of 7B parameter models capable of processing long text documents and videos exceeding 1M tokens.
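The efficient implementation referenced in the abstract builds on Blockwise RingAttention. As a rough illustration only (not the authors' released code), the JAX sketch below shows the blockwise-attention core of that idea: queries are processed in blocks, key/value blocks are streamed past each query block, and softmax statistics are accumulated online so the full seq_len x seq_len attention matrix is never materialized. Function name, shapes, and block sizes are illustrative assumptions.

```python
# Minimal sketch of blockwise attention with an online softmax (illustrative,
# not the LWM/RingAttention release). RingAttention additionally rotates the
# key/value blocks around a ring of devices; here everything stays on one host.
import jax.numpy as jnp

def blockwise_attention(q, k, v, block_size=128):
    """q, k, v: [seq_len, num_heads, head_dim] -> [seq_len, num_heads, head_dim]."""
    seq_len, num_heads, head_dim = q.shape
    scale = 1.0 / jnp.sqrt(head_dim)
    outputs = []
    for qs in range(0, seq_len, block_size):
        q_blk = q[qs:qs + block_size] * scale                # [bq, h, d]
        acc = jnp.zeros_like(q_blk)                          # running weighted-value sum
        row_max = jnp.full(q_blk.shape[:2], -jnp.inf)        # running max logit, [bq, h]
        row_sum = jnp.zeros(q_blk.shape[:2])                 # running softmax denominator
        for ks in range(0, seq_len, block_size):
            k_blk = k[ks:ks + block_size]                    # [bk, h, d]
            v_blk = v[ks:ks + block_size]
            scores = jnp.einsum('qhd,khd->qhk', q_blk, k_blk)    # [bq, h, bk]
            new_max = jnp.maximum(row_max, scores.max(axis=-1))
            correction = jnp.exp(row_max - new_max)              # rescale old statistics
            probs = jnp.exp(scores - new_max[..., None])
            acc = acc * correction[..., None] + jnp.einsum('qhk,khd->qhd', probs, v_blk)
            row_sum = row_sum * correction + probs.sum(axis=-1)
            row_max = new_max
        outputs.append(acc / row_sum[..., None])
    return jnp.concatenate(outputs, axis=0)
```

Because each inner step only needs one key/value block in memory, the same loop structure can be distributed across devices, which is what allows context lengths in the millions of tokens without quadratic memory.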
Related papers
- Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy [111.1291107651131]
Long-VITA is a large multi-modal model for long-context visual-language understanding tasks.
It can concurrently process and analyze image, video, and text modalities spanning over 4K frames or 1M tokens.
Long-VITA is fully reproducible and supports both NPU and GPU platforms for training and testing.
arXiv Detail & Related papers (2025-02-07T18:59:56Z)
- Bootstrap Your Own Context Length [74.61148597039248]
We introduce a bootstrapping approach to train long-context language models by exploiting their short-context capabilities only.
The proposed data synthesis workflow requires only a short-context language model, a text retriever, and a document collection.
We conduct experiments with the open-source Llama-3 family of models and demonstrate that our method can successfully extend the context length to up to 1M tokens.
arXiv Detail & Related papers (2024-12-25T10:08:54Z)
- Long Context Transfer from Language to Vision [74.78422371545716]
Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos.
In this paper, we approach this problem from the perspective of the language model.
By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training.
arXiv Detail & Related papers (2024-06-24T17:58:06Z)
- Scaling Up Video Summarization Pretraining with Large Language Models [73.74662411006426]
We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset.
We analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them.
Our work also presents a new benchmark dataset of 1,200 long videos, each with a high-quality summary annotated by professionals.
arXiv Detail & Related papers (2024-04-04T11:59:06Z)
- BAMBOO: A Comprehensive Benchmark for Evaluating Long Text Modeling Capacities of Large Language Models [141.21603469555225]
Large language models (LLMs) have achieved strong proficiency on NLP tasks of normal length.
We propose BAMBOO, a multi-task long context benchmark.
It consists of 10 datasets from 5 different long text understanding tasks.
arXiv Detail & Related papers (2023-09-23T11:36:15Z)