Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models
- URL: http://arxiv.org/abs/2504.15271v1
- Date: Mon, 21 Apr 2025 17:57:28 GMT
- Title: Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models
- Authors: Guo Chen, Zhiqi Li, Shihao Wang, Jindong Jiang, Yicheng Liu, Lidong Lu, De-An Huang, Wonmin Byeon, Matthieu Le, Tuomas Rintamaki, Tyler Poon, Max Ehrlich, Tong Lu, Limin Wang, Bryan Catanzaro, Jan Kautz, Andrew Tao, Zhiding Yu, Guilin Liu
- Abstract summary: We introduce Eagle 2.5, a family of frontier vision-language models (VLMs) for long-context multimodal learning. Our work addresses the challenges in long video comprehension and high-resolution image understanding. We propose Eagle-Video-110K, a novel dataset that integrates both story-level and clip-level annotations.
- Score: 90.10322077894033
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We introduce Eagle 2.5, a family of frontier vision-language models (VLMs) for long-context multimodal learning. Our work addresses the challenges in long video comprehension and high-resolution image understanding, introducing a generalist framework for both tasks. The proposed training framework incorporates Automatic Degrade Sampling and Image Area Preservation, two techniques that preserve contextual integrity and visual details. The framework also includes numerous efficiency optimizations in the pipeline for long-context data training. Finally, we propose Eagle-Video-110K, a novel dataset that integrates both story-level and clip-level annotations, facilitating long-video understanding. Eagle 2.5 demonstrates substantial improvements on long-context multimodal benchmarks, providing a robust solution to the limitations of existing VLMs. Notably, our best model Eagle 2.5-8B achieves 72.4% on Video-MME with 512 input frames, matching the results of top-tier commercial models such as GPT-4o and large-scale open-source models like Qwen2.5-VL-72B and InternVL2.5-78B.
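The abstract names Image Area Preservation only at a high level. As a rough, hypothetical sketch of the underlying idea (choosing an image tiling that keeps the original area and aspect ratio as intact as possible), the snippet below scores candidate tile grids with simple heuristic costs; the function name, the 448-pixel tile size, the tile budget, and the cost weights are illustrative assumptions, not the paper's actual algorithm.

```python
from math import inf

def best_tile_grid(width, height, tile=448, max_tiles=12):
    """Pick a (cols, rows) tile grid for a high-resolution image.

    Hypothetical sketch: scores candidate grids by aspect-ratio
    distortion plus area loss, in the spirit of "Image Area
    Preservation"; the real training framework may differ.
    """
    aspect = width / height
    best, best_cost = (1, 1), inf
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            grid_aspect = cols / rows
            # Relative aspect-ratio distortion of the resized canvas.
            aspect_cost = abs(grid_aspect - aspect) / aspect
            # Relative gap between the tiled area and the original area.
            area_cost = abs(cols * rows * tile * tile - width * height) / (width * height)
            cost = aspect_cost + 0.5 * area_cost  # heuristic weighting
            if cost < best_cost:
                best, best_cost = (cols, rows), cost
    return best  # resize to (cols*tile, rows*tile), then split into tiles

# A 3840x2160 frame picks a 4x2 grid of 448x448 tiles under these weights.
print(best_tile_grid(3840, 2160))
```

Automatic Degrade Sampling can be read in the same spirit: under a fixed context budget, sampling density would be traded off against per-frame detail rather than truncating the textual context, though the abstract does not spell out the exact policy.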
Related papers
- LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding [29.719450799231705]
Vision-Language Models (VLMs) obtain frame-level understanding capabilities through multi-frame input.
Video Large Language Models (Video-LLMs) capture temporal relationships within visual features but are limited by the scarcity of high-quality video-text datasets.
We propose Lightweight Video Compression (LVC), a novel method featuring the Query-Attention Video Compression mechanism.
arXiv Detail & Related papers (2025-04-09T12:51:10Z) - SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding [70.84791600974337]
We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs). We incorporate the two-stream SlowFast mechanism into a streamlined training pipeline. We perform joint video-image training on a carefully curated data mixture of only publicly available datasets.
arXiv Detail & Related papers (2025-03-24T17:59:07Z) - AMD-Hummingbird: Towards an Efficient Text-to-Video Model [12.09360569154206]
Text-to-Video (T2V) generation has attracted significant attention for its ability to synthesize realistic videos from textual descriptions.
Most prior work prioritizes visual fidelity while overlooking the need for smaller, more efficient models suitable for real-world deployment.
We propose a lightweight T2V framework, termed Hummingbird, which prunes existing models and enhances visual quality through visual feedback learning.
arXiv Detail & Related papers (2025-03-24T11:13:33Z) - InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model [80.93387166769679]
We present IXC-2.5-Reward, a simple yet effective multi-modal reward model that aligns Large Vision Language Models with human preferences.
IXC-2.5-Reward achieves excellent results on the latest multi-modal reward model benchmark and shows competitive performance on text-only reward model benchmarks.
arXiv Detail & Related papers (2025-01-21T18:47:32Z) - GIRAFFE: Design Choices for Extending the Context Length of Visual Language Models [20.976319536167512]
We aim to establish an effective solution that enhances the long-context performance of Visual Language Models.
We propose Giraffe, which effectively extends the context length to 128K.
We will open-source the code, data, and models.
arXiv Detail & Related papers (2024-12-17T09:57:21Z) - Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling [128.24325909395188]
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0. InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems.
arXiv Detail & Related papers (2024-12-06T18:57:08Z) - LongVILA: Scaling Long-Context Visual Language Models for Long Videos [86.28679075537089]
LongVILA is a full-stack solution for long-context visual-language models. LongVILA efficiently extends the number of video frames of VILA from 8 to 2048, achieving 99.8% accuracy in 6,000-frame (more than 1 million tokens) video needle-in-a-haystack.
arXiv Detail & Related papers (2024-08-19T17:48:08Z) - InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output [138.18086961321146]
InternLM-XComposer-2.5 (IXC-2.5) is a versatile large vision-language model that supports long-contextual input and output.
IXC-2.5 excels in various text-image comprehension and composition applications.
IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 benchmarks.
arXiv Detail & Related papers (2024-07-03T17:59:21Z) - Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis [118.08008540513596]
Video-MME is the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis.
We extensively evaluate various state-of-the-art MLLMs, including GPT-4 series and Gemini 1.5 Pro, as well as open-source image models.
Our experiments reveal that Gemini 1.5 Pro is the best-performing commercial model, significantly outperforming the open-source models.
arXiv Detail & Related papers (2024-05-31T17:59:47Z)