Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input
- URL: http://arxiv.org/abs/2408.15542v1
- Date: Wed, 28 Aug 2024 05:34:14 GMT
- Title: Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input
- Authors: Jiajun Liu, Yibing Wang, Hanghang Ma, Xiaoping Wu, Xiaoqi Ma, Xiaoming Wei, Jianbin Jiao, Enhua Wu, Jie Hu
- Abstract summary: Kangaroo is a powerful Video LMM aimed at addressing the challenges of processing long videos.
A data curation system builds a large-scale dataset with high-quality annotations for vision-language pre-training and instruction tuning.
A curriculum training pipeline gradually increases the resolution and number of input frames to accommodate long videos.
- Score: 34.50993235961505
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Rapid advancements have been made in extending Large Language Models (LLMs) to Large Multi-modal Models (LMMs). However, extending the input modality of LLMs to video data remains a challenging endeavor, especially for long videos. Due to insufficient access to large-scale, high-quality video data and the excessive compression of visual features, current methods exhibit limitations in effectively processing long videos. In this paper, we introduce Kangaroo, a powerful Video LMM aimed at addressing these challenges. Confronted with the issue of inadequate training data, we develop a data curation system to build a large-scale dataset with high-quality annotations for vision-language pre-training and instruction tuning. In addition, we design a curriculum training pipeline that gradually increases the resolution and number of input frames to accommodate long videos. Evaluation results demonstrate that, with 8B parameters, Kangaroo achieves state-of-the-art performance on a variety of video understanding benchmarks while remaining competitive on others. In particular, on benchmarks specialized for long videos, Kangaroo surpasses some models with over 10B parameters as well as proprietary models.
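The abstract names a curriculum training pipeline but does not give its schedule; the snippet below is a minimal, hypothetical sketch of how a stage-wise curriculum over resolution and frame count could be wired up. The stage values and the `build_dataloader`/`train_stage` helpers are illustrative assumptions, not details taken from the Kangaroo paper.

```python
# Hypothetical sketch of a curriculum schedule that gradually increases the
# input resolution and number of sampled frames per video, in the spirit of
# the pipeline the abstract describes. All concrete values and helper
# functions here are illustrative placeholders, not details from the paper.
from dataclasses import dataclass

@dataclass
class CurriculumStage:
    name: str
    resolution: int   # square input resolution per frame (pixels)
    num_frames: int   # frames sampled per video
    steps: int        # optimizer steps to run for this stage

STAGES = [
    CurriculumStage("pretrain-short",  resolution=224, num_frames=8,  steps=10_000),
    CurriculumStage("pretrain-medium", resolution=336, num_frames=32, steps=10_000),
    CurriculumStage("instruct-long",   resolution=448, num_frames=64, steps=5_000),
]

def run_curriculum(model, build_dataloader, train_stage):
    """Train stage by stage, rebuilding the dataloader so each stage feeds
    longer, higher-resolution clips than the previous one."""
    for stage in STAGES:
        loader = build_dataloader(resolution=stage.resolution,
                                  num_frames=stage.num_frames)
        train_stage(model, loader, steps=stage.steps)
```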
Related papers
- SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis [52.050036778325094]
We introduce SALOVA: Segment-Augmented Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content.
We present a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich context.
Our framework mitigates the limitations of current video-LMMs by allowing for precise identification and retrieval of relevant video segments in response to queries; a generic retrieval sketch of this idea appears after this list.
arXiv Detail & Related papers (2024-11-25T08:04:47Z) - Video Instruction Tuning With Synthetic Data [84.64519990333406]
We create a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K.
This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA.
By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM.
arXiv Detail & Related papers (2024-10-03T17:36:49Z) - Visual Context Window Extension: A New Perspective for Long Video Understanding [45.134271969594614]
We tackle the challenge of long video understanding from the perspective of context windows.
We propose to adapt LMMs for long video understanding tasks by extending the visual context window.
Our method consistently improves the performance as the number of video frames increases.
arXiv Detail & Related papers (2024-09-30T07:25:16Z) - Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding [26.72068455284472]
Video-XL is an extra-long vision language model designed for efficient hour-scale video understanding.
Our model achieves promising results on popular long video understanding benchmarks.
arXiv Detail & Related papers (2024-09-22T15:13:31Z) - LVBench: An Extreme Long Video Understanding Benchmark [38.839913137854104]
We introduce LVBench, a benchmark specifically designed for long video understanding.
Our dataset comprises publicly sourced videos and encompasses a diverse set of tasks aimed at long video comprehension and information extraction.
arXiv Detail & Related papers (2024-06-12T09:36:52Z) - CinePile: A Long Video Question Answering Dataset and Benchmark [55.30860239555001]
We present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding.
Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects.
We fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset.
arXiv Detail & Related papers (2024-05-14T17:59:02Z) - Scaling Up Video Summarization Pretraining with Large Language Models [73.74662411006426]
We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset.
We analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them.
Our work also presents a new benchmark dataset that contains 1200 long videos each with high-quality summaries annotated by professionals.
arXiv Detail & Related papers (2024-04-04T11:59:06Z) - LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding.
We encode video representations that incorporate both local and global information.
Our model produces more precise responses for long video understanding.
arXiv Detail & Related papers (2024-04-04T11:33:29Z) - Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning [39.80936685227549]
Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks.
We introduce a Long-Form VIdeo-LAnguage pre-training model (VILA) and train it on a large-scale long-form video and paragraph dataset.
We fine-tune the model on seven downstream long-form video-language understanding tasks, achieving new state-of-the-art performance.
arXiv Detail & Related papers (2022-10-12T09:08:27Z) - Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [80.7397409377659]
We propose an end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets.
Our model is flexible and can be trained on both image and video text datasets, either independently or in conjunction.
We show that this approach yields state-of-the-art results on standard downstream video-retrieval benchmarks.
arXiv Detail & Related papers (2021-04-01T17:48:27Z)
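Several of the entries above revolve around selecting the right parts of a long video before handing them to an LMM, most directly SALOVA's segment retrieval. As a generic illustration only, and not the method of any paper listed here, the sketch below retrieves the top-k video segments for a query by cosine similarity over precomputed embeddings; the embedding dimension, the shared encoder, and the random inputs are placeholders.

```python
# Generic sketch of query-conditioned retrieval over video-segment embeddings,
# the kind of targeted retrieval that SALOVA-style assistants build on.
# Not taken from any paper above; shapes and inputs are illustrative.
import numpy as np

def top_k_segments(query_emb: np.ndarray,
                   segment_embs: np.ndarray,
                   k: int = 3) -> list[int]:
    """Return indices of the k segments most similar to the query.

    query_emb:    (d,) query embedding
    segment_embs: (num_segments, d) embeddings, one per video segment
    """
    q = query_emb / np.linalg.norm(query_emb)
    s = segment_embs / np.linalg.norm(segment_embs, axis=1, keepdims=True)
    scores = s @ q                        # cosine similarity per segment
    return np.argsort(-scores)[:k].tolist()

# Usage: embed the question and each captioned segment with any shared
# vision-language encoder, then feed only the retrieved segments to the LMM.
rng = np.random.default_rng(0)
print(top_k_segments(rng.standard_normal(512), rng.standard_normal((40, 512))))
```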
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.