OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding
- URL: http://arxiv.org/abs/2407.04923v1
- Date: Sat, 6 Jul 2024 02:16:10 GMT
- Title: OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding
- Authors: Tiancheng Zhao, Qianqian Zhang, Kyusong Lee, Peng Liu, Lu Zhang, Chunxin Fang, Jiajia Liao, Kelei Jiang, Yibo Ma, Ruochen Xu,
- Abstract summary: OmChat is a model designed to excel in handling long contexts and video understanding tasks.
It uses a dynamic vision encoding process to effectively handle images of various resolutions, capturing fine details across a range of image qualities.
With support for a context length of up to 512K, OmChat demonstrates promising performance in tasks involving multiple images and videos.
- Score: 34.17871202332497
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce OmChat, a model designed to excel in handling long contexts and video understanding tasks. OmChat's new architecture standardizes how different visual inputs are processed, making it more efficient and adaptable. It uses a dynamic vision encoding process to effectively handle images of various resolutions, capturing fine details across a range of image qualities. OmChat utilizes an active progressive multimodal pretraining strategy, which gradually increases the model's capacity for long contexts and enhances its overall abilities. By selecting high-quality data during training, OmChat learns from the most relevant and informative data points. With support for a context length of up to 512K, OmChat demonstrates promising performance in tasks involving multiple images and videos, outperforming most open-source models in these benchmarks. Additionally, OmChat proposes a prompting strategy for unifying complex multimodal inputs, including single-image text, multi-image text, and videos, and achieves competitive performance on single-image benchmarks. To further evaluate the model's capabilities, we propose a benchmark dataset named Temporal Visual Needle in a Haystack. This dataset assesses OmChat's ability to comprehend temporal visual details within long videos. Our analysis highlights several key factors contributing to OmChat's success: support for any-aspect high image resolution, the active progressive pretraining strategy, and high-quality supervised fine-tuning datasets. This report provides a detailed overview of OmChat's capabilities and the strategies that enhance its performance in visual understanding.
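The abstract names a Temporal Visual Needle in a Haystack benchmark but does not describe how it is built. The sketch below shows only the generic needle-in-a-haystack idea, assuming a video is a list of frame identifiers: a distinctive frame is inserted at a chosen relative depth, and a model is scored on whether it locates that frame. The names `HaystackItem`, `insert_needle`, and `retrieval_accuracy` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HaystackItem:
    frames: List[str]   # frame identifiers in temporal order
    needle_index: int   # position at which the needle frame was inserted


def insert_needle(haystack: List[str], needle: str, depth: float) -> HaystackItem:
    """Insert `needle` at relative `depth` (0.0 = start, 1.0 = end).

    Assumption: depth maps linearly onto frame positions; the paper's
    actual construction may differ.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    index = round(depth * len(haystack))
    frames = haystack[:index] + [needle] + haystack[index:]
    return HaystackItem(frames=frames, needle_index=index)


def retrieval_accuracy(predicted: List[int], items: List[HaystackItem]) -> float:
    """Fraction of items whose needle position was predicted exactly."""
    if not items:
        return 0.0
    hits = sum(p == item.needle_index for p, item in zip(predicted, items))
    return hits / len(items)
```

Sweeping `depth` and the haystack length gives a grid of retrieval scores, which is the usual way such benchmarks show where in a long input a model loses track of details.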
Related papers
- Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge [57.01131456894516]
Current video understanding models struggle with processing long video sequences, supporting multi-turn dialogues, and adapting to real-world dynamic scenarios.
We propose StreamChat, a training-free framework for streaming video reasoning and conversational interaction.
Our framework incorporates a parallel system scheduling strategy that enhances processing speed and reduces latency, ensuring robust performance in real-world applications.
arXiv Detail & Related papers (2025-01-23T08:33:10Z)
- VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling [43.485687038460895]
This paper introduces a Hierarchical visual token Compression (HiCo) method designed for high-fidelity representation.
HiCo capitalizes on the redundancy of visual information in long videos to compress long video context from the clip-level to the video-level.
VideoChat-Flash shows the leading performance on both mainstream long and short video benchmarks at the 2B and 7B model scale.
arXiv Detail & Related papers (2024-12-31T18:01:23Z)
- Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
Leopard is a vision-language model for handling vision-language tasks involving multiple text-rich images.
First, we curated about one million high-quality multimodal instruction-tuning examples tailored to text-rich, multi-image scenarios.
Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
arXiv Detail & Related papers (2024-10-02T16:55:01Z)
- Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input [34.50993235961505]
Kangaroo is a powerful Video LMM aimed at addressing the challenges of processing long videos.
It uses a data curation system to build a large-scale dataset with high-quality annotations for vision-language pre-training and instruction tuning.
It also uses a curriculum training pipeline with gradually increasing resolution and number of input frames to accommodate long videos.
arXiv Detail & Related papers (2024-08-28T05:34:14Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training [119.03392147066093]
Recent autoregressive vision-language models have excelled in few-shot text generation tasks but face challenges in alignment tasks.
We introduce the contrastive loss into text generation models, partitioning the language model into dedicated unimodal text processing and adept multimodal data handling components.
This work also introduces Howto-Interlink7M, an inaugural interleaved video-text dataset featuring comprehensive captions.
arXiv Detail & Related papers (2024-01-01T18:58:42Z)
- TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding [20.037781644877388]
TimeChat is a time-sensitive multimodal large language model specifically designed for long video understanding.
Our model incorporates two key architectural contributions: (1) a timestamp-aware frame encoder that binds visual content with the timestamp of each frame, and (2) a sliding video Q-Former that produces a video token sequence of varying lengths.
arXiv Detail & Related papers (2023-12-04T17:09:52Z)
- Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding [55.65727739645824]
Chat-UniVi is a Unified Vision-language model capable of comprehending and engaging in conversations involving images and videos.
We employ a set of dynamic visual tokens to uniformly represent images and videos.
We leverage a multi-scale representation, enabling the model to perceive both high-level semantic concepts and low-level visual details.
arXiv Detail & Related papers (2023-11-14T10:11:36Z)
- Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models [60.81438804824749]
Multimodal instruction-following models extend capabilities by integrating both text and images.
Existing models such as MiniGPT-4 and LLaVA face challenges in maintaining dialogue coherence in scenarios involving multiple images.
We introduce SparklesDialogue, the first machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions.
We then present SparklesChat, a multimodal instruction-following model for open-ended dialogues across multiple images.
arXiv Detail & Related papers (2023-08-31T05:15:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.