LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models
- URL: http://arxiv.org/abs/2502.14834v1
- Date: Thu, 20 Feb 2025 18:47:36 GMT
- Title: LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models
- Authors: Shangqing Tu, Yucheng Wang, Daniel Zhang-Li, Yushi Bai, Jifan Yu, Yuhao Wu, Lei Hou, Huiqin Liu, Zhiyuan Liu, Bin Xu, Juanzi Li
- Abstract summary: LongWriter-V-22k is a dataset of 22,158 examples with multiple input images, an instruction, and corresponding outputs ranging from 0 to 10,000 words.
We propose IterDPO, which breaks long outputs into segments and uses iterative corrections to form preference pairs with the original outputs.
Our 7B parameter model, trained with LongWriter-V-22k and IterDPO, achieves impressive performance on MMLongBench-Write, outperforming larger proprietary models such as GPT-4o.
- Score: 60.79418872734049
- Abstract: Existing Large Vision-Language Models (LVLMs) can process inputs with context lengths up to 128k visual and text tokens, yet they struggle to generate coherent outputs beyond 1,000 words. We find that the primary limitation is the absence of long output examples during supervised fine-tuning (SFT). To tackle this issue, we introduce LongWriter-V-22k, an SFT dataset comprising 22,158 examples, each with multiple input images, an instruction, and corresponding outputs ranging from 0 to 10,000 words. Moreover, to achieve long outputs that maintain high fidelity to the input images, we apply Direct Preference Optimization (DPO) to the SFT model. Given the high cost of collecting human feedback for lengthy outputs (e.g., 3,000 words), we propose IterDPO, which breaks long outputs into segments and uses iterative corrections to form preference pairs with the original outputs. Additionally, we develop MMLongBench-Write, a benchmark featuring six tasks to evaluate the long-generation capabilities of VLMs. Our 7B parameter model, trained with LongWriter-V-22k and IterDPO, achieves impressive performance on this benchmark, outperforming larger proprietary models like GPT-4o. Code and data: https://github.com/THU-KEG/LongWriter-V
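The abstract describes IterDPO only at a high level, so the following is a minimal, hypothetical sketch of how segment-level preference pairs could be assembled and scored with a standard DPO objective; it is not the authors' released code. The helpers `split_into_segments` and `correct_segment`, and the choice to condition each pair on the already-corrected prefix, are illustrative assumptions.

```python
# Hypothetical sketch of IterDPO-style preference-pair construction
# (an assumption based on the abstract, not the authors' implementation).
from dataclasses import dataclass
from typing import Callable, List

import torch.nn.functional as F


@dataclass
class PreferencePair:
    prompt: str    # instruction plus the already-corrected prefix
    chosen: str    # corrected segment
    rejected: str  # original segment


def build_iterdpo_pairs(
    instruction: str,
    long_output: str,
    split_into_segments: Callable[[str], List[str]],  # assumed helper, e.g. split by paragraphs
    correct_segment: Callable[[str, str], str],       # assumed corrector (human or stronger model)
) -> List[PreferencePair]:
    """Break a long output into segments and pair each original segment
    with its corrected version, conditioning on the corrected prefix."""
    pairs, prefix = [], instruction
    for segment in split_into_segments(long_output):
        corrected = correct_segment(prefix, segment)
        if corrected != segment:
            pairs.append(PreferencePair(prompt=prefix, chosen=corrected, rejected=segment))
        prefix = prefix + "\n" + corrected  # later segments continue from the corrected text
    return pairs


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Standard DPO objective applied to each segment-level pair."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -F.logsigmoid(margin).mean()
```

Collecting feedback per segment is what keeps annotation tractable for multi-thousand-word responses, since each preference pair covers only a short span of text.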
Related papers
- Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy [111.1291107651131]
Long-VITA is a large multi-modal model for long-context visual-language understanding tasks.
It can concurrently process and analyze image, video, and text modalities over 4K frames or 1M tokens.
Long-VITA is fully reproducible and supports both NPU and GPU platforms for training and testing.
arXiv Detail & Related papers (2025-02-07T18:59:56Z) - LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation [74.89981179257194]
LongProc (Long Procedural Generation) is a new benchmark for evaluating long-context language models (LCLMs).
LongProc consists of six diverse procedural generation tasks, such as extracting structured information from HTML pages into a TSV format and executing complex search procedures to create travel plans.
We evaluate 17 LCLMs on LongProc across three difficulty levels, with maximum numbers of output tokens set at 500, 2K, and 8K. Notably, while all tested models claim a context window size above 32K tokens, open-weight models typically falter on 2K-token tasks, and closed-source models such as GPT-4o show significant degradation on 8K-token tasks.
arXiv Detail & Related papers (2025-01-09T18:16:55Z) - LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs [57.23637303451716]
Long context large language models (LLMs) can process inputs up to 100,000 tokens, yet struggle to generate outputs exceeding even a modest length of 2,000 words.
We introduce AgentWrite, an agent-based pipeline that decomposes ultra-long generation tasks into subtasks (a minimal plan-then-write sketch appears after this list).
We construct LongWriter-6k, a dataset containing 6,000 SFT examples with output lengths ranging from 2k to 32k words.
arXiv Detail & Related papers (2024-08-13T17:46:12Z) - Long Context Transfer from Language to Vision [74.78422371545716]
Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos.
In this paper, we approach this problem from the perspective of the language model.
By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training.
arXiv Detail & Related papers (2024-06-24T17:58:06Z) - LongAlign: A Recipe for Long Context Alignment of Large Language Models [61.85923382850057]
LongAlign is a recipe for instruction data construction, training, and evaluation for long-context alignment.
We construct a long instruction-following dataset using Self-Instruct.
We adopt the packing and sorted batching strategies to speed up supervised fine-tuning on data with varied length distributions (a sorted-batching sketch appears after this list).
arXiv Detail & Related papers (2024-01-31T18:29:39Z)
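The LongWriter entry above describes AgentWrite only as a pipeline that decomposes ultra-long writing into subtasks. The sketch below is a generic plan-then-write reading of that idea, not the paper's actual prompts or agent code; `generate` stands in for any chat-model completion call, and the outline format is an assumption.

```python
# Hypothetical plan-then-write sketch of an AgentWrite-style pipeline.
# `generate` is a placeholder for any chat-model call; prompts and the
# outline format are illustrative assumptions, not the paper's prompts.
from typing import Callable, List


def agent_write(task: str, generate: Callable[[str], str]) -> str:
    # 1) Plan: ask the model for a numbered outline, one section per line.
    plan_prompt = (
        "Break the following writing task into numbered sections, one per line, "
        "each with a rough word budget:\n" + task
    )
    outline: List[str] = [line for line in generate(plan_prompt).splitlines() if line.strip()]

    # 2) Write: generate each section conditioned on the task, the full outline,
    #    and everything written so far, then concatenate the sections.
    written: List[str] = []
    for section in outline:
        write_prompt = (
            f"Task: {task}\nOutline:\n" + "\n".join(outline)
            + "\n\nAlready written:\n" + "\n\n".join(written)
            + f"\n\nNow write only this section: {section}"
        )
        written.append(generate(write_prompt))
    return "\n\n".join(written)
```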
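The LongAlign entry mentions packing and sorted batching for SFT on mixed-length data without further detail. Below is one common form of sorted batching, shown as a hedged illustration (sort by length so each batch wastes little padding); the token budget and greedy batch cut are assumptions, not LongAlign's exact recipe.

```python
# Hypothetical sorted-batching sketch: sort examples by token length, then
# cut batches greedily so the padded token count stays under a budget.
from typing import Dict, List


def sorted_batches(examples: List[Dict], max_tokens_per_batch: int = 8192) -> List[List[Dict]]:
    """Each example is expected to carry a precomputed 'length' (token count)."""
    ordered = sorted(examples, key=lambda ex: ex["length"])
    batches: List[List[Dict]] = []
    current: List[Dict] = []
    longest = 0
    for ex in ordered:
        longest = max(longest, ex["length"])
        # Padded cost of adding this example = (batch size + 1) * longest sequence.
        if current and (len(current) + 1) * longest > max_tokens_per_batch:
            batches.append(current)
            current, longest = [], ex["length"]
        current.append(ex)
    if current:
        batches.append(current)
    return batches
```

Because consecutive batches then contain similarly sized sequences, far fewer pad tokens are processed than with random batching; packing goes one step further by concatenating short examples into a single sequence.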