LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models
- URL: http://arxiv.org/abs/2502.14834v1
- Date: Thu, 20 Feb 2025 18:47:36 GMT
- Title: LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models
- Authors: Shangqing Tu, Yucheng Wang, Daniel Zhang-Li, Yushi Bai, Jifan Yu, Yuhao Wu, Lei Hou, Huiqin Liu, Zhiyuan Liu, Bin Xu, Juanzi Li
- Abstract summary: LongWriter-V-22k is a dataset of 22,158 examples with multiple input images, an instruction, and corresponding outputs ranging from 0 to 10,000 words.
We propose IterDPO, which breaks long outputs into segments and uses iterative corrections to form preference pairs with the original outputs.
Our 7B parameter model, trained with LongWriter-V-22k and IterDPO, achieves impressive performance on MMLongBench-Write, outperforming larger proprietary models such as GPT-4o.
- Score: 60.79418872734049
- Abstract: Existing Large Vision-Language Models (LVLMs) can process inputs with context lengths up to 128k visual and text tokens, yet they struggle to generate coherent outputs beyond 1,000 words. We find that the primary limitation is the absence of long output examples during supervised fine-tuning (SFT). To tackle this issue, we introduce LongWriter-V-22k, an SFT dataset comprising 22,158 examples, each with multiple input images, an instruction, and corresponding outputs ranging from 0 to 10,000 words. Moreover, to achieve long outputs that maintain high fidelity to the input images, we apply Direct Preference Optimization (DPO) to the SFT model. Given the high cost of collecting human feedback for lengthy outputs (e.g., 3,000 words), we propose IterDPO, which breaks long outputs into segments and uses iterative corrections to form preference pairs with the original outputs. Additionally, we develop MMLongBench-Write, a benchmark featuring six tasks to evaluate the long-generation capabilities of VLMs. Our 7B parameter model, trained with LongWriter-V-22k and IterDPO, achieves impressive performance on this benchmark, outperforming larger proprietary models like GPT-4o. Code and data: https://github.com/THU-KEG/LongWriter-V
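The abstract describes IterDPO only at a high level, so the following is a minimal, hypothetical sketch of how segment-level preference pairs could be assembled and scored with a standard DPO objective; it is not the authors' released code. The helpers `split_into_segments` and `correct_segment`, and the choice to condition each pair on the already-corrected prefix, are illustrative assumptions.

```python
# Hypothetical sketch of IterDPO-style preference-pair construction
# (an assumption based on the abstract, not the authors' implementation).
from dataclasses import dataclass
from typing import Callable, List

import torch.nn.functional as F


@dataclass
class PreferencePair:
    prompt: str    # instruction plus the already-corrected prefix
    chosen: str    # corrected segment
    rejected: str  # original segment


def build_iterdpo_pairs(
    instruction: str,
    long_output: str,
    split_into_segments: Callable[[str], List[str]],  # assumed helper, e.g. split by paragraphs
    correct_segment: Callable[[str, str], str],       # assumed corrector (human or stronger model)
) -> List[PreferencePair]:
    """Break a long output into segments and pair each original segment
    with its corrected version, conditioning on the corrected prefix."""
    pairs, prefix = [], instruction
    for segment in split_into_segments(long_output):
        corrected = correct_segment(prefix, segment)
        if corrected != segment:
            pairs.append(PreferencePair(prompt=prefix, chosen=corrected, rejected=segment))
        prefix = prefix + "\n" + corrected  # later segments continue from the corrected text
    return pairs


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Standard DPO objective applied to each segment-level pair."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -F.logsigmoid(margin).mean()
```

Collecting feedback per segment is what keeps annotation tractable for multi-thousand-word responses, since each preference pair covers only a short span of text.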
Related papers
- Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy [111.1291107651131]
Long-VITA is a large multi-modal model for long-context visual-language understanding tasks.
It can concurrently process and analyze image, video, and text modalities over 4K frames or 1M tokens.
Long-VITA is fully reproducible and supports both NPU and GPU platforms for training and testing.
arXiv Detail & Related papers (2025-02-07T18:59:56Z) - LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation [74.89981179257194]
LongProc (Long Procedural Generation) is a new benchmark for evaluating long-context language models (LCLMs).
LongProc consists of six diverse procedural generation tasks, such as extracting structured information from HTML pages into a TSV format and executing complex search procedures to create travel plans.
We evaluate 17 LCLMs on LongProc across three difficulty levels, with maximum numbers of output tokens set at 500, 2K, and 8K. Notably, while all tested models claim a context window size above 32K tokens, open-weight models typically falter on 2K-token tasks, and closed-source models such as GPT-4o show significant degradation on 8K-token tasks.
arXiv Detail & Related papers (2025-01-09T18:16:55Z) - LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs [57.23637303451716]
Long context large language models (LLMs) can process inputs up to 100,000 tokens, yet struggle to generate outputs exceeding even a modest length of 2,000 words.
We introduce AgentWrite, an agent-based pipeline that decomposes ultra-long generation tasks into subtasks (a minimal plan-then-write sketch appears after this list).
We construct LongWriter-6k, a dataset containing 6,000 SFT examples with output lengths ranging from 2k to 32k words.
arXiv Detail & Related papers (2024-08-13T17:46:12Z) - Long Context Transfer from Language to Vision [74.78422371545716]
Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos.
In this paper, we approach this problem from the perspective of the language model.
By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training.
arXiv Detail & Related papers (2024-06-24T17:58:06Z) - LongAlign: A Recipe for Long Context Alignment of Large Language Models [61.85923382850057]
LongAlign is a recipe for instruction data construction, training, and evaluation for long-context alignment.
We construct a long instruction-following dataset using Self-Instruct.
We adopt the packing and sorted batching strategies to speed up supervised fine-tuning on data with varied length distributions (a sorted-batching sketch appears after this list).
arXiv Detail & Related papers (2024-01-31T18:29:39Z)
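The LongWriter entry above describes AgentWrite only as a pipeline that decomposes ultra-long writing into subtasks. The sketch below is a generic plan-then-write reading of that idea, not the paper's actual prompts or agent code; `generate` stands in for any chat-model completion call, and the outline format is an assumption.

```python
# Hypothetical plan-then-write sketch of an AgentWrite-style pipeline.
# `generate` is a placeholder for any chat-model call; prompts and the
# outline format are illustrative assumptions, not the paper's prompts.
from typing import Callable, List


def agent_write(task: str, generate: Callable[[str], str]) -> str:
    # 1) Plan: ask the model for a numbered outline, one section per line.
    plan_prompt = (
        "Break the following writing task into numbered sections, one per line, "
        "each with a rough word budget:\n" + task
    )
    outline: List[str] = [line for line in generate(plan_prompt).splitlines() if line.strip()]

    # 2) Write: generate each section conditioned on the task, the full outline,
    #    and everything written so far, then concatenate the sections.
    written: List[str] = []
    for section in outline:
        write_prompt = (
            f"Task: {task}\nOutline:\n" + "\n".join(outline)
            + "\n\nAlready written:\n" + "\n\n".join(written)
            + f"\n\nNow write only this section: {section}"
        )
        written.append(generate(write_prompt))
    return "\n\n".join(written)
```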
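The LongAlign entry mentions packing and sorted batching for SFT on mixed-length data without further detail. Below is one common form of sorted batching, shown as a hedged illustration (sort by length so each batch wastes little padding); the token budget and greedy batch cut are assumptions, not LongAlign's exact recipe.

```python
# Hypothetical sorted-batching sketch: sort examples by token length, then
# cut batches greedily so the padded token count stays under a budget.
from typing import Dict, List


def sorted_batches(examples: List[Dict], max_tokens_per_batch: int = 8192) -> List[List[Dict]]:
    """Each example is expected to carry a precomputed 'length' (token count)."""
    ordered = sorted(examples, key=lambda ex: ex["length"])
    batches: List[List[Dict]] = []
    current: List[Dict] = []
    longest = 0
    for ex in ordered:
        longest = max(longest, ex["length"])
        # Padded cost of adding this example = (batch size + 1) * longest sequence.
        if current and (len(current) + 1) * longest > max_tokens_per_batch:
            batches.append(current)
            current, longest = [], ex["length"]
        current.append(ex)
    if current:
        batches.append(current)
    return batches
```

Because consecutive batches then contain similarly sized sequences, far fewer pad tokens are processed than with random batching; packing goes one step further by concatenating short examples into a single sequence.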