Related papers: FullAnno: A Data Engine for Enhancing Image Comprehension of MLLMs

URL: http://arxiv.org/abs/2409.13540v1
Date: Fri, 20 Sep 2024 14:33:17 GMT
Title: FullAnno: A Data Engine for Enhancing Image Comprehension of MLLMs
Authors: Jing Hao, Yuxiang Zhao, Song Chen, Yanpeng Sun, Qiang Chen, Gang Zhang, Kun Yao, Errui Ding, Jingdong Wang
Abstract summary: FullAnno is a data engine that generates large-scale, high-quality, and fine-grained image annotations. We re-annotated the COCO and Visual Genome datasets using our FullAnno system. Experiments show that the regenerated annotation can significantly enhance the capabilities of LLaVA-v1.5 on several benchmarks.
Score: 58.95386070800286
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Large Language Models (MLLMs) have shown promise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities. However, they heavily depend on high-quality data in the Supervised Fine-Tuning (SFT) phase. The existing approaches aim to curate high-quality data via GPT-4V, but they are not scalable due to the commercial nature of GPT-4V and the simplicity of the prompts used to instruct the model. To this end, we devised the FullAnno system, which is a data engine that can generate large-scale, high-quality, and fine-grained image annotations consisting of the category and position of objects, region descriptions, text information, as well as image dense captions. This engine is characterized by its cascade annotation process, which involves multiple expert models and employs rich prompts to instruct LLMs in generating dense image captions. We re-annotated the COCO and Visual Genome datasets using our FullAnno system, tripling the number of object annotations and increasing the length of the original image captions by a factor of 15. Experiments show that the regenerated annotation can significantly enhance the capabilities of LLaVA-v1.5 on several benchmarks. The re-annotated data are available at: https://arcana-project-page.github.io

Related papers

HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models [15.877790469608662]
We introduce an LVLM-driven data refinement pipeline to enhance the quality of image-text pair data. We propose a training paradigm that extends conventional contrastive learning by incorporating negative descriptions and short tags. Our approach achieves state-of-the-art performance in zero-shot classification, cross-modal retrieval, and fine-grained visual understanding tasks.
arXiv Detail & Related papers (2025-07-30T07:21:36Z)
Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception [10.377899615199278]
High-quality image captions play a crucial role in improving the performance of cross-modal applications. Recent studies have employed multimodal large language models (MLLMs) to generate captions. However, current MLLMs often produce captions that lack fine-grained details or suffer from hallucinations.
arXiv Detail & Related papers (2025-04-09T08:07:46Z)
Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning [56.31096024472269]
We introduce DeCapBench along with a novel metric, DCScore, specifically designed for detailed captioning tasks. DCScore evaluates hallucinations and fine-grained comprehensiveness by deconstructing responses into the smallest self-sufficient units. DeCapBench exhibits a high correlation with VLM arena results on descriptive tasks, surpassing existing benchmarks for vision-language models.
arXiv Detail & Related papers (2025-03-10T22:53:56Z)
LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models [44.578308186225826]
Recent open-vocabulary detectors achieve promising performance with abundant region-level annotated data. We show that co-training an open-vocabulary detector with a large language model, by generating image-level detailed captions for each image, can further improve performance.
arXiv Detail & Related papers (2025-01-31T08:27:31Z)
AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding [96.01726275876548]
We present AdaptVision, a multimodal large language model specifically designed to dynamically process input images at varying resolutions. We devise a dynamic image partitioning module that adjusts the number of visual tokens according to the size and aspect ratio of images. Our model is capable of processing images with resolutions up to $1008 \times 1008$.
arXiv Detail & Related papers (2024-08-30T03:16:49Z)
VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC). This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions. In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
Benchmarking and Improving Detail Image Caption [12.078715675876674]
Image captioning has long been regarded as a fundamental task in visual understanding. We propose to benchmark the detail image caption task by curating high-quality evaluation datasets annotated by human experts. We also design a more reliable caption evaluation metric called CAPTURE.
arXiv Detail & Related papers (2024-05-29T13:54:12Z)
MLLMs-Augmented Visual-Language Representation Learning [70.5293060238008]
We demonstrate that Multi-modal Large Language Models (MLLMs) can enhance visual-language representation learning. Our approach is simple, utilizing MLLMs to extend multiple diverse captions for each image. We propose "text shearing" to maintain the quality and availability of extended captions.
arXiv Detail & Related papers (2023-11-30T18:05:52Z)
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs [49.88461345825586]
This paper proposes a new framework to enhance the fine-grained image understanding abilities of MLLMs. We present a new method for constructing the instruction tuning dataset at a low cost by leveraging annotations in existing datasets. We show that our model exhibits a 5.2% accuracy improvement over Qwen-VL and surpasses the accuracy of Kosmos-2 by 24.7%.
arXiv Detail & Related papers (2023-10-01T05:53:15Z)
Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models [50.07056960586183]
We propose Position-enhanced Visual Instruction Tuning (PVIT) to extend the functionality of Multimodal Large Language Models (MLLMs). This integration promotes a more detailed comprehension of images for the MLLM. We present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model.
arXiv Detail & Related papers (2023-08-25T15:33:47Z)
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models [41.84885546518666]
GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text. We present MiniGPT-4, which aligns a frozen visual encoder with a frozen advanced large language model. We also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images.
arXiv Detail & Related papers (2023-04-20T18:25:35Z)

This list is automatically generated from the titles and abstracts of the papers on this site.