MetaCaptioner: Towards Generalist Visual Captioning with Open-source Suites
- URL: http://arxiv.org/abs/2510.12126v3
- Date: Thu, 16 Oct 2025 14:57:08 GMT
- Title: MetaCaptioner: Towards Generalist Visual Captioning with Open-source Suites
- Authors: Zhenxin Lei, Zhangwei Gao, Changyao Tian, Erfei Cui, Guanzhou Chen, Danni Yang, Yuchen Duan, Zhaokai Wang, Wenhao Li, Weiyun Wang, Xiangyu Zhao, Jiayi Ji, Yu Qiao, Wenhai Wang, Gen Luo
- Abstract summary: Generalist visual captioning requires integrating a series of visual cues into a caption and handling various visual domains. This paper proposes CapFlow, a novel multi-agent collaboration workflow. By capitalizing on open-source models, it is possible to achieve caption quality on par with GPT-4.1 in various domains with an 89.5% reduction in costs.
- Score: 84.44760503711196
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generalist visual captioning goes beyond simple appearance description: it requires integrating a series of visual cues into a caption and handling various visual domains. In this task, current open-source models show a large performance gap relative to commercial ones, which limits various applications such as data synthesis. To bridge the gap, this paper proposes CapFlow, a novel multi-agent collaboration workflow. CapFlow demonstrates for the first time that, by capitalizing on open-source models, it is possible to achieve caption quality on par with GPT-4.1 in various domains with an 89.5% reduction in costs. By leveraging CapFlow as the data synthesizer, we produce high-quality visual captions from image and video domains at scale, and obtain a generalist visual captioner via fine-tuning, namely MetaCaptioner. Through extensive experiments, we show that MetaCaptioner not only achieves captioning capabilities comparable to commercial models but also reaches top-tier multimodal performance in the open-source community. We hope CapFlow and MetaCaptioner can benefit future multimodal research by providing a strong and cost-effective visual captioning solution.
Related papers
- ReCap: Event-Aware Image Captioning with Article Retrieval and Semantic Gaussian Normalization [9.914251544971686]
ReCap is a novel pipeline for event-enriched image retrieval and captioning. It incorporates broader contextual information from relevant articles to generate narrative-rich captions. Our approach addresses the limitations of standard vision-language models.
arXiv Detail & Related papers (2025-09-01T08:48:33Z)
- AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning [95.791104183341]
We present the AnyCap Project, an integrated solution spanning model, dataset, and evaluation. ACM reuses the original captions from base models while incorporating user instructions and modality features to generate improved captions. ACM markedly improves caption quality across a diverse set of base models on AnyCapEval.
arXiv Detail & Related papers (2025-07-17T07:04:05Z)
- ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing [128.8346376825612]
Key challenges of high-quality image captioning lie in the inherent biases of LVLMs. We propose a scalable debiased captioning strategy, which continuously enriches and calibrates the caption with increased inference budget. Annotating 450K images with ScaleCap and using them for LVLM pretraining leads to consistent performance gains across 11 widely used benchmarks.
arXiv Detail & Related papers (2025-06-24T17:59:55Z)
- Panoptic Captioning: Seeking An Equivalency Bridge for Image and Text [15.64048708183143]
This work introduces panoptic captioning, a novel task striving to seek the minimum text equivalence of images. We propose an effective data engine named PancapEngine to produce high-quality data and a novel method named PancapChain to improve panoptic captioning. Our PancapChain-13B model can beat state-of-the-art open-source MLLMs like InternVL-2.5-78B and even surpass proprietary models like GPT-4o and Gemini-2.0-Pro.
arXiv Detail & Related papers (2025-05-22T07:44:10Z)
- OmniCaptioner: One Captioner to Rule Them All [33.98387155732322]
We propose a versatile visual captioning framework for generating fine-grained textual descriptions. By converting low-level pixel information into semantically rich textual representations, our framework bridges the gap between visual and textual modalities. We believe the versatility and adaptability of OmniCaptioner can offer a new perspective for bridging the gap between language and visual modalities.
arXiv Detail & Related papers (2025-04-09T17:58:58Z)
- AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark [89.73538448786405]
We propose AuroraCap, a video captioner based on a large multimodal model. We implement the token merging strategy, reducing the number of input visual tokens. AuroraCap shows superior performance on various video and image captioning benchmarks.
arXiv Detail & Related papers (2024-10-04T00:13:54Z)
- Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models [63.01630478059315]
Recent advancements in multimodal models highlight the value of rewritten captions for improving performance.
However, the role of synthetic captions and their interaction with original web-crawled AltTexts in pre-training is still not well understood.
We propose a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models.
arXiv Detail & Related papers (2024-10-03T17:54:52Z)
- SnapCap: Efficient Snapshot Compressive Video Captioning [18.016261978231835]
Video Captioning (VC) is a challenging multi-modal task since it requires describing the scene in language by understanding various and complex videos.
In this paper, we propose a novel VC pipeline to generate captions directly from the compressed measurement, which can be captured by a snapshot compressive sensing camera.
To better extract language-related visual representations from the compressed measurement, we propose to distill the knowledge from videos via a pre-trained CLIP.
arXiv Detail & Related papers (2024-01-10T03:11:21Z)
- Few-shot Action Recognition with Captioning Foundation Models [61.40271046233581]
CapFSAR is a framework to exploit knowledge of multimodal models without manually annotating text.
A Transformer-based visual-text aggregation module is further designed to incorporate cross-modal-temporal complementary information.
Experiments on multiple standard few-shot benchmarks demonstrate that the proposed CapFSAR performs favorably against existing methods.
arXiv Detail & Related papers (2023-10-16T07:08:39Z)