FigCaps-HF: A Figure-to-Caption Generative Framework and Benchmark with
Human Feedback
- URL: http://arxiv.org/abs/2307.10867v1
- Date: Thu, 20 Jul 2023 13:40:22 GMT
- Title: FigCaps-HF: A Figure-to-Caption Generative Framework and Benchmark with
Human Feedback
- Authors: Ashish Singh, Prateek Agarwal, Zixuan Huang, Arpita Singh, Tong Yu,
Sungchul Kim, Victor Bursztyn, Nikos Vlassis, Ryan A. Rossi
- Abstract summary: FigCaps-HF is a new framework for figure-caption generation that incorporates domain expert feedback in generating captions optimized for reader preferences.
Our framework comprises 1) an automatic method for evaluating the quality of figure-caption pairs and 2) a novel reinforcement learning with human feedback (RLHF) method to optimize a generative figure-to-caption model for reader preferences.
- Score: 31.230023678131843
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Captions are crucial for understanding scientific visualizations and
documents. Existing captioning methods for scientific figures rely on
figure-caption pairs extracted from documents for training, many of which fall
short with respect to metrics like helpfulness, explainability, and
visual-descriptiveness [15], leading to generated captions that are misaligned with
reader preferences. To enable the generation of high-quality figure captions,
we introduce FigCaps-HF, a new framework for figure-caption generation that can
incorporate domain expert feedback to generate captions optimized for reader
preferences. Our framework comprises 1) an automatic method for evaluating the
quality of figure-caption pairs and 2) a novel reinforcement learning with human
feedback (RLHF) method to optimize a generative figure-to-caption model for
reader preferences. We demonstrate the effectiveness of our simple learning
framework by improving performance over standard fine-tuning across different
types of models. In particular, when using BLIP as the base model, our RLHF
framework achieves a mean gain of 35.7%, 16.9%, and 9% in ROUGE, BLEU, and
Meteor, respectively. Finally, we release a large-scale benchmark dataset with
human feedback on figure-caption pairs to enable further evaluation and
development of RLHF techniques for this problem.
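The two components above, an automatic quality scorer for figure-caption pairs and an RLHF-style optimization of the caption generator, can be pictured with a short sketch. The code below is a minimal illustration only: `caption_model`, `quality_model`, and the reward-weighted loss are hypothetical placeholders standing in for the paper's BLIP-based generator and learned feedback model, not the released FigCaps-HF implementation.

```python
# Illustrative sketch of reward-weighted fine-tuning for figure captioning.
# NOTE: caption_model, quality_model, and the batch format are hypothetical
# placeholders, not the FigCaps-HF reference implementation.
import torch

def rlhf_finetune_step(caption_model, quality_model, optimizer, batch):
    """One reward-weighted update on a batch of (figure, caption) pairs."""
    figures, captions = batch["figures"], batch["captions"]

    # 1) Score each figure-caption pair with the learned quality model,
    #    which acts as an automatic proxy for human/expert feedback.
    with torch.no_grad():
        rewards = quality_model(figures, captions)               # shape: (B,)
        weights = torch.softmax(rewards, dim=0) * len(rewards)   # mean weight 1

    # 2) Per-example caption negative log-likelihood from the generator.
    per_example_loss = caption_model.caption_loss(figures, captions)  # (B,)

    # 3) Up-weight pairs the feedback model prefers, down-weight the rest.
    loss = (weights * per_example_loss).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the paper, generated captions are compared against reference captions with ROUGE, BLEU, and METEOR; the softmax normalization of the weights above is an illustrative choice, not a detail taken from the paper.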
Related papers
- Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning [56.31096024472269]
We introduce DeCapBench along with a novel metric, DCScore, specifically designed for detailed captioning tasks.
DCScore evaluates hallucinations and fine-grained comprehensiveness by deconstructing responses into the smallest self-sufficient units.
DeCapBench exhibits a high correlation with VLM arena results on descriptive tasks, surpassing existing benchmarks for vision-language models.
arXiv Detail & Related papers (2025-03-10T22:53:56Z)
- Retrieval Backward Attention without Additional Training: Enhance Embeddings of Large Language Models via Repetition [4.249842620609683]
This paper focuses on improving the performance of pre-trained language models in zero-shot settings through a simple and easily implementable method.
We propose a novel backward attention mechanism to enhance contextual information encoding.
arXiv Detail & Related papers (2025-02-28T05:19:18Z)
- TypeScore: A Text Fidelity Metric for Text-to-Image Generative Models [39.06617653124486]
We introduce a new evaluation framework called TypeScore to assess a model's ability to generate images with high-fidelity embedded text.
Our proposed metric demonstrates greater resolution than CLIPScore in differentiating popular image generation models.
arXiv Detail & Related papers (2024-11-02T07:56:54Z)
- CLIP-SCGI: Synthesized Caption-Guided Inversion for Person Re-Identification [9.996589403019675]
Person re-identification (ReID) has recently benefited from large pretrained vision-language models such as Contrastive Language-Image Pre-Training (CLIP).
We propose a straightforward solution: leveraging existing image captioning models to generate pseudo captions for person images.
We introduce CLIP-SCGI, a framework that leverages synthesized captions to guide the learning of discriminative and robust representations.
arXiv Detail & Related papers (2024-10-12T06:24:33Z)
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
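As a generic illustration of retrieval by visual similarity (not the specific architecture proposed in that paper), the sketch below returns the captions of the k most visually similar indexed images; the feature encoder and the contents of the index are assumed.

```python
# Generic k-nearest-neighbour caption retrieval by visual similarity.
# The feature index and its encoder are assumed; this is not the paper's model.
import torch
import torch.nn.functional as F

def retrieve_captions(query_features, index_features, index_captions, k=5):
    """Return captions of the k indexed images most similar to the query.

    query_features: (D,) feature vector of the query image
    index_features: (N, D) feature matrix of the indexed images
    index_captions: list of N caption strings aligned with index_features
    """
    sims = F.cosine_similarity(query_features.unsqueeze(0), index_features, dim=1)
    top_k = torch.topk(sims, k=min(k, len(index_captions))).indices
    return [index_captions[i] for i in top_k.tolist()]
```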
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning [69.23782518456932]
We propose a novel zero-shot video captioning framework named Retrieval-Enhanced Test-Time Adaptation (RETTA).
We bridge video and text using four key models: a general video-text retrieval model XCLIP, a general image-text matching model CLIP, a text alignment model AnglE, and a text generation model GPT-2.
To address this problem, we propose using learnable tokens as a communication medium among these four frozen models (GPT-2, XCLIP, CLIP, and AnglE).
arXiv Detail & Related papers (2024-05-11T16:22:00Z)
- Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
- Confidence-aware Reward Optimization for Fine-tuning Text-to-Image Models [85.96013373385057]
Fine-tuning text-to-image models with reward functions trained on human feedback data has proven effective for aligning model behavior with human intent.
However, excessive optimization with such reward models, which serve as mere proxy objectives, can compromise the performance of fine-tuned models.
We propose TextNorm, a method that enhances alignment based on a measure of reward model confidence estimated across a set of semantically contrastive text prompts.
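The general recipe of calibrating a reward with semantically contrastive prompts can be sketched as follows; the softmax normalization and the `reward_model` interface here are assumptions for illustration, not necessarily the exact TextNorm formulation.

```python
# Illustrative confidence-aware reward: compare the reward for the target
# prompt against rewards for semantically contrastive prompts.
# reward_model is a placeholder; the exact TextNorm formulation may differ.
import torch

def confidence_normalized_reward(reward_model, image, target_prompt,
                                 contrast_prompts, tau=1.0):
    prompts = [target_prompt] + list(contrast_prompts)
    rewards = torch.stack([reward_model(image, p) for p in prompts])  # (1 + C,)
    # Softmax over the prompt set: the target's share acts as a confidence-
    # calibrated score, shrinking rewards the model cannot tell apart.
    probs = torch.softmax(rewards / tau, dim=0)
    return probs[0]
```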
arXiv Detail & Related papers (2024-04-02T11:40:38Z)
- Filter & Align: Leveraging Human Knowledge to Curate Image-Text Data [31.507451966555383]
We present a novel algorithm that incorporates human knowledge on image-text alignment to guide the filtering of vast corpora of web-crawled image-text data.
We collect a diverse image-text dataset where each image is associated with multiple captions from various sources.
We train a reward model on these human-preference annotations to internalize the nuanced human understanding of image-text alignment.
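A standard way to internalize such pairwise preference annotations is a Bradley-Terry-style reward-model loss; the snippet below sketches that generic recipe with a placeholder `reward_model`, and is not the paper's actual training code.

```python
# Generic pairwise reward-model objective on human preference annotations:
# the preferred (image, caption) pair should score higher than the rejected one.
# reward_model is a placeholder scoring network.
import torch.nn.functional as F

def preference_loss(reward_model, images, captions_preferred, captions_rejected):
    r_pos = reward_model(images, captions_preferred)   # (B,)
    r_neg = reward_model(images, captions_rejected)    # (B,)
    # -log sigmoid(r_pos - r_neg): Bradley-Terry likelihood of the annotations.
    return -F.logsigmoid(r_pos - r_neg).mean()
```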
arXiv Detail & Related papers (2023-12-11T05:57:09Z)
- Few-shot Action Recognition with Captioning Foundation Models [61.40271046233581]
CapFSAR is a framework to exploit knowledge of multimodal models without manually annotating text.
A visual-text aggregation module based on a Transformer is further designed to incorporate cross-modal-temporal complementary information.
Experiments on multiple standard few-shot benchmarks demonstrate that the proposed CapFSAR performs favorably against existing methods.
arXiv Detail & Related papers (2023-10-16T07:08:39Z)
- Learning to Decompose Visual Features with Latent Textual Prompts [140.2117637223449]
We propose Decomposed Feature Prompting (DeFo) to improve vision-language models.
Our empirical study shows DeFo's significance in improving vision-language models.
arXiv Detail & Related papers (2022-10-09T15:40:13Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)