A Fine-Grained Image Description Generation Method Based on Joint Objectives
- URL: http://arxiv.org/abs/2311.12799v1
- Date: Sat, 2 Sep 2023 03:22:39 GMT
- Title: A Fine-Grained Image Description Generation Method Based on Joint Objectives
- Authors: Yifan Zhang and Chunzhen Lin and Donglin Cao and Dazhen Lin
- Abstract summary: We propose an innovative Fine-grained Image Description Generation model based on Joint Objectives.
We introduce new object-based evaluation metrics to more intuitively assess the model's performance in handling description repetition and omission.
Experimental results demonstrate that our proposed method significantly improves the CIDEr score.
- Score: 7.565093400979752
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of fine-grained image description generation techniques is to learn
detailed information from images and simulate human-like descriptions that
provide coherent and comprehensive textual details about the image content.
Currently, most of these methods face two main challenges: description
repetition and omission. Moreover, the existing evaluation metrics cannot
clearly reflect the performance of models on these two issues. To address these
challenges, we propose an innovative Fine-grained Image Description Generation
model based on Joint Objectives. Furthermore, we introduce new object-based
evaluation metrics to more intuitively assess the model's performance in
handling description repetition and omission. This approach combines visual
features at both the image level and the object level to exploit their
complementary strengths, and incorporates an object penalty mechanism to reduce
description repetition. Experimental results demonstrate that our proposed
method significantly improves the CIDEr score, indicating strong performance in
addressing description repetition and omission.
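To make the proposed object-based metrics concrete, the sketch below shows one plausible way to score description repetition and omission against a ground-truth object set. The abstract does not give the exact formulation, so the token-level matching scheme and the function names (`repetition_rate`, `omission_rate`) are illustrative assumptions, not the authors' implementation.

```python
# A minimal, illustrative sketch of object-based repetition and omission
# metrics in the spirit of the abstract above. The substring-free, token-level
# object matching used here is an assumption for illustration only.
from collections import Counter


def mentioned_objects(description: str, vocabulary: set[str]) -> list[str]:
    """Collect every occurrence of a known object word in the description."""
    tokens = description.lower().split()
    return [tok for tok in tokens if tok in vocabulary]


def repetition_rate(description: str, vocabulary: set[str]) -> float:
    """Fraction of object mentions that repeat an already-mentioned object."""
    mentions = mentioned_objects(description, vocabulary)
    if not mentions:
        return 0.0
    extra = sum(count - 1 for count in Counter(mentions).values())
    return extra / len(mentions)


def omission_rate(description: str, gt_objects: set[str]) -> float:
    """Fraction of ground-truth objects the description never mentions."""
    if not gt_objects:
        return 0.0
    mentioned = set(mentioned_objects(description, gt_objects))
    return len(gt_objects - mentioned) / len(gt_objects)


# Example: two ground-truth objects, one repeated mention, one omission.
gt = {"dog", "frisbee"}
desc = "a dog jumps to catch a dog in the park"
print(repetition_rate(desc, gt))  # 0.5 -> "dog" is mentioned twice
print(omission_rate(desc, gt))    # 0.5 -> "frisbee" is never mentioned
```

The object penalty mechanism described in the abstract would act at decoding time instead, down-weighting candidate objects the model has already emitted; the metrics above only measure the resulting repetition and omission.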
Related papers
- From Visual Explanations to Counterfactual Explanations with Latent Diffusion [11.433402357922414]
We propose a new approach to tackle two key challenges in recent prominent works.
First, we determine which specific counterfactual features are crucial for distinguishing the "concept" of the target class from the original class.
Second, we provide valuable explanations for the non-robust classifier without relying on the support of an adversarially robust model.
arXiv Detail & Related papers (2025-04-12T13:04:00Z) - Grounding Descriptions in Images informs Zero-Shot Visual Recognition [47.66166611138081]
We propose GRAIN, a new pretraining strategy aimed at aligning representations at both fine and coarse levels simultaneously.
We demonstrate the enhanced zero-shot performance of our model compared to current state-of-the-art methods.
arXiv Detail & Related papers (2024-12-05T18:52:00Z) - Generalizable Single-view Object Pose Estimation by Two-side Generating and Matching [19.730504197461144]
We present a novel generalizable object pose estimation method to determine the object pose using only one RGB image.
Our method offers generalization to unseen objects without extensive training, operates with a single reference image of the object, and eliminates the need for 3D object models or multiple views of the object.
arXiv Detail & Related papers (2024-11-24T14:31:50Z) - A Survey on All-in-One Image Restoration: Taxonomy, Evaluation and Future Trends [67.43992456058541]
Image restoration (IR) refers to the process of improving visual quality of images while removing degradation, such as noise, blur, weather effects, and so on.
Traditional IR methods typically target specific types of degradation, which limits their effectiveness in real-world scenarios with complex distortions.
The all-in-one image restoration (AiOIR) paradigm has emerged, offering a unified framework that adeptly addresses multiple degradation types.
arXiv Detail & Related papers (2024-10-19T11:11:09Z) - Beyond Thumbs Up/Down: Untangling Challenges of Fine-Grained Feedback for Text-to-Image Generation [67.88747330066049]
Fine-grained feedback captures nuanced distinctions in image quality and prompt alignment.
We show that its superiority over coarse-grained feedback is not automatic.
We identify key challenges in eliciting and utilizing fine-grained feedback.
arXiv Detail & Related papers (2024-06-24T17:19:34Z) - Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models [58.74606272936636]
Text-to-image (T2I) diffusion models have shown exceptional capabilities in generating images that closely correspond to textual prompts.
The models could be exploited for malicious purposes, such as generating images with violence or nudity, or creating unauthorized portraits of public figures in inappropriate contexts.
Concept removal methods have been proposed to modify diffusion models to prevent the generation of malicious and unwanted concepts.
arXiv Detail & Related papers (2024-06-21T03:58:44Z) - Inpainting the Gaps: A Novel Framework for Evaluating Explanation Methods in Vision Transformers [10.97134072427802]
In this work, we propose a novel evaluation framework called Inpainting the Gaps (InG).
InG is applied to the PartImageNet dataset to evaluate the performance of popular explanation methods for three training strategies of the Vision Transformer (ViT).
To the best of our knowledge, InG is the first semi-synthetic framework for the evaluation of ViT explanation methods.
arXiv Detail & Related papers (2024-06-17T13:37:35Z) - Information Theoretic Text-to-Image Alignment [49.396917351264655]
We present a novel method that relies on an information-theoretic alignment measure to steer image generation.
Our method is on par with or superior to the state-of-the-art, yet requires nothing but a pre-trained denoising network to estimate mutual information (MI).
arXiv Detail & Related papers (2024-05-31T12:20:02Z) - Confidence-aware Reward Optimization for Fine-tuning Text-to-Image Models [85.96013373385057]
Fine-tuning text-to-image models with reward functions trained on human feedback data has proven effective for aligning model behavior with human intent.
However, excessive optimization with such reward models, which serve as mere proxy objectives, can compromise the performance of fine-tuned models.
We propose TextNorm, a method that enhances alignment based on a measure of reward model confidence estimated across a set of semantically contrastive text prompts.
arXiv Detail & Related papers (2024-04-02T11:40:38Z) - QUASAR: QUality and Aesthetics Scoring with Advanced Representations [20.194917729936357]
This paper introduces a new data-driven, non-parametric method for image quality and aesthetics assessment.
We eliminate the need for expressive textual embeddings by proposing efficient image anchors in the data.
arXiv Detail & Related papers (2024-03-11T16:21:50Z) - DreamArtist++: Controllable One-Shot Text-to-Image Generation via Positive-Negative Adapter [63.622879199281705]
Some example-based image generation approaches have been proposed, i.e., generating new concepts by absorbing the salient features of a few input references.
We propose a simple yet effective framework, namely DreamArtist, which adopts a novel positive-negative prompt-tuning learning strategy on the pre-trained diffusion model.
We have conducted extensive experiments and evaluated the proposed method from image similarity (fidelity) and diversity, generation controllability, and style cloning.
arXiv Detail & Related papers (2022-11-21T10:37:56Z) - A Visual Navigation Perspective for Category-Level Object Pose Estimation [41.60364392204057]
This paper studies category-level object pose estimation based on a single monocular image.
Recent advances in pose-aware generative models have paved the way for addressing this challenging task using analysis-by-synthesis.
arXiv Detail & Related papers (2022-03-25T10:57:37Z) - STEEX: Steering Counterfactual Explanations with Semantics [28.771471624014065]
Deep learning models are increasingly used in safety-critical applications.
For simple images, such as low-resolution face portraits, visual counterfactual explanations have recently been proposed.
We propose a new generative counterfactual explanation framework that produces plausible and sparse modifications.
arXiv Detail & Related papers (2021-11-17T13:20:29Z) - Revisiting The Evaluation of Class Activation Mapping for Explainability: A Novel Metric and Experimental Analysis [54.94682858474711]
Class Activation Mapping (CAM) approaches provide an effective visualization by taking weighted averages of the activation maps.
We propose a novel set of metrics to quantify explanation maps, which prove more effective and simplify comparisons between approaches (a minimal CAM sketch follows this list).
arXiv Detail & Related papers (2021-04-20T21:34:24Z)
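Since the last entry describes Class Activation Mapping as "weighted averages of the activation maps", here is a minimal NumPy sketch of that idea: a class's map is the sum of the last convolutional activation maps weighted by that class's final-layer weights. The shapes, variable names, and the ReLU-plus-normalization convention are assumptions for illustration, not the cited paper's code.

```python
# A minimal sketch of the Class Activation Mapping (CAM) idea: weight each
# activation map by the target class's classifier weight, then sum over maps.
import numpy as np


def class_activation_map(activations: np.ndarray,
                         class_weights: np.ndarray) -> np.ndarray:
    """activations: (K, H, W) feature maps from the last conv layer;
    class_weights: (K,) weights of the target class in the final linear
    layer. Returns an (H, W) saliency map in [0, 1]."""
    # Contract over the K feature-map axis: sum_k w_k * A_k(x, y).
    cam = np.tensordot(class_weights, activations, axes=([0], [0]))
    cam = np.maximum(cam, 0.0)  # keep positive evidence only (a common convention)
    return cam / cam.max() if cam.max() > 0 else cam


# Example with random features: 512 maps of size 7x7, one weight per map.
rng = np.random.default_rng(0)
cam = class_activation_map(rng.standard_normal((512, 7, 7)),
                           rng.standard_normal(512))
print(cam.shape, float(cam.max()))  # (7, 7) 1.0
```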
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.