Quality-agnostic Image Captioning to Safely Assist People with Vision Impairment
- URL: http://arxiv.org/abs/2304.14623v2
- Date: Mon, 1 May 2023 07:35:37 GMT
- Title: Quality-agnostic Image Captioning to Safely Assist People with Vision Impairment
- Authors: Lu Yu, Malvina Nikandrou, Jiali Jin, Verena Rieser
- Abstract summary: We show how data augmentation techniques for generating synthetic noise can address data sparsity in this domain.
Second, we enhance the robustness of the model by expanding a state-of-the-art model to a dual network architecture.
Third, we evaluate the prediction reliability using confidence calibration on images with different difficulty/noise levels.
- Score: 11.864465182761945
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated image captioning has the potential to be a useful tool for people
with vision impairments. Images taken by this user group are often noisy, which
leads to incorrect and even unsafe model predictions. In this paper, we propose
a quality-agnostic framework to improve the performance and robustness of image
captioning models for visually impaired people. We address this problem from
three angles: data, model, and evaluation. First, we show how data augmentation
techniques for generating synthetic noise can address data sparsity in this
domain. Second, we enhance the robustness of the model by expanding a
state-of-the-art model to a dual network architecture, using the augmented data
and leveraging different consistency losses. Our results demonstrate increased
performance, e.g., an absolute improvement of 2.15 points on CIDEr compared to
state-of-the-art image captioning networks, as well as increased robustness to
noise, with an improvement of up to 3 points on CIDEr in noisier settings. Finally,
we evaluate the prediction reliability using confidence calibration on images
with different difficulty/noise levels, showing that our models perform more
reliably in safety-critical situations. The improved model is part of an
assisted living application, which we develop in partnership with the Royal
National Institute of Blind People.
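As an illustrative aside, the minimal sketch below shows the general shape of the recipe the abstract describes: corrupt clean training images with synthetic noise, then penalize disagreement between the model's predictions on the clean and noisy views. The function names, the plain Gaussian pixel noise, and the symmetric-KL consistency term are assumptions for illustration; the paper's actual augmentations, dual-network architecture, and losses may differ.

```python
import torch
import torch.nn.functional as F

def add_synthetic_noise(images: torch.Tensor, noise_std: float = 0.05) -> torch.Tensor:
    """Cheap stand-in for low-quality capture: additive Gaussian pixel noise.
    A fuller pipeline would also simulate blur, occlusion, and bad exposure."""
    return (images + noise_std * torch.randn_like(images)).clamp(0.0, 1.0)

def consistency_loss(logits_clean: torch.Tensor, logits_noisy: torch.Tensor) -> torch.Tensor:
    """Symmetric KL divergence between the per-token output distributions
    of the clean and noisy branches of a dual network."""
    log_p = F.log_softmax(logits_clean, dim=-1)
    log_q = F.log_softmax(logits_noisy, dim=-1)
    kl_pq = F.kl_div(log_q, log_p.exp(), reduction="batchmean")  # KL(p || q)
    kl_qp = F.kl_div(log_p, log_q.exp(), reduction="batchmean")  # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)

# Hypothetical usage with a captioner `model` mapping images to token logits:
#   total = caption_loss + lam * consistency_loss(model(x), model(add_synthetic_noise(x)))
```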
Related papers
- Leveraging generative models to characterize the failure conditions of image classifiers [5.018156030818883]
We exploit the capacity of Generative Adversarial Networks (StyleGAN2) to produce controllable distributions of high-quality image data.
Failure conditions are expressed as directions of strong performance degradation in the generative model's latent space.
arXiv Detail & Related papers (2024-10-01T08:52:46Z)
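A sketch of the failure-direction idea above (not the paper's implementation): given an assumed pretrained `generator` and `classifier`, one can optimize a latent-space direction along which the classifier's confidence in the true class degrades fastest.

```python
import torch

def failure_direction(generator, classifier, z: torch.Tensor,
                      label: int, steps: int = 50, lr: float = 0.1) -> torch.Tensor:
    """Find a unit direction d such that moving z along d degrades the
    classifier's probability for `label` on the generated images."""
    d = torch.zeros_like(z, requires_grad=True)
    opt = torch.optim.Adam([d], lr=lr)
    for _ in range(steps):
        images = generator(z + d)                        # walk the latent space
        true_prob = classifier(images).softmax(-1)[:, label]
        loss = true_prob.mean()                          # minimize true-class probability
        opt.zero_grad()
        loss.backward()
        opt.step()
    d = d.detach()
    return d / d.norm()                                  # unit failure direction
```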
- Indoor scene recognition from images under visual corruptions [3.4861209026118836]
This paper presents an innovative approach to indoor scene recognition that leverages multimodal data fusion.
We examine two multimodal networks that synergize visual features from CNN models with semantic captions via a Graph Convolutional Network (GCN).
Our study shows that this fusion markedly improves model performance, with notable gains in Top-1 accuracy when evaluated against a corrupted subset of the Places365 dataset.
arXiv Detail & Related papers (2024-08-23T12:35:45Z)
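A rough sketch of this kind of fusion, under stated assumptions: CNN image features and caption-word embeddings form the nodes of a small graph, and a single GCN layer (normalized propagation of node features through a learned linear map) mixes them before classification. The graph construction and dimensions are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn

class FusionGCN(nn.Module):
    """One-layer GCN over a graph whose nodes are a CNN feature vector
    plus word embeddings of a generated caption (illustrative design)."""
    def __init__(self, dim: int = 512, num_classes: int = 365):
        super().__init__()
        self.gcn = nn.Linear(dim, dim)        # the W of a single GCN layer
        self.head = nn.Linear(dim, num_classes)

    def forward(self, nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # nodes: (N, dim) = [cnn_feature; word_embeddings], adj: (N, N)
        deg = adj.sum(-1, keepdim=True).clamp(min=1.0)
        h = torch.relu(self.gcn((adj / deg) @ nodes))   # row-normalized propagation
        return self.head(h.mean(dim=0))                 # pool nodes, classify the scene
```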
- Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address the mismatch between CLIP's generic pretraining and the IQA task using prompt techniques, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
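For flavor, here is a minimal prompt-based quality score in the spirit of CLIP-IQA-style methods (the paper's learned multi-modal prompts are more elaborate): quality is the softmax weight CLIP assigns to a "good photo" prompt over a "bad photo" prompt. This assumes the openai/CLIP package, and the prompt pair is an illustrative choice.

```python
import clip
import torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")
prompts = clip.tokenize(["a good photo.", "a bad photo."])

@torch.no_grad()
def quality_score(path: str) -> float:
    """Return a [0, 1] quality score: the weight on the 'good photo' prompt."""
    image = preprocess(Image.open(path)).unsqueeze(0)
    img_f = model.encode_image(image)
    txt_f = model.encode_text(prompts)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    logits = 100.0 * img_f @ txt_f.T        # CLIP-style scaled cosine similarity
    return logits.softmax(dim=-1)[0, 0].item()
```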
- Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation [3.976813869450304]
We focus on enhancing color and contrast, improving generation across multiple aspect ratios, and improving human-centric fine details.
Our model is open-source, and we hope the development of Playground v2.5 provides valuable guidelines for researchers aiming to elevate the aesthetic quality of diffusion-based image generation models.
arXiv Detail & Related papers (2024-02-27T06:31:52Z)
- Exploring the Robustness of Human Parsers Towards Common Corruptions [99.89886010550836]
We construct three corruption robustness benchmarks, termed LIP-C, ATR-C, and Pascal-Person-Part-C, to assist us in evaluating the risk tolerance of human parsing models.
Inspired by data augmentation strategies, we propose a novel heterogeneous augmentation-enhanced mechanism to bolster robustness under commonly corrupted conditions.
arXiv Detail & Related papers (2023-09-02T13:32:14Z)
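Corrupted benchmark variants like LIP-C are typically built by applying the fifteen common corruptions of the ImageNet-C protocol at severities 1 to 5. A sketch of that assumed workflow (not the authors' script) using the `imagecorruptions` package:

```python
import numpy as np
from imagecorruptions import corrupt, get_corruption_names

def corrupted_variants(image: np.ndarray):
    """Yield every (corruption, severity) variant of an HxWx3 uint8 image."""
    for name in get_corruption_names():   # gaussian_noise, fog, motion_blur, ...
        for severity in range(1, 6):
            yield name, severity, corrupt(image, corruption_name=name,
                                          severity=severity)
```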
- Helping Visually Impaired People Take Better Quality Pictures [52.03016269364854]
We develop tools to help visually impaired users minimize occurrences of common technical distortions.
We also create a prototype feedback system that helps to guide users to mitigate quality issues.
arXiv Detail & Related papers (2023-05-14T04:37:53Z)
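One hypothetical feedback rule such a system might implement (not the paper's actual system) is blur detection via the variance of the Laplacian, a standard focus measure; the threshold below is an assumption to tune per device.

```python
import cv2

def blur_feedback(path: str, threshold: float = 100.0) -> str:
    """Advise the user if the photo is likely out of focus."""
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    focus = cv2.Laplacian(gray, cv2.CV_64F).var()   # low variance => few edges => blur
    if focus >= threshold:
        return "Image looks sharp."
    return "Image may be blurry: hold the camera steady and try again."
```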
- On the Robustness of Quality Measures for GANs [136.18799984346248]
This work evaluates the robustness of quality measures for generative models, such as the Inception Score (IS) and the Fréchet Inception Distance (FID).
We show that such metrics can also be manipulated by additive pixel perturbations.
arXiv Detail & Related papers (2022-01-31T06:43:09Z)
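A simplified check in the spirit of that finding (the paper uses optimized perturbations; plain random pixel noise here merely illustrates the metric's sensitivity), using torchmetrics' FID implementation:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def fid_under_noise(images: torch.Tensor, eps: int = 8) -> float:
    """FID between a uint8 image batch (N, 3, H, W) and a copy of itself
    perturbed by uniform pixel noise with budget `eps` in [0, 255]."""
    noise = torch.randint(-eps, eps + 1, images.shape, dtype=torch.int16)
    perturbed = (images.to(torch.int16) + noise).clamp(0, 255).to(torch.uint8)
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(images, real=True)       # treat the originals as the "real" set
    fid.update(perturbed, real=False)   # the perturbed copies as the "fake" set
    return fid.compute().item()
```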
- Towards Unsupervised Deep Image Enhancement with Generative Adversarial Network [92.01145655155374]
We present an unsupervised image enhancement generative network (UEGAN).
It learns the corresponding image-to-image mapping from a set of images with desired characteristics in an unsupervised manner.
Results show that the proposed model effectively improves the aesthetic quality of images.
arXiv Detail & Related papers (2020-12-30T03:22:46Z)
- Inducing Predictive Uncertainty Estimation for Face Recognition [102.58180557181643]
We propose a method for generating image quality training data automatically from 'mated pairs' of face images.
We use the generated data to train a lightweight Predictive Confidence Network, termed PCNet, for estimating the confidence score of a face image.
arXiv Detail & Related papers (2020-09-01T17:52:00Z)
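A sketch of the mated-pair labelling idea under assumed details: for two images of the same identity, the embedding similarity produced by a strong face recognizer (`embed`, assumed pretrained) can serve as a pseudo quality target that a network like PCNet is then trained to regress from a single image.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_quality_labels(embed, imgs_a: torch.Tensor, imgs_b: torch.Tensor) -> torch.Tensor:
    """imgs_a/imgs_b: batches of mated face images (same identities, row-wise).
    Returns a per-pair label in [0, 1] usable as a confidence regression target."""
    feats_a = embed(imgs_a)
    feats_b = embed(imgs_b)
    sim = F.cosine_similarity(feats_a, feats_b, dim=-1)  # high => easy, high-quality pair
    return sim.clamp(0.0, 1.0)
```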