Related papers: A Data-Centric Perspective on the Influence of Image Data Quality in Machine Learning Models

URL: http://arxiv.org/abs/2509.24420v1
Date: Mon, 29 Sep 2025 08:09:21 GMT
Title: A Data-Centric Perspective on the Influence of Image Data Quality in Machine Learning Models
Authors: Pei-Han Chen, Szu-Chi Chung
Abstract summary: This study investigates methods for systematically assessing image dataset quality. We identify common quality issues and quantify their impact on training. We develop a pipeline that integrates two community-developed tools, CleanVision and Fastdup.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In machine learning, research has traditionally focused on model development, with relatively less attention paid to training data. As model architectures have matured and marginal gains from further refinements diminish, data quality has emerged as a critical factor. However, systematic studies on evaluating and ensuring dataset quality in the image domain remain limited. This study investigates methods for systematically assessing image dataset quality and examines how various image quality factors influence model performance. Using the publicly available and relatively clean CIFAKE dataset, we identify common quality issues and quantify their impact on training. Building on these findings, we develop a pipeline that integrates two community-developed tools, CleanVision and Fastdup. We analyze their underlying mechanisms and introduce several enhancements, including automatic threshold selection to detect problematic images without manual tuning. Experimental results demonstrate that not all quality issues exert the same level of impact. While convolutional neural networks show resilience to certain distortions, they are particularly vulnerable to degradations that obscure critical visual features, such as blurring and severe downscaling. To assess the performance of existing tools and the effectiveness of our proposed enhancements, we formulate the detection of low-quality images as a binary classification task and use the F1 score as the evaluation metric. Our automatic thresholding method improves the F1 score from 0.6794 to 0.9468 under single perturbations and from 0.7447 to 0.8557 under dual perturbations. For near-duplicate detection, our deduplication strategy increases the F1 score from 0.4576 to 0.7928. These results underscore the effectiveness of our workflow and provide a foundation for advancing data quality assessment in image-based machine learning.
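
The abstract frames low-quality image detection as a binary classification task scored with F1. The following is a minimal illustrative sketch of that evaluation, not the authors' pipeline: the function names `flag_low_quality` and `f1_score`, the quality scores, the labels, and the thresholds are all hypothetical, and no CleanVision or Fastdup API is used.

```python
def flag_low_quality(scores, threshold):
    """Flag an image as low quality (1) when its score falls below the threshold.

    Assumption: higher score means better quality. The scores are made up.
    """
    return [1 if s < threshold else 0 for s in scores]


def f1_score(y_true, y_pred):
    """F1 for the positive class (1 = low-quality image)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-image quality scores and ground-truth labels (1 = perturbed).
scores = [0.9, 0.2, 0.4, 0.8, 0.1]
labels = [0, 1, 1, 0, 1]

f1_strict = f1_score(labels, flag_low_quality(scores, threshold=0.3))
f1_loose = f1_score(labels, flag_low_quality(scores, threshold=0.5))
print(f1_strict, f1_loose)
```

In this toy data the stricter threshold misses one perturbed image, so its F1 is lower than that of the looser threshold. This shows why the choice of threshold matters for the metric.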

Related papers

Implicit Neural Representation-Based Continuous Single Image Super Resolution: An Empirical Study [50.15623093332659]
Implicit neural representation (INR) has become the standard approach for arbitrary-scale image super-resolution (ASSR). We compare existing techniques across diverse settings and present aggregated performance results on multiple image quality metrics. We examine a new loss function that penalizes intensity variations while preserving edges, textures, and finer details during training.
arXiv Detail & Related papers (2026-01-25T07:09:20Z)
Understanding and Improving Training-Free AI-Generated Image Detections with Vision Foundation Models [68.90917438865078]
Deepfake techniques for facial synthesis and editing pose serious risks for generative models. In this paper, we investigate how detection performance varies across model backbones, types, and datasets. We introduce Contrastive Blur, which enhances performance on facial images, and MINDER, which addresses noise type bias, balancing performance across domains.
arXiv Detail & Related papers (2024-11-28T13:04:45Z)
Accelerating Domain-Aware Electron Microscopy Analysis Using Deep Learning Models with Synthetic Data and Image-Wide Confidence Scoring [0.0]
We create a physics-based synthetic image and data generator, resulting in a machine learning model that achieves comparable precision (0.86), recall (0.63), F1 scores (0.71), and engineering property predictions (R2=0.82). Our study demonstrates that synthetic data can eliminate human reliance in ML and provides a means for domain awareness in cases where many feature detections per image are needed.
arXiv Detail & Related papers (2024-08-02T20:15:15Z)
DP-IQA: Utilizing Diffusion Prior for Blind Image Quality Assessment in the Wild [73.6767681305851]
Blind image quality assessment (IQA) in the wild presents significant challenges. Given the difficulty in collecting large-scale training data, leveraging limited data to develop a model with strong generalization remains an open problem. Motivated by the robust image perception capabilities of pre-trained text-to-image (T2I) diffusion models, we propose a novel IQA method, diffusion priors-based IQA.
arXiv Detail & Related papers (2024-05-30T12:32:35Z)
Opinion-Unaware Blind Image Quality Assessment using Multi-Scale Deep Feature Statistics [54.08757792080732]
We propose integrating deep features from pre-trained visual models with a statistical analysis model to achieve opinion-unaware BIQA (OU-BIQA). Our proposed model exhibits superior consistency with human visual perception compared to state-of-the-art BIQA models.
arXiv Detail & Related papers (2024-05-29T06:09:34Z)
Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly. Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness. Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings. This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
Helping Visually Impaired People Take Better Quality Pictures [52.03016269364854]
We develop tools to help visually impaired users minimize occurrences of common technical distortions. We also create a prototype feedback system that helps to guide users to mitigate quality issues.
arXiv Detail & Related papers (2023-05-14T04:37:53Z)
Quality-aware Pre-trained Models for Blind Image Quality Assessment [15.566552014530938]
Blind image quality assessment (BIQA) aims to automatically evaluate the perceived quality of a single image. In this paper, we propose to solve the problem by a pretext task customized for BIQA in a self-supervised learning manner.
arXiv Detail & Related papers (2023-03-01T13:52:40Z)
MSTRIQ: No Reference Image Quality Assessment Based on Swin Transformer with Multi-Stage Fusion [8.338999282303755]
We propose a novel algorithm based on the Swin Transformer. It aggregates information from both local and global features to better predict the quality. It ranks 2nd in the no-reference track of NTIRE 2022 Perceptual Image Quality Assessment Challenge.
arXiv Detail & Related papers (2022-05-20T11:34:35Z)
Image Quality Assessment using Contrastive Learning [50.265638572116984]
We train a deep Convolutional Neural Network (CNN) using a contrastive pairwise objective to solve the auxiliary problem. We show through extensive experiments that CONTRIQUE achieves competitive performance when compared to state-of-the-art NR image quality models. Our results suggest that powerful quality representations with perceptual relevance can be obtained without requiring large labeled subjective image quality datasets.
arXiv Detail & Related papers (2021-10-25T21:01:00Z)
Unsupervised machine learning via transfer learning and k-means clustering to classify materials image data [0.0]
This paper demonstrates how to construct, use, and evaluate a high-performance unsupervised machine learning system for classifying images. We use the VGG16 convolutional neural network pre-trained on the ImageNet dataset of natural images to extract feature representations for each micrograph. The approach achieves $99.4\% \pm 0.16\%$ accuracy, and the resulting model can be used to classify new images without retraining.
arXiv Detail & Related papers (2020-07-16T14:36:04Z)
No-Reference Image Quality Assessment via Feature Fusion and Multi-Task Learning [29.19484863898778]
Blind or no-reference image quality assessment (NR-IQA) is a fundamental, unsolved, and yet challenging problem. We propose a simple and yet effective general-purpose no-reference (NR) image quality assessment framework based on multi-task learning. Our model employs distortion types as well as subjective human scores to predict image quality.
arXiv Detail & Related papers (2020-06-06T05:04:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.