Revisiting the Role of Foundation Models in Cell-Level Histopathological Image Analysis under Small-Patch Constraints -- Effects of Training Data Scale and Blur Perturbations on CNNs and Vision Transformers
- URL: http://arxiv.org/abs/2603.04081v1
- Date: Wed, 04 Mar 2026 13:52:19 GMT
- Title: Revisiting the Role of Foundation Models in Cell-Level Histopathological Image Analysis under Small-Patch Constraints -- Effects of Training Data Scale and Blur Perturbations on CNNs and Vision Transformers
- Authors: Hiroki Kagiyama, Toru Nagasaka, Yukari Adachi, Takaaki Tachibana, Ryota Ito, Mitsugu Fujita, Kimihiro Yamashita, Yoshihiro Kakeji,
- Abstract summary: Cell-level pathological image analysis requires working with extremely small image patches (40x40 pixels)<n>We systematically evaluated architectural suitability and data-scale effects for small-patch cell classification.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Background and objective: Cell-level pathological image analysis requires working with extremely small image patches (40x40 pixels), far below standard ImageNet resolutions. It remains unclear whether modern deep learning architectures and foundation models can learn robust and scalable representations under this constraint. We systematically evaluated architectural suitability and data-scale effects for small-patch cell classification. Methods: We analyzed 303 colorectal cancer specimens with CD103/CD8 immunostaining, generating 185,432 annotated cell images. Eight task-specific architectures were trained from scratch at multiple data scales (FlagLimit: 256--16,384 samples per class), and three foundation models were evaluated via linear probing and fine-tuning after resizing inputs to 224x224 pixels. Robustness to blur was assessed using pre- and post-resize Gaussian perturbations. Results: Task-specific models improved consistently with increasing data scale, whereas foundation models saturated at moderate sample sizes. A Vision Transformer optimized for small patches (CustomViT) achieved the highest accuracy, outperforming all foundation models with substantially lower inference cost. Blur robustness was comparable across architectures, with no qualitative advantage observed for foundation models. Conclusion: For cell-level classification under extreme spatial constraints, task-specific architectures are more effective and efficient than foundation models once sufficient training data are available. Higher clean accuracy does not imply superior robustness, and large pre-trained models offer limited benefit in the small-patch regime.
Related papers
- Less Is More: An Explainable AI Framework for Lightweight Malaria Classification [0.0]
This study addresses whether complex neural networks are essential for the simple binary classification task of malaria.<n>We introduce the Extracted Morphological Feature Engineered (EMFE) pipeline, a transparent, reproducible, and low compute machine learning approach.
arXiv Detail & Related papers (2025-11-22T14:46:59Z) - Improving Diagnostic Performance on Small and Imbalanced Datasets Using Class-Based Input Image Composition [0.2676349883103403]
Small, imbalanced datasets and poor input image quality can lead to high false predictions rates with deep learning models.<n>This paper introduces Class-Based Image Composition, an approach that allows us to reformulate training inputs through a fusion of multiple images of the same class into combined visual composites.
arXiv Detail & Related papers (2025-11-05T22:34:06Z) - A Lightweight and Extensible Cell Segmentation and Classification Model for Whole Slide Images [0.0]
We propose a solution that enhances data quality, model performance, and usability by creating a lightweight, cell segmentation and classification model.<n>We update data labels through cross-relabeling to refine annotations of PanNuke and MoNuSAC, producing a unified dataset with seven distinct cell types.<n>Third, to address foundation models' computational demands, we distill knowledge to reduce model size and complexity while maintaining comparable performance.
arXiv Detail & Related papers (2025-02-26T15:19:52Z) - Enhanced Convolutional Neural Networks for Improved Image Classification [0.40964539027092917]
CIFAR-10 is a widely used benchmark to evaluate the performance of classification models on small-scale, multi-class datasets.<n>We propose an enhanced CNN architecture that integrates deeper convolutional blocks, batch normalization, and dropout regularization to achieve superior performance.
arXiv Detail & Related papers (2025-02-02T04:32:25Z) - FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects [55.77542145604758]
FoundationPose is a unified foundation model for 6D object pose estimation and tracking.
Our approach can be instantly applied at test-time to a novel object without fine-tuning.
arXiv Detail & Related papers (2023-12-13T18:28:09Z) - A Simple and Efficient Baseline for Data Attribution on Images [107.12337511216228]
Current state-of-the-art approaches require a large ensemble of as many as 300,000 models to accurately attribute model predictions.
In this work, we focus on a minimalist baseline, utilizing the feature space of a backbone pretrained via self-supervised learning to perform data attribution.
Our method is model-agnostic and scales easily to large datasets.
arXiv Detail & Related papers (2023-11-03T17:29:46Z) - DIME-FM: DIstilling Multimodal and Efficient Foundation Models [72.1900621000677]
Large Vision-Language Foundation Models (VLFM) are trained on large-scale datasets of image-caption pairs.
We introduce a new distillation mechanism (DIME-FM) that allows us to transfer the knowledge contained in large VLFMs to smaller, customized foundation models.
The resulting model "Distill-ViT-B/32" rivals the CLIP-ViT-B/32 model pre-trained on its private WiT dataset.
arXiv Detail & Related papers (2023-03-31T17:47:23Z) - Stacking Ensemble Learning in Deep Domain Adaptation for Ophthalmic
Image Classification [61.656149405657246]
Domain adaptation is effective in image classification tasks where obtaining sufficient label data is challenging.
We propose a novel method, named SELDA, for stacking ensemble learning via extending three domain adaptation methods.
The experimental results using Age-Related Eye Disease Study (AREDS) benchmark ophthalmic dataset demonstrate the effectiveness of the proposed model.
arXiv Detail & Related papers (2022-09-27T14:19:00Z) - Simple Open-Vocabulary Object Detection with Vision Transformers [51.57562920090721]
We propose a strong recipe for transferring image-text models to open-vocabulary object detection.
We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning.
We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection.
arXiv Detail & Related papers (2022-05-12T17:20:36Z) - Improved Techniques for Training Score-Based Generative Models [104.20217659157701]
We provide a new theoretical analysis of learning and sampling from score models in high dimensional spaces.
We can effortlessly scale score-based generative models to images with unprecedented resolutions.
Our score-based models can generate high-fidelity samples that rival best-in-class GANs on various image datasets.
arXiv Detail & Related papers (2020-06-16T09:17:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.