Leveraging Model Soups to Classify Intangible Cultural Heritage Images from the Mekong Delta
- URL: http://arxiv.org/abs/2603.02181v1
- Date: Mon, 02 Mar 2026 18:50:15 GMT
- Title: Leveraging Model Soups to Classify Intangible Cultural Heritage Images from the Mekong Delta
- Authors: Quoc-Khang Tran, Minh-Thien Nguyen, Nguyen-Khang Pham
- Abstract summary: Classification of Intangible Cultural Heritage (ICH) images in the Mekong Delta poses unique challenges. We propose a robust framework that integrates the hybrid CoAtNet architecture with model soups. Our approach achieves state-of-the-art results with 72.36% top-1 accuracy and 69.28% macro F1-score, outperforming strong baselines.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The classification of Intangible Cultural Heritage (ICH) images in the Mekong Delta poses unique challenges due to limited annotated data, high visual similarity among classes, and domain heterogeneity. In such low-resource settings, conventional deep learning models often suffer from high variance or overfit to spurious correlations, leading to poor generalization. To address these limitations, we propose a robust framework that integrates the hybrid CoAtNet architecture with model soups, a lightweight weight-space ensembling technique that averages checkpoints from a single training trajectory without increasing inference cost. CoAtNet captures both local and global patterns through stage-wise fusion of convolution and self-attention. We apply two ensembling strategies - greedy and uniform soup - to selectively combine diverse checkpoints into a final model. Beyond performance improvements, we analyze the ensembling effect through the lens of bias-variance decomposition. Our findings show that model soups reduces variance by stabilizing predictions across diverse model snapshots, while introducing minimal additional bias. Furthermore, using cross-entropy-based distance metrics and Multidimensional Scaling (MDS), we show that model soups selects geometrically diverse checkpoints, unlike Soft Voting, which blends redundant models centered in output space. Evaluated on the ICH-17 dataset (7,406 images across 17 classes), our approach achieves state-of-the-art results with 72.36% top-1 accuracy and 69.28% macro F1-score, outperforming strong baselines including ResNet-50, DenseNet-121, and ViT. These results underscore that diversity-aware checkpoint averaging provides a principled and efficient way to reduce variance and enhance generalization in culturally rich, data-scarce classification tasks.
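The two ensembling strategies named in the abstract, uniform and greedy soup, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give their exact procedure, so the greedy rule here follows the common recipe (rank checkpoints by held-out score, keep a candidate only if the averaged model does not get worse). The toy linear classifier, the synthetic validation set, and the names `uniform_soup`, `greedy_soup`, and `evaluate` are all assumptions made for illustration.

```python
import numpy as np

def uniform_soup(checkpoints):
    """Average every parameter array across all checkpoints."""
    keys = checkpoints[0].keys()
    return {k: np.mean([c[k] for c in checkpoints], axis=0) for k in keys}

def greedy_soup(checkpoints, evaluate):
    """Add checkpoints in order of held-out score, keeping a candidate
    only if the averaged model scores at least as well as before."""
    ranked = sorted(checkpoints, key=evaluate, reverse=True)
    ingredients = [ranked[0]]
    best = evaluate(ranked[0])
    for cand in ranked[1:]:
        score = evaluate(uniform_soup(ingredients + [cand]))
        if score >= best:
            ingredients.append(cand)
            best = score
    return uniform_soup(ingredients), len(ingredients)

# Hypothetical held-out set: 2-D points labelled by a linear rule.
rng = np.random.default_rng(0)
X_val = rng.normal(size=(200, 2))
y_val = (X_val @ np.array([1.0, -1.0]) > 0).astype(int)

def evaluate(params):
    """Held-out accuracy of a toy linear classifier with weights w, bias b."""
    logits = X_val @ params["w"] + params["b"]
    return float(np.mean((logits > 0).astype(int) == y_val))

# Hypothetical checkpoints: noisy copies of the true weights, standing in
# for snapshots saved along one training trajectory.
checkpoints = [
    {"w": np.array([1.0, -1.0]) + rng.normal(scale=0.5, size=2),
     "b": rng.normal(scale=0.1, size=1)}
    for _ in range(5)
]

soup, n_used = greedy_soup(checkpoints, evaluate)
print("best single checkpoint:", max(evaluate(c) for c in checkpoints))
print("uniform soup:", evaluate(uniform_soup(checkpoints)))
print("greedy soup:", evaluate(soup), "using", n_used, "checkpoints")
```

Either way the result is a single set of weights, so inference costs the same as for one model. By construction the greedy soup never scores below the best single checkpoint on the held-out set used for selection; the uniform soup carries no such guarantee.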
Related papers
- ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation [34.173549610331385]
Model merging aims to combine multiple task-specific expert models into a single model. Interference among experts, especially when they are trained on different objectives, often leads to significant performance degradation. ACE-Merging is an Adaptive Covariance Estimation framework that effectively mitigates inter-task interference.
arXiv Detail & Related papers (2026-03-03T12:53:04Z)
- When Are Two Scores Better Than One? Investigating Ensembles of Diffusion Models [22.019987128734282]
We find that while ensembling the scores generally improves the score-matching loss and model likelihood, it fails to consistently enhance perceptual quality metrics such as FID on image datasets. We also provide theoretical insights into the summing of score models, which shed light not only on ensembling but also on several model composition techniques.
arXiv Detail & Related papers (2026-01-16T17:07:25Z)
- MS-ISSM: Objective Quality Assessment of Point Clouds Using Multi-scale Implicit Structural Similarity [65.85858856481131]
The unstructured and irregular nature of point clouds poses a significant challenge for objective point cloud quality assessment (PCQA). We propose the Multi-scale Implicit Structural Similarity Measurement (MS-ISSM).
arXiv Detail & Related papers (2026-01-03T14:58:52Z)
- Trade-offs in Cross-Domain Generalization of Foundation Model Fine-Tuned for Biometric Applications [18.08946802592489]
Foundation models such as CLIP have demonstrated exceptional zero- and few-shot transfer capabilities across diverse vision tasks. However, when fine-tuned for highly specialized biometric tasks such as face recognition (FR), morphing attack detection (MAD), and presentation attack detection (PAD), these models may suffer from over-specialization. We systematically quantify these trade-offs by evaluating three instances of CLIP fine-tuned for FR, MAD, and PAD.
arXiv Detail & Related papers (2025-09-18T12:58:18Z)
- Learning Majority-to-Minority Transformations with MMD and Triplet Loss for Imbalanced Classification [0.5390869741300152]
Class imbalance in supervised classification often degrades model performance by biasing predictions toward the majority class. We introduce an oversampling framework that learns a parametric transformation to map majority samples into the minority distribution. Our approach minimizes the maximum mean discrepancy (MMD) between transformed and true minority samples for global alignment.
arXiv Detail & Related papers (2025-09-15T01:47:29Z)
- A Simple and Generalist Approach for Panoptic Segmentation [57.94892855772925]
We propose a simple generalist framework based on a deep encoder - shallow decoder architecture with per-pixel prediction. We show that this is due to imbalance during training and propose a novel method for reducing it. Our method achieves panoptic quality (PQ) of 55.1 on the challenging MS-COCO dataset.
arXiv Detail & Related papers (2024-08-29T13:02:12Z)
- GRIDS: Grouped Multiple-Degradation Restoration with Image Degradation Similarity [35.11349385659554]
Grouped Restoration with Image Degradation Similarity (GRIDS) is a novel approach that harmonizes the competing objectives inherent in multiple-degradation restoration.
Based on the degradation similarity, GRIDS divides restoration tasks into one of the optimal groups, where tasks within the same group are highly correlated.
Trained models within each group show significant improvements, with an average improvement of 0.09dB over single-task upper bound models.
arXiv Detail & Related papers (2024-07-17T02:43:32Z)
- SR-Stereo & DAPE: Stepwise Regression and Pre-trained Edges for Practical Stereo Matching [2.8908326904081334]
We propose a novel stepwise regression architecture to overcome domain discrepancies.
To enhance the edge awareness of models adapting to new domains with sparse ground truth, we propose Domain Adaptation based on Pre-trained Edges (DAPE).
The proposed SR-Stereo and DAPE are extensively evaluated on SceneFlow, KITTI, Middlebury 2014 and ETH3D.
arXiv Detail & Related papers (2024-06-11T05:25:25Z)
- Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution [67.9215891673174]
We propose score entropy as a novel loss that naturally extends score matching to discrete spaces.
We test our Score Entropy Discrete Diffusion models on standard language modeling tasks.
arXiv Detail & Related papers (2023-10-25T17:59:12Z)
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
- CAMERO: Consistency Regularized Ensemble of Perturbed Language Models with Weight Sharing [83.63107444454938]
We propose a consistency-regularized ensemble learning approach based on perturbed models, named CAMERO.
Specifically, we share the weights of bottom layers across all models and apply different perturbations to the hidden representations for different models, which can effectively promote the model diversity.
Our experiments using large language models demonstrate that CAMERO significantly improves the generalization performance of the ensemble model.
arXiv Detail & Related papers (2022-04-13T19:54:51Z)
- Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time [69.7693300927423]
We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations improves accuracy and robustness.
We show that the model soup approach extends to multiple image classification and natural language processing tasks.
arXiv Detail & Related papers (2022-03-10T17:03:49Z)
- Diversity inducing Information Bottleneck in Model Ensembles [73.80615604822435]
In this paper, we target the problem of generating effective ensembles of neural networks by encouraging diversity in prediction.
We explicitly optimize a diversity inducing adversarial loss for learning latent variables and thereby obtain diversity in the output predictions necessary for modeling multi-modal data.
Compared to the most competitive baselines, we show significant improvements in classification accuracy, under a shift in the data distribution.
arXiv Detail & Related papers (2020-03-10T03:10:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.