Transformers for Tabular Data: A Training Perspective of Self-Attention via Optimal Transport
- URL: http://arxiv.org/abs/2512.09530v1
- Date: Wed, 10 Dec 2025 11:11:52 GMT
- Title: Transformers for Tabular Data: A Training Perspective of Self-Attention via Optimal Transport
- Authors: Antonio Candelieri, Alessandro Quadrio,
- Abstract summary: The study tracks intermediate projections of the self-attention layer during training and evaluates their evolution.<n> Experiments are conducted on classification tasks with two and three classes, as well as on a biomedical dataset.<n>Results indicate that the final self-attention mapping often approximates the OT optimal coupling, yet the training trajectory remains inefficient.
- Score: 42.80108166488221
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This thesis examines self-attention training through the lens of Optimal Transport (OT) and develops an OT-based alternative for tabular classification. The study tracks intermediate projections of the self-attention layer during training and evaluates their evolution using discrete OT metrics, including Wasserstein distance, Monge gap, optimality, and efficiency. Experiments are conducted on classification tasks with two and three classes, as well as on a biomedical dataset. Results indicate that the final self-attention mapping often approximates the OT optimal coupling, yet the training trajectory remains inefficient. Pretraining the MLP section on synthetic data partially improves convergence but is sensitive to their initialization. To address these limitations, an OT-based algorithm is introduced: it generates class-specific dummy Gaussian distributions, computes an OT alignment with the data, and trains an MLP to generalize this mapping. The method achieves accuracy comparable to Transformers while reducing computational cost and scaling more efficiently under standardized inputs, though its performance depends on careful dummy-geometry design. All experiments and implementations are conducted in R.
Related papers
- Orientation-Aware Sparse Tensor PCA for Efficient Unsupervised Feature Selection [7.887782360541216]
We introduce Decomposition (TD) techniques into unsupervised feature selection (UFS)<n>We use the orientation-dependent tensor-tensor product from sparse Singular Value Decomposition to solve the problem.<n>The proposed tensor PCA model can constrain sparsity at the specified mode and yield sparse tensor principal components.
arXiv Detail & Related papers (2024-07-24T04:04:56Z) - Test-time adaptation for geospatial point cloud semantic segmentation with distinct domain shifts [6.80671668491958]
Test-time adaptation (TTA) allows direct adaptation of a pre-trained model to unlabeled data during inference stage without access to source data or additional training.
We propose three domain shift paradigms: photogrammetric to airborne LiDAR, airborne to mobile LiDAR, and synthetic to mobile laser scanning.
Experimental results show our method improves classification accuracy by up to 20% mIoU, outperforming other methods.
arXiv Detail & Related papers (2024-07-08T15:40:28Z) - Efficient Transferability Assessment for Selection of Pre-trained Detectors [63.21514888618542]
This paper studies the efficient transferability assessment of pre-trained object detectors.
We build up a detector transferability benchmark which contains a large and diverse zoo of pre-trained detectors.
Experimental results demonstrate that our method outperforms other state-of-the-art approaches in assessing transferability.
arXiv Detail & Related papers (2024-03-14T14:23:23Z) - Integrating Efficient Optimal Transport and Functional Maps For
Unsupervised Shape Correspondence Learning [43.6925865296259]
We present an unsupervised shape matching framework that integrates functional map regularizers with a novel OT-based loss derived from SWD.
We also introduce an adaptive refinement process utilizing entropy regularized OT, further refining feature alignments for accurate point-to-point correspondences.
Our method demonstrates superior performance in non-rigid shape matching, including near-isometric and non-isometric scenarios.
arXiv Detail & Related papers (2024-03-04T07:21:07Z) - Double-Bounded Optimal Transport for Advanced Clustering and
Classification [58.237576976486544]
We propose Doubly Bounded Optimal Transport (DB-OT), which assumes that the target distribution is restricted within two boundaries instead of a fixed one.
We show that our method can achieve good results with our improved inference scheme in the testing stage.
arXiv Detail & Related papers (2024-01-21T07:43:01Z) - Uncovering the Hidden Cost of Model Compression [43.62624133952414]
Visual Prompting has emerged as a pivotal method for transfer learning in computer vision.
Model compression detrimentally impacts the performance of visual prompting-based transfer.
However, negative effects on calibration are not present when models are compressed via quantization.
arXiv Detail & Related papers (2023-08-29T01:47:49Z) - Improving Molecular Representation Learning with Metric
Learning-enhanced Optimal Transport [49.237577649802034]
We develop a novel optimal transport-based algorithm termed MROT to enhance their generalization capability for molecular regression problems.
MROT significantly outperforms state-of-the-art models, showing promising potential in accelerating the discovery of new substances.
arXiv Detail & Related papers (2022-02-13T04:56:18Z) - Robust Optimal Transport with Applications in Generative Modeling and
Domain Adaptation [120.69747175899421]
Optimal Transport (OT) distances such as Wasserstein have been used in several areas such as GANs and domain adaptation.
We propose a computationally-efficient dual form of the robust OT optimization that is amenable to modern deep learning applications.
Our approach can train state-of-the-art GAN models on noisy datasets corrupted with outlier distributions.
arXiv Detail & Related papers (2020-10-12T17:13:40Z) - Revisiting LSTM Networks for Semi-Supervised Text Classification via
Mixed Objective Function [106.69643619725652]
We develop a training strategy that allows even a simple BiLSTM model, when trained with cross-entropy loss, to achieve competitive results.
We report state-of-the-art results for text classification task on several benchmark datasets.
arXiv Detail & Related papers (2020-09-08T21:55:22Z) - Deep Semi-supervised Knowledge Distillation for Overlapping Cervical
Cell Instance Segmentation [54.49894381464853]
We propose to leverage both labeled and unlabeled data for instance segmentation with improved accuracy by knowledge distillation.
We propose a novel Mask-guided Mean Teacher framework with Perturbation-sensitive Sample Mining.
Experiments show that the proposed method improves the performance significantly compared with the supervised method learned from labeled data only.
arXiv Detail & Related papers (2020-07-21T13:27:09Z) - Deep Active Learning for Biased Datasets via Fisher Kernel
Self-Supervision [5.352699766206807]
Active learning (AL) aims to minimize labeling efforts for data-demanding deep neural networks (DNNs)
We propose a low-complexity method for feature density matching using self-supervised Fisher kernel (FK)
Our method outperforms state-of-the-art methods on MNIST, SVHN, and ImageNet classification while requiring only 1/10th of processing.
arXiv Detail & Related papers (2020-03-01T03:56:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.