Feature Augmentations for High-Dimensional Learning
- URL: http://arxiv.org/abs/2509.00232v1
- Date: Fri, 29 Aug 2025 20:50:16 GMT
- Title: Feature Augmentations for High-Dimensional Learning
- Authors: Xiaonan Zhu, Bingyan Wang, Jianqing Fan
- Abstract summary: We propose a technique to enhance the performance of supervised learning algorithms by augmenting features with factors extracted from design matrices and their transformations. This is implemented by using the factors and idiosyncratic residuals, which significantly weaken the correlations between input variables and hence improve the interpretability and numerical stability of learning algorithms.
- Score: 9.20063546548507
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-dimensional measurements are often correlated, which motivates their approximation by factor models. This also holds true when features are engineered via low-dimensional interactions or kernel tricks. Such engineering often results in over-parametrization and requires fast dimensionality reduction. We propose a simple technique to enhance the performance of supervised learning algorithms by augmenting features with factors extracted from design matrices and their transformations. This is implemented by using the factors and idiosyncratic residuals, which significantly weaken the correlations between input variables and hence improve the interpretability and numerical stability of learning algorithms. Extensive experiments are carried out on various algorithms and real-world data in diverse fields, with special emphasis on the stock return prediction problem with Chinese financial news data, motivated by the increasing interest in NLP problems in financial studies. We verify the capability of the proposed feature augmentation approach to boost overall prediction performance with the same algorithm. The approach bridges a gap overlooked in previous studies, which focus either on collecting additional data or on constructing more powerful algorithms, whereas our method lies between these two directions using a simple PCA augmentation.
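The core augmentation step amounts to a PCA decomposition of the design matrix into common factors and idiosyncratic residuals, which are then stacked as the new input features. Below is a minimal sketch of this idea, assuming scikit-learn's PCA; the function name `augment_with_factors` and the default `n_factors` are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np
from sklearn.decomposition import PCA

def augment_with_factors(X, n_factors=5):
    """Hypothetical sketch: augment X with PCA factors and idiosyncratic residuals.

    Extracts `n_factors` principal-component factors from the design matrix
    X (n_samples x n_features), removes the implied low-rank common component
    from each feature, and returns the augmented matrix [factors, residuals].
    """
    pca = PCA(n_components=n_factors)
    factors = pca.fit_transform(X)                   # factor scores F (n x k)
    common = factors @ pca.components_ + pca.mean_   # low-rank approximation of X
    residuals = X - common                           # idiosyncratic parts U (n x p)
    return np.hstack([factors, residuals])           # augmented features [F, U]

# Illustrative usage with any supervised learner:
# X_aug = augment_with_factors(X_train, n_factors=10)
# model.fit(X_aug, y_train)
```

Because the extracted factor scores capture most of the shared variation and the residuals are what remains after that common component is removed, the stacked columns [F, U] are far less correlated than the raw features, which is the decorrelation and stability effect the abstract emphasizes.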
Related papers
- Efficient Machine Unlearning via Influence Approximation [75.31015485113993]
Influence-based unlearning has emerged as a prominent approach to estimate the impact of individual training samples on model parameters without retraining. This paper establishes a theoretical link between memorizing (incremental learning) and forgetting (unlearning). We introduce the Influence Approximation Unlearning algorithm for efficient machine unlearning from the incremental perspective.
arXiv Detail & Related papers (2025-07-31T05:34:27Z) - Newfluence: Boosting Model Interpretability and Understanding in High Dimensions [17.73837631710377]
We introduce an alternative approximation, called Newfluence, that maintains similar computational efficiency while offering significantly improved accuracy. Newfluence is expected to provide more accurate insights than many existing methods for interpreting complex AI models.
arXiv Detail & Related papers (2025-07-16T04:22:16Z) - Fairness-Driven LLM-based Causal Discovery with Active Learning and Dynamic Scoring [1.5498930424110338]
Causal discovery (CD) plays a pivotal role in numerous scientific fields by clarifying the causal relationships that underlie phenomena observed in diverse disciplines. Despite significant advancements in CD algorithms, their application faces challenges due to the high computational demands and complexities of large-scale data. This paper introduces a framework that leverages Large Language Models (LLMs) for CD, utilizing a metadata-based approach akin to the reasoning processes of human experts.
arXiv Detail & Related papers (2025-03-21T22:58:26Z) - Enhancing Feature Selection and Interpretability in AI Regression Tasks Through Feature Attribution [38.53065398127086]
This study investigates the potential of feature attribution methods to filter out uninformative features in input data for regression problems.
We introduce a feature selection pipeline that combines Integrated Gradients with k-means clustering to select an optimal set of variables from the initial data space.
To validate the effectiveness of this approach, we apply it to a real-world industrial problem - blade vibration analysis in the development process of turbo machinery.
arXiv Detail & Related papers (2024-09-25T09:50:51Z) - Semantic-Preserving Feature Partitioning for Multi-View Ensemble Learning [11.415864885658435]
We introduce the Semantic-Preserving Feature Partitioning (SPFP) algorithm, a novel method grounded in information theory.
The SPFP algorithm effectively partitions datasets into multiple semantically consistent views, enhancing the multi-view ensemble learning process.
It maintains model accuracy while significantly improving uncertainty measures in scenarios where high generalization performance is achievable.
arXiv Detail & Related papers (2024-01-11T20:44:45Z) - Efficient Model-Free Exploration in Low-Rank MDPs [76.87340323826945]
Low-Rank Markov Decision Processes offer a simple, yet expressive framework for RL with function approximation.
Existing algorithms are either (1) computationally intractable, or (2) reliant upon restrictive statistical assumptions.
We propose the first provably sample-efficient algorithm for exploration in Low-Rank MDPs.
arXiv Detail & Related papers (2023-07-08T15:41:48Z) - Representation Learning with Multi-Step Inverse Kinematics: An Efficient and Optimal Approach to Rich-Observation RL [106.82295532402335]
Existing reinforcement learning algorithms suffer from computational intractability, strong statistical assumptions, and suboptimal sample complexity.
We provide the first computationally efficient algorithm that attains rate-optimal sample complexity with respect to the desired accuracy level.
Our algorithm, MusIK, combines systematic exploration with representation learning based on multi-step inverse kinematics.
arXiv Detail & Related papers (2023-04-12T14:51:47Z) - Offline Reinforcement Learning with Differentiable Function Approximation is Provably Efficient [65.08966446962845]
Offline reinforcement learning, which aims at optimizing decision-making strategies with historical data, has been extensively applied in real-life applications. We take a step forward by considering offline reinforcement learning with differentiable function class approximation (DFA).
Most importantly, we show offline differentiable function approximation is provably efficient by analyzing the pessimistic fitted Q-learning algorithm.
arXiv Detail & Related papers (2022-10-03T07:59:42Z) - Improved Algorithms for Neural Active Learning [74.89097665112621]
We improve the theoretical and empirical performance of neural-network (NN)-based active learning algorithms for the non-parametric streaming setting.
We introduce two regret metrics, defined by minimizing the population loss, that are more suitable for active learning than the one used in state-of-the-art (SOTA) related work.
arXiv Detail & Related papers (2022-10-02T05:03:38Z) - Automatic Data Augmentation via Invariance-Constrained Learning [94.27081585149836]
Underlying data structures, such as symmetries or invariance to transformations, are often exploited to improve the solution of learning tasks.
Data augmentation induces these symmetries during training by applying multiple transformations to the input data.
This work tackles these issues by automatically adapting the data augmentation while solving the learning task.
arXiv Detail & Related papers (2022-09-29T18:11:01Z) - Statistically Guided Divide-and-Conquer for Sparse Factorization of Large Matrix [2.345015036605934]
We formulate the statistical problem as a sparse factor regression and tackle it with a divide-and-conquer approach.
In the first stage of division, we consider parallel approaches for simplifying the task into a set of co-sparse unit-rank estimation (CURE) problems.
In the second stage, we innovate a stagewise learning technique, consisting of a sequence of simple incremental steps, to efficiently trace out the whole solution path of CURE.
arXiv Detail & Related papers (2020-03-17T19:12:21Z)