MFAI: A Scalable Bayesian Matrix Factorization Approach to Leveraging
Auxiliary Information
- URL: http://arxiv.org/abs/2303.02566v2
- Date: Mon, 12 Feb 2024 20:13:17 GMT
- Title: MFAI: A Scalable Bayesian Matrix Factorization Approach to Leveraging
Auxiliary Information
- Authors: Zhiwei Wang, Fa Zhang, Cong Zheng, Xianghong Hu, Mingxuan Cai, Can
Yang
- Abstract summary: We propose to integrate gradient boosted trees into the probabilistic matrix factorization framework to leverage auxiliary information (MFAI).
MFAI naturally inherits several salient features of gradient boosted trees, such as the capability of flexibly modeling nonlinear relationships.
MFAI is computationally efficient and scalable to large datasets by exploiting variational inference.
- Score: 8.42894516984735
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In various practical situations, matrix factorization methods suffer from
poor data quality, such as high data sparsity and low signal-to-noise ratio
(SNR). Here, we consider a matrix factorization problem by utilizing auxiliary
information, which is massively available in real-world applications, to
overcome the challenges caused by poor data quality. Unlike existing methods
that mainly rely on simple linear models to combine auxiliary information with
the main data matrix, we propose to integrate gradient boosted trees in the
probabilistic matrix factorization framework to effectively leverage auxiliary
information (MFAI). Thus, MFAI naturally inherits several salient features of
gradient boosted trees, such as the capability of flexibly modeling nonlinear
relationships and robustness to irrelevant features and missing values in
auxiliary information. The parameters in MFAI can be automatically determined
under the empirical Bayes framework, making it adaptive to the utilization of
auxiliary information and immune to overfitting. Moreover, MFAI is
computationally efficient and scalable to large datasets by exploiting
variational inference. We demonstrate the advantages of MFAI through
comprehensive numerical results from simulation studies and real data analyses.
Our approach is implemented in the R package mfair available at
https://github.com/YangLabHKUST/mfair.
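The core idea of combining a low-rank model Y = U V^T with boosted trees that predict the row factors from auxiliary features can be sketched in a few lines. The following is a simplified alternating scheme, not the empirical-Bayes/variational algorithm of the paper: it uses scikit-learn's GradientBoostingRegressor as the tree component, and the shrinkage weight `lam` and all problem dimensions are illustrative assumptions.

```python
# Hedged sketch: alternating least squares for Y ~ U V^T where the row
# factors U are shrunk toward predictions from gradient boosted trees
# fitted on auxiliary features X. Not the authors' MFAI algorithm.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n, m, k = 60, 40, 2
X = rng.normal(size=(n, 3))                 # auxiliary information per row
U_true = np.tanh(X[:, :k])                  # nonlinear link: factors depend on X
V_true = rng.normal(size=(m, k))
Y = U_true @ V_true.T + 0.1 * rng.normal(size=(n, m))

U = rng.normal(size=(n, k))
V = rng.normal(size=(m, k))
lam = 1.0                                   # shrinkage toward the tree fit (assumed)
for _ in range(20):
    # V-step: ordinary least squares given U
    V = np.linalg.solve(U.T @ U + 1e-6 * np.eye(k), U.T @ Y).T
    # tree-step: one boosted regressor per latent dimension
    F = np.column_stack([
        GradientBoostingRegressor(n_estimators=50).fit(X, U[:, j]).predict(X)
        for j in range(k)
    ])
    # U-step: ridge-like update pulling U toward the tree predictions F
    U = np.linalg.solve(V.T @ V + lam * np.eye(k), (Y @ V + lam * F).T).T

err = np.linalg.norm(Y - U @ V.T) / np.linalg.norm(Y)
print(round(err, 3))
```

The U-step minimizes ||Y - U V^T||^2 + lam ||U - F||^2 in closed form, which is how auxiliary information enters the factorization in this toy version; MFAI instead determines the analogous trade-off automatically under the empirical Bayes framework.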
Related papers
- LaFA: Latent Feature Attacks on Non-negative Matrix Factorization [3.45173496229657]
We introduce a novel class of attacks in NMF termed Latent Feature Attacks (LaFA)
Our method utilizes the Feature Error (FE) loss directly on the latent features.
To handle large peak-memory overhead gradient from back-propagation in FE attacks, we develop a method based on implicit differentiation.
arXiv Detail & Related papers (2024-08-07T17:13:46Z)
- Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
- Spectral Entry-wise Matrix Estimation for Low-Rank Reinforcement Learning [53.445068584013896]
We study matrix estimation problems arising in reinforcement learning (RL) with low-rank structure.
In low-rank bandits, the matrix to be recovered specifies the expected arm rewards, and for low-rank Markov Decision Processes (MDPs), it may for example characterize the transition kernel of the MDP.
We show that simple spectral-based matrix estimation approaches efficiently recover the singular subspaces of the matrix and exhibit nearly-minimal entry-wise error.
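A minimal spectral estimator of the kind referred to here is plain rank-r SVD truncation of the noisy observation matrix. The sketch below is an illustration of that baseline under assumed dimensions and noise level, not the paper's exact estimator or its error analysis.

```python
# Hedged illustration: truncated SVD as a simple spectral estimator of a
# noisy low-rank matrix (e.g. an expected-reward matrix in low-rank bandits).
import numpy as np

rng = np.random.default_rng(1)
n, m, r = 50, 50, 3
M = rng.normal(size=(n, r)) @ rng.normal(size=(r, m))   # true rank-r matrix
Y = M + 0.05 * rng.normal(size=(n, m))                  # noisy observations

U, s, Vt = np.linalg.svd(Y, full_matrices=False)
M_hat = U[:, :r] * s[:r] @ Vt[:r]                       # keep top r singular triplets

entrywise = np.max(np.abs(M_hat - M))                   # entry-wise (sup-norm) error
print(round(entrywise, 3))
```

The paper's point is that such spectral approaches recover the singular subspaces with nearly minimal entry-wise error, a stronger guarantee than the usual Frobenius-norm bounds.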
arXiv Detail & Related papers (2023-10-10T17:06:41Z)
- Large-scale gradient-based training of Mixtures of Factor Analyzers [67.21722742907981]
This article contributes both a theoretical analysis as well as a new method for efficient high-dimensional training by gradient descent.
We prove that MFA training and inference/sampling can be performed based on precision matrices, which does not require matrix inversions after training is completed.
Besides the theoretical analysis, we apply MFA to typical image datasets such as SVHN and MNIST, and demonstrate the ability to perform sample generation and outlier detection.
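The precision-matrix claim can be illustrated with a small Gaussian sampling example: given a precision matrix Lambda, one can draw samples from N(0, Lambda^{-1}) using only a Cholesky factor and a triangular system, with no explicit inversion. This is a generic sketch of that idea, not the MFA training procedure; the explicit inverse below is used only to verify the result.

```python
# Hedged sketch: sampling from a Gaussian parameterized by its precision
# matrix Lambda without inverting it. If Lam = L L^T (Cholesky), then
# x = L^{-T} z with z ~ N(0, I) has covariance Lam^{-1}.
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(5, 5))
Lam = A @ A.T + 5 * np.eye(5)          # a symmetric positive-definite precision

L = np.linalg.cholesky(Lam)            # Lam = L L^T
z = rng.normal(size=(5, 10000))
x = np.linalg.solve(L.T, z)            # solves L^T x = z; x ~ N(0, Lam^{-1})

# verification only: compare the empirical covariance to the true inverse
emp_cov = x @ x.T / z.shape[1]
err = np.abs(emp_cov - np.linalg.inv(Lam)).max()
print(round(err, 4))
```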
arXiv Detail & Related papers (2023-08-26T06:12:33Z)
- Quadratic Matrix Factorization with Applications to Manifold Learning [1.6795461001108094]
We propose a quadratic matrix factorization (QMF) framework to learn the curved manifold on which the dataset lies.
Algorithmically, we propose an alternating minimization algorithm to optimize QMF and establish its theoretical convergence properties.
Experiments on a synthetic manifold learning dataset and two real datasets, including the MNIST handwritten digit dataset and a cryogenic electron microscopy dataset, demonstrate the superiority of the proposed method over its competitors.
arXiv Detail & Related papers (2023-01-30T15:09:00Z)
- Non-Negative Matrix Factorization with Scale Data Structure Preservation [23.31865419578237]
The model described in this paper belongs to the family of non-negative matrix factorization methods designed for data representation and dimension reduction.
The idea is to add, to the NMF cost function, a penalty term to impose a scale relationship between the pairwise similarity matrices of the original and transformed data points.
The proposed clustering algorithm is compared to some existing NMF-based algorithms and to some manifold learning-based algorithms when applied to some real-life datasets.
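For reference, the NMF baseline that the scale-preserving penalty augments is the classical multiplicative-update scheme for Y = W H with W, H nonnegative. The sketch below shows only that baseline (the pairwise-similarity penalty term is omitted); dimensions and iteration count are illustrative.

```python
# Hedged sketch: Lee-Seung multiplicative updates for plain NMF, the base
# cost to which the paper adds a scale-relationship penalty (omitted here).
import numpy as np

rng = np.random.default_rng(2)
Y = np.abs(rng.normal(size=(30, 20)))   # nonnegative data matrix
k = 4
W = np.abs(rng.normal(size=(30, k)))
H = np.abs(rng.normal(size=(k, 20)))

for _ in range(200):
    # multiplicative updates keep W and H nonnegative by construction
    H *= (W.T @ Y) / (W.T @ W @ H + 1e-9)
    W *= (Y @ H.T) / (W @ H @ H.T + 1e-9)

err = np.linalg.norm(Y - W @ H) / np.linalg.norm(Y)
print(round(err, 3))
```

The penalty in the paper would add an extra term to the cost (and hence to these update rules) tying the pairwise similarity matrix of W's rows to that of the original data points.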
arXiv Detail & Related papers (2022-09-22T09:32:18Z)
- Unitary Approximate Message Passing for Matrix Factorization [90.84906091118084]
We consider matrix factorization (MF) with certain constraints, which finds wide applications in various areas.
We develop a Bayesian approach to MF with an efficient message passing implementation, called UAMPMF.
We show that UAMPMF significantly outperforms state-of-the-art algorithms in terms of recovery accuracy, robustness and computational complexity.
arXiv Detail & Related papers (2022-07-31T12:09:32Z)
- Data Fusion with Latent Map Gaussian Processes [0.0]
Multi-fidelity modeling and calibration are data fusion tasks that ubiquitously arise in engineering design.
We introduce a novel approach based on latent-map Gaussian processes (LMGPs) that enables efficient and accurate data fusion.
arXiv Detail & Related papers (2021-12-04T00:54:19Z)
- Feature Weighted Non-negative Matrix Factorization [92.45013716097753]
We propose the Feature weighted Non-negative Matrix Factorization (FNMF) in this paper.
FNMF learns the weights of features adaptively according to their importance.
It can be solved efficiently with the suggested optimization algorithm.
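A weighted variant of the multiplicative updates conveys the flavor of feature weighting: each feature (row of Y) carries a weight in the reconstruction loss. In the sketch below the weights are fixed and randomly chosen for illustration, whereas FNMF learns them adaptively; the update rules shown are the standard weighted-NMF ones, not the paper's algorithm.

```python
# Hedged sketch of feature-weighted NMF: minimize ||sqrt(a) * (Y - W H)||_F^2
# with per-feature weights a_i > 0 held fixed (FNMF would learn them).
import numpy as np

rng = np.random.default_rng(4)
Y = np.abs(rng.normal(size=(25, 15)))
k = 3
a = rng.uniform(0.1, 1.0, size=(25, 1))     # per-feature weights (assumed fixed)
W = np.abs(rng.normal(size=(25, k)))
H = np.abs(rng.normal(size=(k, 15)))

for _ in range(300):
    # weighted multiplicative updates: weights enter both numerator and denominator
    H *= (W.T @ (a * Y)) / (W.T @ (a * (W @ H)) + 1e-9)
    W *= ((a * Y) @ H.T) / ((a * (W @ H)) @ H.T + 1e-9)

werr = np.linalg.norm(np.sqrt(a) * (Y - W @ H)) / np.linalg.norm(np.sqrt(a) * Y)
print(round(werr, 3))
```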
arXiv Detail & Related papers (2021-03-24T21:17:17Z)
- Estimating Structural Target Functions using Machine Learning and Influence Functions [103.47897241856603]
We propose a new framework for statistical machine learning of target functions arising as identifiable functionals from statistical models.
This framework is problem- and model-agnostic and can be used to estimate a broad variety of target parameters of interest in applied statistics.
We put particular focus on so-called coarsening at random/doubly robust problems with partially unobserved information.
arXiv Detail & Related papers (2020-08-14T16:48:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.