Combining Pre-Trained Models for Enhanced Feature Representation in Reinforcement Learning
- URL: http://arxiv.org/abs/2507.07197v1
- Date: Wed, 09 Jul 2025 18:13:52 GMT
- Title: Combining Pre-Trained Models for Enhanced Feature Representation in Reinforcement Learning
- Authors: Elia Piccoli, Malio Li, Giacomo Carfì, Vincenzo Lomonaco, Davide Bacciu
- Abstract summary: Reinforcement Learning (RL) focuses on maximizing the cumulative reward obtained via the agent's interaction with the environment. We propose Weight Sharing Attention (WSA), a new architecture that combines embeddings of multiple pre-trained models to shape an enriched state representation.
- Score: 16.04558746520946
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent focus on and release of pre-trained models have been key components of several advancements in many fields (e.g. Natural Language Processing and Computer Vision): pre-trained models learn disparate latent embeddings that share insightful representations. On the other hand, Reinforcement Learning (RL) focuses on maximizing the cumulative reward obtained via the agent's interaction with the environment. RL agents do not have any prior knowledge about the world, and they either learn from scratch an end-to-end mapping between the observation and action spaces or, in more recent works, are paired with monolithic and computationally expensive Foundational Models. How to effectively combine and leverage the hidden information of different pre-trained models simultaneously in RL is still an open and understudied question. In this work, we propose Weight Sharing Attention (WSA), a new architecture that combines the embeddings of multiple pre-trained models to shape an enriched state representation, balancing the tradeoff between efficiency and performance. We run an extensive comparison of several combination modes, showing that WSA obtains performance comparable to end-to-end models on multiple Atari games. Furthermore, we study the generalization capabilities of this approach and analyze how scaling the number of models influences agents' performance during and after training.
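As a rough illustration of the fusion idea described in the abstract, the sketch below shows one way embeddings from several frozen pre-trained encoders could be projected to a common size and pooled through a single shared attention layer into a state vector for an RL agent. The module name, dimensions, and learnable-query pooling are illustrative assumptions, not the paper's exact WSA implementation.

```python
# Minimal sketch (assumption): fusing embeddings from several frozen pre-trained
# encoders with one shared attention layer to build an RL state representation.
# Names, dimensions, and the pooling scheme are illustrative, not the paper's exact WSA.
import torch
import torch.nn as nn


class WeightSharingAttentionFusion(nn.Module):
    def __init__(self, embed_dims, d_model=256, n_heads=4):
        super().__init__()
        # One projection per pre-trained model, mapping each embedding to a common size.
        self.projections = nn.ModuleList(nn.Linear(d, d_model) for d in embed_dims)
        # A single attention block shared across all models' embeddings.
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Learnable query that pools the set of embeddings into one state vector.
        self.query = nn.Parameter(torch.randn(1, 1, d_model))

    def forward(self, embeddings):
        # embeddings: list of tensors, one per pre-trained model, each (batch, dim_i).
        tokens = torch.stack(
            [proj(e) for proj, e in zip(self.projections, embeddings)], dim=1
        )  # (batch, n_models, d_model)
        query = self.query.expand(tokens.size(0), -1, -1)
        fused, _ = self.attention(query, tokens, tokens)
        return fused.squeeze(1)  # enriched state representation, (batch, d_model)


if __name__ == "__main__":
    # Example: three frozen encoders with different embedding sizes.
    fusion = WeightSharingAttentionFusion(embed_dims=[512, 768, 1024])
    embeddings = [torch.randn(8, 512), torch.randn(8, 768), torch.randn(8, 1024)]
    state = fusion(embeddings)
    print(state.shape)  # torch.Size([8, 256])
```

The resulting state vector could then be fed to any standard policy or value network in place of a visual encoder learned from scratch.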
Related papers
- GRAM: A Generative Foundation Reward Model for Reward Generalization [48.63394690265176]
We develop a generative reward model that is first trained via large-scale unsupervised learning and then fine-tuned via supervised learning. This model generalizes well across several tasks, including response ranking, reinforcement learning from human feedback, and task adaptation with fine-tuning.
arXiv Detail & Related papers (2025-06-17T04:34:27Z) - AXIOM: Learning to Play Games in Minutes with Expanding Object-Centric Models [41.429595107023125]
AXIOM is a novel architecture that integrates a minimal yet expressive set of core priors about object-centric dynamics and interactions. It combines the usual data efficiency and interpretability of Bayesian approaches with the across-task generalization usually associated with DRL. AXIOM masters various games within only 10,000 interaction steps, with both a small number of parameters compared to DRL, and without the computational expense of gradient-based optimization.
arXiv Detail & Related papers (2025-05-30T16:46:20Z) - Platonic Grounding for Efficient Multimodal Language Models [22.715168904364756]
We motivate and propose a simple modification to existing multimodal frameworks that rely on aligning pretrained models. Our work also has implications for combining pretrained models into larger systems efficiently.
arXiv Detail & Related papers (2025-04-27T18:56:26Z) - Scaling Laws for Native Multimodal Models [53.490942903659565]
We revisit the architectural design of native multimodal models and conduct an extensive scaling laws study. Our investigation reveals no inherent advantage to late-fusion architectures over early-fusion ones. We show that incorporating Mixture of Experts (MoEs) allows models to learn modality-specific weights, significantly benefiting performance.
arXiv Detail & Related papers (2025-04-10T17:57:28Z) - Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining [74.83412846804977]
Reinforcement learning (RL)-based fine-tuning has become a crucial step in post-training language models. We present a systematic end-to-end study of RL fine-tuning for mathematical reasoning by training models entirely from scratch.
arXiv Detail & Related papers (2025-04-10T17:15:53Z) - Active Data Curation Effectively Distills Large-Scale Multimodal Models [66.23057263509027]
Knowledge distillation (KD) is the de facto standard for compressing large-scale models into smaller ones. In this work we explore an alternative, yet simple approach: active data curation as effective distillation for contrastive multimodal pretraining. Our simple online batch selection method, ACID, outperforms strong KD baselines across various model-, data- and compute-configurations.
arXiv Detail & Related papers (2024-11-27T18:50:15Z) - A Collaborative Ensemble Framework for CTR Prediction [73.59868761656317]
We propose a novel framework, Collaborative Ensemble Training Network (CETNet), to leverage multiple distinct models.
Unlike naive model scaling, our approach emphasizes diversity and collaboration through collaborative learning.
We validate our framework on three public datasets and a large-scale industrial dataset from Meta.
arXiv Detail & Related papers (2024-11-20T20:38:56Z) - Has Your Pretrained Model Improved? A Multi-head Posterior Based Approach [25.927323251675386]
We leverage the meta-features associated with each entity as a source of worldly knowledge and employ entity representations from the models.
We propose using the consistency between these representations and the meta-features as a metric for evaluating pre-trained models.
Our method's effectiveness is demonstrated across various domains, including models with relational datasets, large language models and image models.
arXiv Detail & Related papers (2024-01-02T17:08:26Z) - Fantastic Gains and Where to Find Them: On the Existence and Prospect of General Knowledge Transfer between Any Pretrained Model [74.62272538148245]
We show that for arbitrary pairings of pretrained models, one model extracts significant data context unavailable in the other.
We investigate if it is possible to transfer such "complementary" knowledge from one model to another without performance degradation.
arXiv Detail & Related papers (2023-10-26T17:59:46Z)