$\text{M}^{\text{3}}$: A Modular World Model over Streams of Tokens
- URL: http://arxiv.org/abs/2502.11537v2
- Date: Thu, 20 Feb 2025 10:35:54 GMT
- Title: $\text{M}^{\text{3}}$: A Modular World Model over Streams of Tokens
- Authors: Lior Cohen, Kaixin Wang, Bingyi Kang, Uri Gadot, Shie Mannor
- Abstract summary: Token-based world models emerged as a promising modular framework, modeling dynamics over token streams while optimizing tokenization separately.
In this paper, we introduce $\text{M}^{\text{3}}$, a $\textbf{m}$odular $\textbf{w}$orld $\textbf{m}$odel that extends this framework.
$\text{M}^{\text{3}}$ integrates several improvements from existing literature to enhance agent performance.
- Score: 51.65485693709418
- Abstract: Token-based world models emerged as a promising modular framework, modeling dynamics over token streams while optimizing tokenization separately. While successful in visual environments with discrete actions (e.g., Atari games), their broader applicability remains uncertain. In this paper, we introduce $\text{M}^{\text{3}}$, a $\textbf{m}$odular $\textbf{w}$orld $\textbf{m}$odel that extends this framework, enabling flexible combinations of observation and action modalities through independent modality-specific components. $\text{M}^{\text{3}}$ integrates several improvements from existing literature to enhance agent performance. Through extensive empirical evaluation across diverse benchmarks, $\text{M}^{\text{3}}$ achieves state-of-the-art sample efficiency for planning-free world models. Notably, among these methods, it is the first to reach a human-level median score on Atari 100K, with superhuman performance on 13 games. Our code and model weights are publicly available at https://github.com/leor-c/M3.
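To make the framework concrete, below is a minimal sketch of a token-based world model with modality-specific components: a per-modality tokenizer feeding an autoregressive sequence model over the token stream. The class names, shapes, and the random placeholder tokenizer are illustrative assumptions, not the API of the released $\text{M}^{\text{3}}$ code.
```python
import torch
import torch.nn as nn

# Illustrative sketch of a modular token-based world model (not the M^3 codebase).
# Each modality gets its own tokenizer; one sequence model predicts the next tokens.

class ObservationTokenizer(nn.Module):
    """Maps raw observations to discrete token ids (a learned VQ model in practice)."""
    def __init__(self, vocab_size: int, tokens_per_obs: int):
        super().__init__()
        self.vocab_size, self.tokens_per_obs = vocab_size, tokens_per_obs

    def encode(self, obs: torch.Tensor) -> torch.LongTensor:
        # Placeholder: real tokenizers are trained separately from the dynamics model.
        return torch.randint(0, self.vocab_size, (obs.shape[0], self.tokens_per_obs))

class TokenDynamics(nn.Module):
    """Autoregressive model over the interleaved observation/action token stream."""
    def __init__(self, vocab_size: int, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.LongTensor) -> torch.Tensor:
        h = self.backbone(self.embed(tokens))  # causal masking omitted for brevity
        return self.head(h)                    # next-token logits

tokenizer = ObservationTokenizer(vocab_size=512, tokens_per_obs=16)
dynamics = TokenDynamics(vocab_size=512)
obs = torch.randn(1, 3, 64, 64)                # e.g., one image observation
logits = dynamics(tokenizer.encode(obs))       # predict the next tokens in the stream
```
Swapping in a different tokenizer (for another observation or action modality) changes the interface without touching the dynamics model, which is the modularity the abstract emphasizes.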
Related papers
- Scaling Embedding Layers in Language Models [52.47659840377581]
SCONE enables two new scaling strategies: increasing the number of cached $n$-gram embeddings and scaling the model used to learn them, all while maintaining fixed inference-time FLOPS.
We show that scaling both aspects allows SCONE to outperform a 1.9B parameter baseline across diverse corpora, while using only half the inference-time FLOPS.
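A minimal sketch of the underlying idea, using a hashed bigram table as a stand-in for a cache of learned $n$-gram embeddings; the class name and hashing scheme are illustrative assumptions, not SCONE's actual design.
```python
import torch
import torch.nn as nn

# Sketch: augment per-token embeddings with looked-up n-gram (here bigram) embeddings.
# The bigram table is only indexed at inference, so growing it adds capacity
# without adding inference-time FLOPS.

class NGramAugmentedEmbedding(nn.Module):
    def __init__(self, vocab_size: int, num_bigram_slots: int, d_model: int):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Hashed bigram table standing in for a cache of learned n-gram embeddings.
        self.bigram_emb = nn.Embedding(num_bigram_slots, d_model)
        self.num_slots, self.vocab_size = num_bigram_slots, vocab_size

    def forward(self, ids: torch.LongTensor) -> torch.Tensor:
        tok = self.token_emb(ids)
        prev = torch.roll(ids, shifts=1, dims=-1)   # previous token id
        # (position 0 wraps around here; a pad id would be used in practice)
        slot = (prev * self.vocab_size + ids) % self.num_slots  # hash the bigram
        return tok + self.bigram_emb(slot)          # cheap additive fusion

emb = NGramAugmentedEmbedding(vocab_size=1000, num_bigram_slots=65536, d_model=64)
x = emb(torch.randint(0, 1000, (1, 8)))            # (1, 8, 64)
```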
arXiv Detail & Related papers (2025-02-03T18:59:32Z)
- TLDR: Token-Level Detective Reward Model for Large Vision Language Models [57.41524422460438]
Existing reward models merely mimic human annotations, assigning a single binary feedback to an entire text.
We propose a $\textbf{T}$oken-$\textbf{L}$evel $\textbf{D}$etective $\textbf{R}$eward Model to provide fine-grained annotations to each text token.
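A minimal sketch of a per-token reward head of this kind, assuming hidden states from some backbone model; the names and toy labels below are illustrative.
```python
import torch
import torch.nn as nn

# Sketch: a token-level reward head that scores every token instead of
# emitting one scalar per sequence (illustrative, not the paper's code).

class TokenRewardHead(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.score = nn.Linear(d_model, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model) from a frozen or fine-tuned backbone
        return self.score(hidden).squeeze(-1)   # (batch, seq_len) per-token rewards

head = TokenRewardHead(d_model=768)
rewards = head(torch.randn(2, 10, 768))         # fine-grained feedback per token
loss = nn.functional.binary_cross_entropy_with_logits(
    rewards, torch.randint(0, 2, (2, 10)).float())  # toy per-token binary labels
```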
arXiv Detail & Related papers (2024-10-07T04:00:22Z)
- Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models [22.425339110551743]
We introduce $\textit{weak-to-strong search}$, framing the alignment of a large language model as a test-time greedy search.
In controlled-sentiment generation and summarization, we use tuned and untuned $\texttt{gpt2}$s to improve the alignment of large models without additional training.
In a more difficult instruction-following benchmark, we show that reusing off-the-shelf small models can improve the length-controlled win rates of both white-box and black-box large models.
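A hedged sketch of the core guidance signal, assuming Hugging Face-style causal LMs whose forward pass with `labels` returns mean token NLL; the paper searches greedily during decoding, whereas this simplification merely rescores complete candidates.
```python
import torch

# Sketch of the weak-to-strong idea: prefer candidates that the small tuned
# model likes much more than its untuned counterpart.

@torch.no_grad()
def seq_logprob(model, tokenizer, text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    # Approximate total log-probability from the mean token NLL.
    return -model(ids, labels=ids).loss.item() * ids.shape[1]

@torch.no_grad()
def pick_candidate(candidates, small_tuned, small_untuned, tokenizer, beta=1.0):
    # Guidance signal: log p_tuned(x) - log p_untuned(x), scaled by beta.
    scores = [
        beta * (seq_logprob(small_tuned, tokenizer, c)
                - seq_logprob(small_untuned, tokenizer, c))
        for c in candidates
    ]
    return candidates[max(range(len(candidates)), key=scores.__getitem__)]
```
Because the signal depends only on candidate texts and two small models, it applies to both white-box and black-box large models, as the summary notes.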
arXiv Detail & Related papers (2024-05-29T16:55:32Z)
- Connecting the Dots: Collaborative Fine-tuning for Black-Box Vision-Language Models [121.0693322732454]
This paper proposes a $\textbf{CraFT}$ approach for fine-tuning black-box vision-language models to downstream tasks.
CraFT comprises two modules, a prompt generation module for learning text prompts and a prediction refinement module for enhancing output predictions in residual style.
Experiments on few-shot classification over 15 datasets demonstrate the superiority of CraFT.
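A minimal sketch of the residual-style refinement module, assuming only the black-box model's output logits are observable; the prompt generation module is omitted and all names are illustrative.
```python
import torch
import torch.nn as nn

# Sketch of residual-style prediction refinement: a small learned module
# adjusts the black-box model's logits without touching its weights.

class PredictionRefiner(nn.Module):
    def __init__(self, num_classes: int, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_classes, hidden), nn.ReLU(), nn.Linear(hidden, num_classes))

    def forward(self, blackbox_logits: torch.Tensor) -> torch.Tensor:
        # Residual connection: refined = original + learned correction.
        return blackbox_logits + self.mlp(blackbox_logits)

refiner = PredictionRefiner(num_classes=15)
refined = refiner(torch.randn(4, 15))   # gradients flow only through the refiner
```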
arXiv Detail & Related papers (2024-02-06T14:53:19Z)
- M$^{2}$Chat: Empowering VLM for Multimodal LLM Interleaved Text-Image Generation [45.79215260916687]
We propose M$^{2}$Chat, a novel unified multimodal LLM framework for generating interleaved text-image conversation.
M$^{3}$Adapter integrates granular low-level visual information and high-level semantic features from multi-modality prompts.
The M$^{3}$FT fine-tuning strategy optimizes disjoint groups of parameters for image-text alignment and visual-instruction tuning.
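As a rough illustration of optimizing disjoint parameter groups in two stages; the module split and hyperparameters below are placeholders, not M$^{2}$Chat's actual components.
```python
import torch
import torch.nn as nn

# Sketch: two-stage fine-tuning over disjoint parameter groups.

alignment_adapter = nn.Linear(64, 64)   # stands in for image-text alignment params
instruction_head = nn.Linear(64, 32)    # stands in for visual-instruction params

# Stage 1: update only the alignment group.
opt_align = torch.optim.AdamW(alignment_adapter.parameters(), lr=1e-4)

# Stage 2: update only the instruction group, with the alignment group frozen.
for p in alignment_adapter.parameters():
    p.requires_grad_(False)
opt_instruct = torch.optim.AdamW(instruction_head.parameters(), lr=5e-5)
```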
arXiv Detail & Related papers (2023-11-29T11:30:33Z)
- Unlocking Emergent Modularity in Large Language Models [27.12431620957652]
We show that standard Language Models (LMs) can be fine-tuned as their Mixture-of-Experts (MoE) counterparts without introducing any extra parameters.
Our experiments demonstrate that fine-tuning EMoE (Emergent MoE) effectively improves downstream in-domain and out-of-domain generalization compared with vanilla fine-tuning.
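A minimal sketch of the parameter-free idea: slice a pretrained FFN's hidden units into experts and route each token to one slice. The mean-pre-activation gate below is a simple stand-in, not the paper's exact routing rule.
```python
import torch

# Sketch: an "emergent" MoE forward pass built from an existing FFN's weights.
# No new parameters; every expert is a slice of the original matrices.

def emoe_forward(x, w_in, w_out, num_experts=4):
    # x: (tokens, d); w_in: (d, d_ff); w_out: (d_ff, d)
    d_ff = w_in.shape[1]
    chunk = d_ff // num_experts
    pre = x @ w_in                                   # (tokens, d_ff)
    # Parameter-free top-1 gate: route by each slice's mean pre-activation.
    expert = pre.view(-1, num_experts, chunk).mean(-1).argmax(-1)  # (tokens,)
    out = torch.zeros_like(x)
    for e in range(num_experts):
        sel = expert == e
        if sel.any():
            h = torch.relu(pre[sel, e * chunk:(e + 1) * chunk])   # expert slice
            out[sel] = h @ w_out[e * chunk:(e + 1) * chunk, :]
    return out

x = torch.randn(5, 16)
w_in, w_out = torch.randn(16, 64), torch.randn(64, 16)
y = emoe_forward(x, w_in, w_out)   # (5, 16)
```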
arXiv Detail & Related papers (2023-10-17T01:02:32Z)
- Simplifying and Understanding State Space Models with Diagonal Linear RNNs [56.33053691749856]
This work disposes of the discretization step, and proposes a model based on vanilla Diagonal Linear RNNs.
We empirically show that, despite being conceptually much simpler, $\mathrm{DLR}$ is as performant as previously proposed SSMs.
We also characterize the expressivity of SSMs and attention-based models via a suite of $13$ synthetic sequence-to-sequence tasks.
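A minimal sketch of the recurrence, $x_t = \Lambda x_{t-1} + B u_t$ with a diagonal (elementwise) complex $\Lambda$ and no discretization step; the sequential scan below is for clarity, while practical SSM implementations compute it in parallel.
```python
import torch

# Sketch of a Diagonal Linear RNN layer: the state transition is elementwise.

def dlr_scan(u, lam, B, C):
    # u: (T, d_in); lam: (d_state,) complex; B: (d_state, d_in); C: (d_out, d_state)
    x = torch.zeros(lam.shape[0], dtype=lam.dtype)
    ys = []
    for u_t in u:
        x = lam * x + (B @ u_t.to(lam.dtype))   # diagonal (elementwise) transition
        ys.append((C @ x).real)                 # read out, keep the real part
    return torch.stack(ys)

d_in, d_state, d_out, T = 4, 8, 3, 10
# |lam| = exp(-a) < 1 for a > 0, keeping the recurrence stable.
lam = torch.exp(torch.complex(-torch.rand(d_state), torch.rand(d_state)))
B = torch.randn(d_state, d_in, dtype=torch.cfloat)
C = torch.randn(d_out, d_state, dtype=torch.cfloat)
y = dlr_scan(torch.randn(T, d_in), lam, B, C)   # (T, d_out)
```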
arXiv Detail & Related papers (2022-12-01T18:53:06Z)
- Improving Robustness and Generality of NLP Models Using Disentangled Representations [62.08794500431367]
Supervised neural networks first map an input $x$ to a single representation $z$, and then map $z$ to the output label $y$.
We present methods to improve robustness and generality of NLP models from the standpoint of disentangled representation learning.
We show that models trained with the proposed criteria provide better robustness and domain adaptation ability in a wide range of supervised learning tasks.
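A generic sketch of the $x \to z \to y$ pipeline with $z$ split into sub-representations, the usual starting point for disentanglement-based methods; the paper's specific training criteria are not reproduced here.
```python
import torch
import torch.nn as nn

# Sketch: encode x into several sub-representations (z_1, ..., z_k) rather than
# a single monolithic z, then map the parts to the label y.

class DisentangledClassifier(nn.Module):
    def __init__(self, d_in, d_z, num_parts, num_classes):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_z * num_parts)       # x -> (z_1, ..., z_k)
        self.num_parts, self.d_z = num_parts, d_z
        self.head = nn.Linear(d_z * num_parts, num_classes)   # z -> y

    def forward(self, x):
        z = self.encoder(x).view(x.shape[0], self.num_parts, self.d_z)
        # Return the parts too, so a disentanglement penalty can be applied to them.
        return self.head(z.flatten(1)), z

model = DisentangledClassifier(d_in=32, d_z=8, num_parts=4, num_classes=2)
logits, z_parts = model(torch.randn(6, 32))
```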
arXiv Detail & Related papers (2020-09-21T02:48:46Z)