Generalization in Multimodal Language Learning from Simulation
- URL: http://arxiv.org/abs/2108.02319v1
- Date: Tue, 3 Aug 2021 12:55:18 GMT
- Title: Generalization in Multimodal Language Learning from Simulation
- Authors: Aaron Eisermann, Jae Hee Lee, Cornelius Weber, Stefan Wermter
- Abstract summary: We investigate the influence of the underlying training data distribution on generalization in a minimal LSTM-based network trained in a supervised, time-continuous setting.
We find that compositional generalization fails in simple setups but improves with the number of objects and actions, and particularly with greater color overlap between objects.
- Score: 20.751952728808153
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural networks can be powerful function approximators, which are able to
model high-dimensional feature distributions from a subset of examples drawn
from the target distribution. Naturally, they perform well at generalizing
within the limits of their target function, but they often fail to generalize
outside of the explicitly learned feature space. It is therefore an open
research topic whether and how neural network-based architectures can be
deployed for systematic reasoning. Many studies have shown evidence for poor
generalization, but they often work with abstract data or are limited to
single-channel input. Humans, however, learn and interact through a combination
of multiple sensory modalities, and rarely rely on just one. To investigate
compositional generalization in a multimodal setting, we generate an extensible
dataset with multimodal input sequences from simulation. We investigate the
influence of the underlying training data distribution on compositional
generalization in a minimal LSTM-based network trained in a supervised,
time-continuous setting. We find that compositional generalization fails in
simple setups but improves with the number of objects and actions, and
particularly with greater color overlap between objects. Furthermore, multimodality
strongly improves compositional generalization in settings where a pure vision
model struggles to generalize.
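The setup described in the abstract lends itself to a compact illustration. Below is a minimal sketch, assuming PyTorch, of a small LSTM that fuses per-timestep visual features with a tokenized language instruction and receives a supervised target at every timestep (a time-continuous training signal). All layer sizes, feature dimensions, and the toy data are illustrative assumptions, not the authors' exact architecture or dataset.

```python
# Minimal sketch (not the paper's exact model) of a multimodal LSTM trained
# with a supervised target at every timestep ("time-continuous" setting).
import torch
import torch.nn as nn


class MultimodalLSTM(nn.Module):
    def __init__(self, vision_dim=64, lang_vocab=30, lang_dim=16,
                 hidden_dim=128, num_classes=10):
        super().__init__()
        # Word embeddings for the language channel (assumed tokenized input).
        self.lang_embed = nn.Embedding(lang_vocab, lang_dim)
        # Single LSTM over the concatenated vision + language features.
        self.lstm = nn.LSTM(vision_dim + lang_dim, hidden_dim, batch_first=True)
        # Per-timestep classifier, so every step has a supervised target.
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, vision_seq, lang_seq):
        # vision_seq: (batch, time, vision_dim) precomputed image features
        # lang_seq:   (batch, time) token ids, e.g. the instruction repeated per step
        lang = self.lang_embed(lang_seq)              # (batch, time, lang_dim)
        x = torch.cat([vision_seq, lang], dim=-1)     # fuse the two modalities
        out, _ = self.lstm(x)                         # (batch, time, hidden_dim)
        return self.head(out)                         # logits at every timestep


# Toy training step with random data, only to show the per-timestep loss.
model = MultimodalLSTM()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

vision = torch.randn(8, 20, 64)          # 8 sequences, 20 timesteps
lang = torch.randint(0, 30, (8, 20))     # token ids per timestep
targets = torch.randint(0, 10, (8, 20))  # a label for every timestep

logits = model(vision, lang)
loss = loss_fn(logits.reshape(-1, 10), targets.reshape(-1))
loss.backward()
optim.step()
```

Concatenating the language embedding to the visual features at every timestep is one simple fusion choice; the paper's finding is that such a multimodal input stream can help precisely where a vision-only model fails to generalize compositionally.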
Related papers
- Sequential Compositional Generalization in Multimodal Models [23.52949473093583]
We conduct a comprehensive assessment of several unimodal and multimodal models.
Our findings reveal that bi-modal and tri-modal models exhibit a clear edge over their text-only counterparts.
arXiv Detail & Related papers (2024-04-18T09:04:15Z) - On the generalization capacity of neural networks during generic
multimodal reasoning [20.1430673356983]
We evaluate and compare large language models' capacity for multimodal generalization.
For multimodal distractor and systematic generalization, either cross-modal attention or deeper attention layers are the key architectural features required to integrate multimodal inputs.
arXiv Detail & Related papers (2024-01-26T17:42:59Z) - Generalization and Estimation Error Bounds for Model-based Neural
Networks [78.88759757988761]
We show that the generalization abilities of model-based networks for sparse recovery outperform those of regular ReLU networks.
We derive practical design rules that allow constructing model-based networks with guaranteed high generalization.
arXiv Detail & Related papers (2023-04-19T16:39:44Z) - Neural Networks and the Chomsky Hierarchy [27.470857324448136]
We study whether insights from the Chomsky hierarchy of formal languages can predict the limits of neural network generalization in practice.
We show negative results where even extensive amounts of data and training time never led to any non-trivial generalization.
Our results show that, for our subset of tasks, RNNs and Transformers fail to generalize on non-regular tasks, and only networks augmented with structured memory can successfully generalize on context-free and context-sensitive tasks.
arXiv Detail & Related papers (2022-07-05T15:06:11Z) - On Neural Architecture Inductive Biases for Relational Tasks [76.18938462270503]
We introduce a simple architecture based on similarity-distribution scores which we name Compositional Relational Network (CoRelNet).
We find that simple architectural choices can outperform existing models in out-of-distribution generalization.
arXiv Detail & Related papers (2022-06-09T16:24:01Z) - CHALLENGER: Training with Attribution Maps [63.736435657236505]
We show that utilizing attribution maps for training neural networks can improve regularization of models and thus increase performance.
In particular, we show that our generic domain-independent approach yields state-of-the-art results in vision, natural language processing and on time series tasks.
arXiv Detail & Related papers (2022-05-30T13:34:46Z) - Learning Prototype-oriented Set Representations for Meta-Learning [85.19407183975802]
Learning from set-structured data is a fundamental problem that has recently attracted increasing attention.
This paper provides a novel optimal transport based way to improve existing summary networks.
We further instantiate it to the cases of few-shot classification and implicit meta generative modeling.
arXiv Detail & Related papers (2021-10-18T09:49:05Z) - Dynamic Inference with Neural Interpreters [72.90231306252007]
We present Neural Interpreters, an architecture that factorizes inference in a self-attention network as a system of modules.
Inputs to the model are routed through a sequence of functions in a way that is learned end-to-end.
We show that Neural Interpreters perform on par with the vision transformer using fewer parameters, while being transferable to a new task in a sample-efficient manner.
arXiv Detail & Related papers (2021-10-12T23:22:45Z) - Does language help generalization in vision models? [0.0]
We show that a visual model trained on a very large supervised image dataset (ImageNet-21k) can be as efficient for generalization as its multimodal counterpart (CLIP).
When compared to other standard visual or language models, the latent representations of BiT-M were found to be just as "linguistic" as those of CLIP.
arXiv Detail & Related papers (2021-04-16T18:54:14Z) - Neural Complexity Measures [96.06344259626127]
We propose Neural Complexity (NC), a meta-learning framework for predicting generalization.
Our model learns a scalar complexity measure through interactions with many heterogeneous tasks in a data-driven way.
arXiv Detail & Related papers (2020-08-07T02:12:10Z) - Identifying Critical Neurons in ANN Architectures using Mixed Integer
Programming [11.712073757744452]
We introduce a mixed integer program (MIP) for assigning importance scores to each neuron in deep neural network architectures.
We drive the solver to minimize the number of critical neurons (i.e., those with a high importance score) that need to be kept to maintain the overall accuracy of the trained neural network.
arXiv Detail & Related papers (2020-02-17T21:32:47Z)