Related papers: Does Vision Accelerate Hierarchical Generalization of Neural Language Learners?

Does Vision Accelerate Hierarchical Generalization of Neural Language Learners?

URL: http://arxiv.org/abs/2302.00667v1
Date: Wed, 1 Feb 2023 18:53:42 GMT
Title: Does Vision Accelerate Hierarchical Generalization of Neural Language Learners?
Authors: Tatsuki Kuribayashi
Abstract summary: We conducted two experiments toward the advantage of vision in the syntactic generalization of LMs. Our results showed that vision accelerated a proper linguistic generalization in the simplified, artificial setting, but LMs struggled with the noisy, realistic setting. These mixed results indicate several possibilities, e.g., vision can potentially boost language acquisition, but learners' additional visual/linguistic prior knowledge should be needed.
Score: 5.073880854565685
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Neural language models (LMs) are arguably less data-efficient than humans -- why does this gap occur? In this study, we hypothesize that this gap stems from the learners' accessibility to modalities other than text, specifically, vision. We conducted two complementary experiments (using noisy, realistic data and a simplified, artificial one) toward the advantage of vision in the syntactic generalization of LMs. Our results showed that vision accelerated a proper linguistic generalization in the simplified, artificial setting, but LMs struggled with the noisy, realistic setting. These mixed results indicate several possibilities, e.g., vision can potentially boost language acquisition, but learners' additional visual/linguistic prior knowledge should be needed to robustly make use of raw images for efficient language acquisition.

Related papers

MLLMs are Deeply Affected by Modality Bias [158.64371871084478]
Recent advances in Multimodal Large Language Models (MLLMs) have shown promising results in integrating diverse modalities such as texts and images.<n>MLLMs are heavily influenced by modality bias, often relying on language while under-utilizing other modalities like visual inputs.<n>This paper argues that MLLMs are deeply affected by modality bias, highlighting its manifestations across various tasks.
arXiv Detail & Related papers (2025-05-24T11:49:31Z)
HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning [14.038083767470019]
Embodied agents operating in smart homes must understand human behavior through diverse sensory inputs and communicate via natural language.<n>In this paper, we introduce HoloLLM, a Multimodal Large Language Model (MLLM) that integrates uncommon but powerful sensing modalities.<n>We show that HoloLLM significantly outperforms existing MLLMs, improving language-grounded human sensing accuracy by up to 30%.
arXiv Detail & Related papers (2025-05-23T09:06:09Z)
Can Language Models Learn Typologically Implausible Languages? [62.823015163987996]
Grammatical features across human languages show intriguing correlations often attributed to learning biases in humans. We discuss how language models (LMs) allow us to better determine the role of domain-general learning biases in language universals. We test LMs on an array of highly naturalistic but counterfactual versions of the English (head-initial) and Japanese (head-final) languages.
arXiv Detail & Related papers (2025-02-17T20:40:01Z)
Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance [67.26434607115392]
Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks. LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension. We propose LACING to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG)
arXiv Detail & Related papers (2024-11-21T16:33:30Z)
Exploring Spatial Schema Intuitions in Large Language and Vision Models [8.944921398608063]
We investigate whether large language models (LLMs) effectively capture implicit human intuitions about building blocks of language. Surprisingly, correlations between model outputs and human responses emerge, revealing adaptability without a tangible connection to embodied experiences. This research contributes to a nuanced understanding of the interplay between language, spatial experiences, and computations made by large language models.
arXiv Detail & Related papers (2024-02-01T19:25:50Z)
Divergences between Language Models and Human Brains [63.405788999891335]
Recent research has hinted that brain signals can be effectively predicted using internal representations of language models (LMs) We show that there are clear differences in how LMs and humans represent and use language. We identify two domains that are not captured well by LMs: social/emotional intelligence and physical commonsense.
arXiv Detail & Related papers (2023-11-15T19:02:40Z)
CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding [66.52659447360104]
CoVLM can guide the LLM to explicitly compose visual entities and relationships among the text. We propose CoVLM, which can guide the LLM to explicitly compose visual entities and relationships among the text.
arXiv Detail & Related papers (2023-11-06T18:59:44Z)
Visual Grounding Helps Learn Word Meanings in Low-Data Regimes [47.7950860342515]
Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension. But to achieve these results, LMs must be trained in distinctly un-human-like ways. Do models trained more naturalistically -- with grounded supervision -- exhibit more humanlike language learning? We investigate this question in the context of word learning, a key sub-task in language acquisition.
arXiv Detail & Related papers (2023-10-20T03:33:36Z)
Tackling Vision Language Tasks Through Learning Inner Monologues [10.795616787372625]
We propose a novel approach, Inner Monologue Multi-Modal Optimization (IMMO), to solve complex vision language problems. IMMO simulates inner monologue processes, a cognitive process in which an individual engages in silent verbal communication with themselves. The results suggest IMMO can enhance reasoning and explanation abilities, contributing to the more effective fusion of vision and language models.
arXiv Detail & Related papers (2023-08-19T10:10:49Z)
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs [101.50522135049198]
BuboGPT is a multi-modal LLM with visual grounding that can perform cross-modal interaction between vision, audio and language. Our contributions are two-fold: 1) An off-the-shelf visual grounding module based on SAM that extracts entities in a sentence and find corresponding masks in the image. Our experiments show that BuboGPT achieves impressive multi-modality understanding and visual grounding abilities during the interaction with human.
arXiv Detail & Related papers (2023-07-17T15:51:47Z)
Context Limitations Make Neural Language Models More Human-Like [32.488137777336036]
We show discrepancies in context access between modern neural language models (LMs) and humans in incremental sentence processing. Additional context limitation was needed to make LMs better simulate human reading behavior. Our analyses also showed that human-LM gaps in memory access are associated with specific syntactic constructions.
arXiv Detail & Related papers (2022-05-23T17:01:13Z)
Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes. With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech. We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
Presentation and Analysis of a Multimodal Dataset for Grounded Language Learning [32.28310581819443]
Grounded language acquisition involves learning how language-based interactions refer to the world around them. In practice the data used for learning tends to be cleaner, clearer, and more grammatical than actual human interactions. We present a dataset of common household objects described by people using either spoken or written language.
arXiv Detail & Related papers (2020-07-29T17:58:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.