The Curious Case of Visual Grounding: Different Effects for Speech- and Text-based Language Encoders
- URL: http://arxiv.org/abs/2509.15837v1
- Date: Fri, 19 Sep 2025 10:16:58 GMT
- Title: The Curious Case of Visual Grounding: Different Effects for Speech- and Text-based Language Encoders
- Authors: Adrian Sauter, Willem Zuidema, Marianne de Heer Kloots
- Abstract summary: We show that visual grounding affects model-internal representations of words. We also find different effects in speech- vs. text-based language encoders. Our findings could usefully inform the development of more efficient methods to enrich speech-based models with visually-informed semantics.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How does visual information included in training affect language processing in audio- and text-based deep learning models? We explore how such visual grounding affects model-internal representations of words, and find substantially different effects in speech- vs. text-based language encoders. Firstly, global representational comparisons reveal that visual grounding increases alignment between representations of spoken and written language, but this effect seems mainly driven by enhanced encoding of word identity rather than meaning. We then apply targeted clustering analyses to probe for phonetic vs. semantic discriminability in model representations. Speech-based representations remain phonetically dominated with visual grounding, but in contrast to text-based representations, visual grounding does not improve semantic discriminability. Our findings could usefully inform the development of more efficient methods to enrich speech-based models with visually-informed semantics.
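The targeted clustering analyses mentioned in the abstract can be illustrated with a minimal sketch: compare how well a set of word representations separates under a phonetic labeling vs. a semantic labeling. Everything here is hypothetical (toy 2-D vectors and made-up labels, not the paper's actual data or metric); the idea is simply that a higher within-class than between-class similarity indicates that a labeling is reflected in the representational geometry.

```python
from itertools import combinations
from math import sqrt

def cos(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def discriminability(vectors, labels):
    """Mean within-class minus mean between-class cosine similarity.
    Positive values mean the labeling is reflected in the geometry."""
    within, between = [], []
    for (i, u), (j, v) in combinations(enumerate(vectors), 2):
        (within if labels[i] == labels[j] else between).append(cos(u, v))
    return sum(within) / len(within) - sum(between) / len(between)

# Toy word vectors (hypothetical, 2-D for illustration only).
vecs = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]
phonetic = ["b", "b", "k", "k"]                  # grouped by initial sound
semantic = ["animal", "tool", "animal", "tool"]  # grouped by meaning
# These toy vectors cluster by phonetic label, not semantic label:
print(discriminability(vecs, phonetic) > discriminability(vecs, semantic))  # True
```

A "phonetically dominated" representation, in this sketch, is one where the phonetic labeling scores higher than the semantic one, as above.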
Related papers
- Layover or Direct Flight: Rethinking Audio-Guided Image Segmentation [65.7990140284317]
We focus on object grounding, i.e., localizing an object of interest in a visual scene based on verbal human instructions. To explore this possibility, we simplify the task by focusing on grounding from single-word spoken instructions. Our results demonstrate that direct grounding from audio is not only feasible but, in some cases, even outperforms transcription-based methods.
arXiv Detail & Related papers (2025-11-27T02:00:28Z) - CLARA: Multilingual Contrastive Learning for Audio Representation Acquisition [5.520654376217889]
CLARA minimizes reliance on labelled data, enhancing generalization across languages.
Our approach adeptly captures emotional nuances in speech, overcoming subjective assessment issues.
It adapts to low-resource languages, marking progress in multilingual speech representation learning.
arXiv Detail & Related papers (2023-10-18T09:31:56Z) - Leverage Points in Modality Shifts: Comparing Language-only and Multimodal Word Representations [0.8594140167290097]
Multimodal embeddings aim to enrich the semantic information in neural representations of language compared to text-only models.
Our paper compares word embeddings from three vision-and-language models and three text-only models, with static and contextual representations.
This is the first large-scale study of the effect of visual grounding on language representations, including 46 semantic parameters.
arXiv Detail & Related papers (2023-06-04T12:53:12Z) - Bridging the Gap: Using Deep Acoustic Representations to Learn Grounded Language from Percepts and Raw Speech [26.076534338576234]
Learning to understand grounded language, which connects natural language to percepts, is a critical research area.
In this work we demonstrate the feasibility of performing grounded language acquisition on paired visual percepts and raw speech inputs.
arXiv Detail & Related papers (2021-12-27T16:12:30Z) - Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning [3.441021278275805]
We design a two-stream model for grounding language learning in vision.
The model first learns to align visual and language representations with the MS COCO dataset.
After training, the language stream of this model is a stand-alone language model capable of embedding concepts in a visually grounded semantic space.
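The cross-modal alignment step in a two-stream model of this kind is typically a symmetric contrastive (InfoNCE-style) objective: each image embedding should score highest against its own caption, and vice versa. The sketch below is a generic illustration of that objective on toy vectors, not the paper's actual architecture or loss; all names and values are made up.

```python
from math import exp, log, sqrt

def cos(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def contrastive_loss(img_vecs, txt_vecs, temperature=0.1):
    """Symmetric InfoNCE over paired image/caption embeddings:
    averages the image-to-text and text-to-image retrieval losses."""
    n = len(img_vecs)
    sims = [[cos(a, b) / temperature for b in txt_vecs] for a in img_vecs]
    def nll(row, target):
        # Negative log-probability of the correct match under softmax.
        z = sum(exp(s) for s in row)
        return -log(exp(row[target]) / z)
    img2txt = sum(nll(sims[i], i) for i in range(n)) / n
    txt2img = sum(nll([sims[j][i] for j in range(n)], i) for i in range(n)) / n
    return (img2txt + txt2img) / 2

imgs = [(1.0, 0.0), (0.0, 1.0)]          # hypothetical image embeddings
caps_aligned = [(1.0, 0.0), (0.0, 1.0)]  # matching captions
caps_swapped = [(0.0, 1.0), (1.0, 0.0)]  # mismatched captions
print(contrastive_loss(imgs, caps_aligned) < contrastive_loss(imgs, caps_swapped))  # True
```

Minimizing such a loss pulls paired image and language embeddings together, which is what leaves the language stream with a visually grounded semantic space after training.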
arXiv Detail & Related papers (2021-11-13T19:54:15Z) - From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union.
VisionLAN significantly improves the speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z) - Learning Audio-Visual Dereverberation [87.52880019747435]
Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition.
Our idea is to learn to dereverberate speech from audio-visual observations.
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene.
arXiv Detail & Related papers (2021-06-14T20:01:24Z) - Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
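At its core, assigning a voken is a retrieval step: each contextual token embedding is mapped to the index of its most similar image embedding, and that index becomes an auxiliary visual-supervision target. The sketch below illustrates this retrieval step only, with made-up toy embeddings; it is not the paper's implementation.

```python
from math import sqrt

def cos(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def assign_vokens(token_vecs, image_vecs):
    """For each contextual token embedding, retrieve the index of the
    nearest image embedding; that index is the token's 'voken', usable
    as a classification target alongside the language-modeling loss."""
    vokens = []
    for t in token_vecs:
        sims = [cos(t, im) for im in image_vecs]
        vokens.append(max(range(len(sims)), key=sims.__getitem__))
    return vokens

tokens = [(1.0, 0.1), (0.2, 0.9)]   # hypothetical contextual token embeddings
images = [(0.9, 0.1), (0.1, 0.9)]   # hypothetical image embeddings
print(assign_vokens(tokens, images))  # [0, 1]
```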
arXiv Detail & Related papers (2020-10-14T02:11:51Z) - An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
More recently, deep learning has been exploited to achieve strong performance on both.
arXiv Detail & Related papers (2020-08-21T17:24:09Z) - "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and the human perception.
arXiv Detail & Related papers (2020-06-12T06:51:55Z) - Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are in distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.