Related papers: How do Multimodal Foundation Models Encode Text and Speech? An Analysis of Cross-Lingual and Cross-Modal Representations

URL: http://arxiv.org/abs/2411.17666v1
Date: Tue, 26 Nov 2024 18:29:11 GMT
Title: How do Multimodal Foundation Models Encode Text and Speech? An Analysis of Cross-Lingual and Cross-Modal Representations
Authors: Hyunji Lee, Danni Liu, Supriti Sinhamahapatra, Jan Niehues
Abstract summary: Cross-modal representations converge over model layers, except in the initial layers specialized at text and speech processing. Speech exhibits larger cross-lingual differences than text. For models not explicitly trained for modality-agnostic representations, the modality gap is more prominent than the language gap.
Score: 17.528100902591056
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal foundation models aim to create a unified representation space that abstracts away from surface features like language syntax or modality differences. To investigate this, we study the internal representations of three recent models, analyzing the model activations from semantically equivalent sentences across languages in the text and speech modalities. Our findings reveal that: 1) Cross-modal representations converge over model layers, except in the initial layers specialized at text and speech processing. 2) Length adaptation is crucial for reducing the cross-modal gap between text and speech, although current approaches' effectiveness is primarily limited to high-resource languages. 3) Speech exhibits larger cross-lingual differences than text. 4) For models not explicitly trained for modality-agnostic representations, the modality gap is more prominent than the language gap.

Related papers

Leveraging Unit Language Guidance to Advance Speech Modeling in Textless Speech-to-Speech Translation [48.769137497536]
We propose the unit language to overcome the two modeling challenges. The unit language can be considered a text-like representation format. We implement multi-task learning to utilize the unit language in guiding the speech modeling process.
arXiv Detail & Related papers (2025-05-21T10:05:25Z)
Leverage Points in Modality Shifts: Comparing Language-only and Multimodal Word Representations [0.8594140167290097]
Multimodal embeddings aim to enrich the semantic information in neural representations of language compared to text-only models. Our paper compares word embeddings from three vision-and-language models and three text-only models, with static and contextual representations. This is the first large-scale study of the effect of visual grounding on language representations, including 46 semantic parameters.
arXiv Detail & Related papers (2023-06-04T12:53:12Z)
Multilingual Multi-Figurative Language Detection [14.799109368073548]
Figurative language understanding is highly understudied in a multilingual setting. We introduce multilingual multi-figurative language modelling, and provide a benchmark for sentence-level figurative language detection. We develop a framework for figurative language detection based on template-based prompt learning.
arXiv Detail & Related papers (2023-05-31T18:52:41Z)
ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Spotting [121.11880210592497]
We argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) language model with noise input. We propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting.
arXiv Detail & Related papers (2022-11-19T03:50:33Z)
Cross-Align: Modeling Deep Cross-lingual Interactions for Word Alignment [63.0407314271459]
Experiments show that the proposed Cross-Align achieves state-of-the-art (SOTA) performance on four out of five language pairs.
arXiv Detail & Related papers (2022-10-09T02:24:35Z)
Unify and Conquer: How Phonetic Feature Representation Affects Polyglot Text-To-Speech (TTS) [3.57486761615991]
Unified representations consistently achieve better cross-lingual synthesis with respect to both naturalness and accent. Separate representations tend to have an order of magnitude more tokens than unified ones, which may affect model capacity.
arXiv Detail & Related papers (2022-07-04T16:14:57Z)
AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context. It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts. Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner. Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously. We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions. Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text. We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are in distinguishing between matching and non-matching visual representations. Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories. Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.