GLAP: General contrastive audio-text pretraining across domains and languages
- URL: http://arxiv.org/abs/2506.11350v1
- Date: Thu, 12 Jun 2025 22:54:31 GMT
- Title: GLAP: General contrastive audio-text pretraining across domains and languages
- Authors: Heinrich Dinkel, Zhiyong Yan, Tianzi Wang, Yongqing Wang, Xingwei Sun, Yadong Niu, Jizhong Liu, Gang Li, Junbo Zhang, Jian Luan
- Abstract summary: We introduce general language audio pretraining (GLAP). GLAP expands Contrastive Language Audio Pretraining (CLAP) with multilingual and multi-domain abilities.
- Score: 26.996784244258073
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contrastive Language Audio Pretraining (CLAP) is a widely-used method to bridge the gap between audio and text domains. Current CLAP methods enable sound and music retrieval in English, ignoring multilingual spoken content. To address this, we introduce general language audio pretraining (GLAP), which expands CLAP with multilingual and multi-domain abilities. GLAP demonstrates its versatility by achieving competitive performance on standard audio-text retrieval benchmarks like Clotho and AudioCaps, while significantly surpassing existing methods in speech retrieval and classification tasks. Additionally, GLAP achieves strong results on widely used sound-event zero-shot benchmarks, while simultaneously outperforming previous methods on speech content benchmarks. Further keyword spotting evaluations across 50 languages emphasize GLAP's advanced multilingual capabilities. Finally, multilingual sound and music understanding is evaluated across four languages. Checkpoints and Source: https://github.com/xiaomi-research/dasheng-glap.
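As a rough illustration of the CLAP-style objective that GLAP extends, the sketch below computes a symmetric contrastive (InfoNCE) loss over a batch of paired audio and text embeddings. The encoder outputs, embedding dimension, and temperature are placeholder assumptions and do not reflect the paper's actual architecture or hyperparameters.

```python
# Minimal sketch of a CLAP-style symmetric contrastive objective.
# The temperature and embedding shapes are illustrative assumptions,
# not GLAP's actual configuration.
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """audio_emb, text_emb: (batch, dim) embeddings of paired audio clips and captions."""
    # L2-normalize so the dot product becomes cosine similarity.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal; train both retrieval directions.
    loss_audio_to_text = F.cross_entropy(logits, targets)
    loss_text_to_audio = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_audio_to_text + loss_text_to_audio)
```

At inference time the same similarity scores support zero-shot classification and retrieval: each candidate label or caption (in any supported language) is embedded as text and the highest-scoring one is selected.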
Related papers
- CLASP: Contrastive Language-Speech Pretraining for Multilingual Multimodal Information Retrieval [0.9023847175654603]
CLASP (Contrastive Language-Speech Pretraining) is a multilingual representation tailored for audio-text information retrieval.
During training, we utilize our newly introduced speech-text dataset, which encompasses 15 diverse categories ranging from fiction to religion.
Our evaluations across multiple languages demonstrate that CLASP establishes new benchmarks in HITS@1, MRR, and meanR metrics.
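The retrieval metrics named above (HITS@1, MRR, meanR) can all be read off the rank of the ground-truth item for each query. The generic sketch below assumes a square query-by-candidate similarity matrix with matches on the diagonal; it is an illustration, not CLASP's evaluation code.

```python
# Hedged sketch of common cross-modal retrieval metrics computed from a
# similarity matrix; ground truth for query i is assumed to be candidate i.
import numpy as np

def retrieval_metrics(sim: np.ndarray) -> dict:
    """sim[i, j]: similarity of query i to candidate j."""
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)                               # candidates ranked best-first per query
    ranks = np.argmax(order == np.arange(n)[:, None], axis=1) + 1  # 1-based rank of the true match
    return {
        "HITS@1": float(np.mean(ranks == 1)),
        "MRR": float(np.mean(1.0 / ranks)),
        "meanR": float(np.mean(ranks)),
    }
```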
arXiv Detail & Related papers (2024-12-17T16:38:10Z)
- Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models [58.43486430996411]
Large Audio-Language Models (LALMs) have recently unlocked audio dialogue capabilities, enabling direct spoken exchanges with humans.
We propose an Audio Dialogue Understanding Benchmark (ADU-Bench) to evaluate the performance of LALMs in open-ended audio dialogue understanding.
ADU-Bench includes over 20,000 open-ended audio dialogues for the assessment of LALMs.
arXiv Detail & Related papers (2024-12-06T16:34:15Z)
- Do Audio-Language Models Understand Linguistic Variations? [42.17718387132912]
Open-vocabulary audio language models (ALMs) represent a promising new paradigm for audio-text retrieval using natural language queries.
We propose RobustCLAP, a novel and compute-efficient technique to learn audio-language representations robust to linguistic variations.
arXiv Detail & Related papers (2024-10-21T20:55:33Z)
- Bridging Language Gaps in Audio-Text Retrieval [28.829775980536574]
We propose a language enhancement (LE) using a multilingual text encoder (SONAR) to encode the text data with language-specific information.
We optimize the audio encoder through the application of consistent ensemble distillation (CED), enhancing support for variable-length audio-text retrieval.
Our methodology excels in English audio-text retrieval, demonstrating state-of-the-art (SOTA) performance on commonly used datasets such as AudioCaps and Clotho.
arXiv Detail & Related papers (2024-06-11T07:12:12Z)
- ViSpeR: Multilingual Audio-Visual Speech Recognition [9.40993779729177]
This work presents an extensive and detailed study on Audio-Visual Speech Recognition for five widely spoken languages.
We have collected large-scale datasets for each language except English and trained supervised learning models on them.
Our model, ViSpeR, is trained in a multi-lingual setting, resulting in competitive performance on newly established benchmarks for each language.
arXiv Detail & Related papers (2024-05-27T14:48:51Z)
- XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception [62.660135152900615]
Speech recognition and translation systems perform poorly on noisy inputs.
XLAVS-R is a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation.
arXiv Detail & Related papers (2024-03-21T13:52:17Z)
- FLAP: Fast Language-Audio Pre-training [16.46254370386555]
We propose Fast Language-Audio Pre-training (FLAP), a self-supervised approach that efficiently learns aligned audio and language representations.
For efficiency, FLAP randomly drops audio spectrogram tokens, focusing solely on the remaining ones for self-supervision.
FLAP learns to align paired audio and text representations in a shared latent space.
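To make the token-dropping idea concrete, here is a minimal sketch of randomly dropping spectrogram patch tokens before encoding, so that only the kept subset is processed; the tensor shapes and keep ratio are assumptions for illustration, not FLAP's actual implementation.

```python
# Illustrative random token dropping for efficiency: keep a random subset of
# spectrogram patch tokens per example. Shapes and keep_ratio are assumptions.
import torch

def drop_audio_tokens(tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """tokens: (batch, num_tokens, dim) spectrogram patch embeddings."""
    batch, num_tokens, dim = tokens.shape
    num_keep = max(1, int(num_tokens * keep_ratio))
    # Sample an independent random subset of token positions for each example.
    scores = torch.rand(batch, num_tokens, device=tokens.device)
    keep_idx = scores.argsort(dim=1)[:, :num_keep]
    keep_idx = keep_idx.sort(dim=1).values  # restore temporal order of kept tokens
    return torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, dim))
```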
arXiv Detail & Related papers (2023-11-02T21:58:50Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that multilingual models trained with more data outperform monolingual ones, but when the amount of data is kept fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling [92.55131711064935]
We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis.
VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks.
It can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment.
arXiv Detail & Related papers (2023-03-07T14:31:55Z)
- M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval [56.49878599920353]
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.
For non-English image-speech retrieval, we outperform the current state of the art by a wide margin, both when training separate models for each language and with a single model that processes speech in all three languages.
arXiv Detail & Related papers (2022-11-02T14:54:45Z)
- That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages [72.9927937955371]
We use resources available in other languages to train a multilingual automatic speech recognition model.
We observe significant improvements across all languages in the multilingual setting, and stark degradation in the cross-lingual setting.
Our analysis uncovered that even the phones that are unique to a single language can benefit greatly from adding training data from other languages.
arXiv Detail & Related papers (2020-05-16T22:28:09Z)