Related papers: MooER: LLM-based Speech Recognition and Translation Models from Moore Threads

URL: http://arxiv.org/abs/2408.05101v1
Date: Fri, 9 Aug 2024 14:43:56 GMT
Title: MooER: LLM-based Speech Recognition and Translation Models from Moore Threads
Authors: Junhao Xu, Zhenlin Liang, Yi Liu, Yichao Hu, Jian Li, Yajun Zheng, Meng Cai, Hua Wang,
Abstract summary: MooER is a large-scale automatic speech recognition (ASR) / automatic speech translation (AST) model from Moore Threads. A 5000h pseudo-labeled dataset containing open-source and self-collected speech data is used for training. Experiments conducted on the CoVoST2 Zh2en test set suggest that the model outperforms other open-source Speech LLMs.
Score: 13.02816167879662
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we present MooER, an LLM-based large-scale automatic speech recognition (ASR) / automatic speech translation (AST) model from Moore Threads. A 5000h pseudo-labeled dataset containing open-source and self-collected speech data is used for training. We achieve performance comparable to other open-source models trained with up to hundreds of thousands of hours of labeled speech data. Meanwhile, experiments conducted on the CoVoST2 Zh2en test set suggest that our model outperforms other open-source Speech LLMs, obtaining a BLEU score of 25.2. The main contributions of this paper are summarized as follows. First, this paper presents a training strategy for encoders and LLMs on speech-related tasks (including ASR and AST) using a small amount of pseudo-labeled data without any extra manual annotation or selection. Second, we release our ASR and AST models and plan to open-source our training code and strategy in the near future. Moreover, a model trained on 8wh-scale (80,000-hour) training data is planned to be released later on.

Related papers

AzeroS: Extending LLM to Speech with Self-Generated Instruction-Free Tuning [49.68129589035101]
We introduce AZeroS (Auden Zero-instruction-tuned Speech-LLM), which is trained on speech-text pairs derived from publicly available corpora. AZeroS achieves state-of-the-art performance on both semantic and paralinguistic benchmarks.
arXiv Detail & Related papers (2025-12-31T04:05:04Z)
OpusLM: A Family of Open Unified Speech Language Models [56.14140121061921]
The OpusLMs are continuously pre-trained on 213K hours of speech-text pairs and 292B text-only tokens. Technically, this paper articulates our SpeechLM designs on tokenization, multi-stream language models, and multi-stage training strategies.
arXiv Detail & Related papers (2025-06-21T06:30:59Z)
Scaling Analysis of Interleaved Speech-Text Language Models [22.61336359340435]
Speech Language Model (SLM) scaling analysis predicts that SLMs require much more compute and data than text models. Modern SLMs are often initialised from pre-trained TextLMs using speech-text interleaving to allow knowledge transfer. We conduct a scaling analysis of interleaved SLMs by training several dozen models and analysing the scaling trends. We see that under this setup SLMs scale more efficiently with compute.
arXiv Detail & Related papers (2025-04-03T08:46:56Z)
Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs). We find that fine-tuning text embedding models on LLM-generated texts yields excellent classification accuracy. We leverage LLMs as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
arXiv Detail & Related papers (2025-02-17T18:59:02Z)
Zero-resource Speech Translation and Recognition with LLMs [38.11535502039386]
We propose to leverage a multilingual Large Language Model (LLM) to perform ST and ASR in languages for which the model has never seen paired audio-text data. We achieve this by using a pre-trained multilingual speech encoder, a multilingual LLM, and a lightweight adaptation module that maps the audio representations to the token embedding space of the LLM.
arXiv Detail & Related papers (2024-12-24T17:37:11Z)
SyllableLM: Learning Coarse Semantic Units for Speech Language Models [21.762112843104028]
We introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units. Our method produces controllable-rate semantic units at as low as 5Hz and 60bps and achieves SotA segmentation and clustering. SyllableLM achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
arXiv Detail & Related papers (2024-10-05T04:29:55Z)
Speech Recognition Rescoring with Large Speech-Text Foundation Models [20.145389016219106]
Large language models (LLMs) have demonstrated the ability to understand human language by leveraging large amounts of text data. Automatic speech recognition (ASR) systems are often limited by available transcribed speech data. Recent multi-modal large language models have demonstrated strong spoken language understanding.
arXiv Detail & Related papers (2024-09-25T06:17:23Z)
Comparing Discrete and Continuous Space LLMs for Speech Recognition [46.70297458685438]
This paper investigates discrete and continuous speech representations in Large Language Model (LLM)-based Automatic Speech Recognition (ASR). We classify LLMs based on their input and autoregressive feedback into continuous and discrete-space models. We present an open-sourced achievement of a state-of-the-art Word Error Rate (WER) of 1.69% on LibriSpeech using a HuBERT encoder.
arXiv Detail & Related papers (2024-09-01T18:29:45Z)
Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs [61.04246774006429]
We introduce a black-box prompt optimization method that uses an attacker LLM agent to uncover higher levels of memorization in a victim agent. We observe that our instruction-based prompts generate outputs with 23.7% higher overlap with training data compared to the baseline prefix-suffix measurements. Our findings show that instruction-tuned models can expose pre-training data as much as their base-models, if not more so, and using instructions proposed by other LLMs can open a new avenue of automated attacks.
arXiv Detail & Related papers (2024-03-05T19:32:01Z)
Leveraging Weakly Annotated Data for Hate Speech Detection in Code-Mixed Hinglish: A Feasibility-Driven Transfer Learning Approach with Large Language Models [0.0]
Hate speech detection in code-mixed low-resource languages is an active problem area where the use of Large Language Models has proven beneficial. In this study, we have compiled a dataset of 100 YouTube comments and weakly labelled them for coarse and fine-grained misogyny classification in code-mixed Hinglish. Out of all the approaches, zero-shot classification using the Bidirectional Auto-Regressive Transformers (BART) large model and few-shot prompting using Generative Pre-trained Transformer-3 (ChatGPT-3) achieve the best results.
arXiv Detail & Related papers (2024-03-04T15:27:49Z)
YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters. YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline. The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z)
Generative Context-aware Fine-tuning of Self-supervised Speech Models [54.389711404209415]
We study the use of context information generated by generative large language models (LLMs). We propose an approach to distill the generated information during fine-tuning of self-supervised speech models. We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: automatic speech recognition, named entity recognition, and sentiment analysis.
arXiv Detail & Related papers (2023-12-15T15:46:02Z)
Textually Pretrained Speech Language Models [107.10344535390956]
We propose TWIST, a method for training SpeechLMs using a warm-start from a pretrained textual language model. We show using both automatic and human evaluations that TWIST outperforms a cold-start SpeechLM across the board.
arXiv Detail & Related papers (2023-05-22T13:12:16Z)
Self-supervised Learning with Random-projection Quantizer for Speech Recognition [51.24368930992091]
We present a simple and effective self-supervised learning approach for speech recognition. The approach learns a model to predict masked speech signals, in the form of discrete labels. It achieves word error rates similar to those of previous work using self-supervised learning with non-streaming models.
arXiv Detail & Related papers (2022-02-03T21:29:04Z)
W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training [49.47516627019855]
w2v-BERT is a framework that combines contrastive learning and masked language modeling for self-supervised speech pre-training. Our experiments show that w2v-BERT achieves competitive results compared to current state-of-the-art pre-trained models.
arXiv Detail & Related papers (2021-08-07T06:29:36Z)

This list is automatically generated from the titles and abstracts of the papers on this site.