Textual Supervision for Visually Grounded Spoken Language Understanding
- URL: http://arxiv.org/abs/2010.02806v2
- Date: Wed, 7 Oct 2020 07:48:12 GMT
- Title: Textual Supervision for Visually Grounded Spoken Language Understanding
- Authors: Bertrand Higy, Desmond Elliott, Grzegorz Chrupała
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visually-grounded models of spoken language understanding extract semantic
information directly from speech, without relying on transcriptions. This is
useful for low-resource languages, where transcriptions can be expensive or
impossible to obtain. Recent work showed that these models can be improved if
transcriptions are available at training time. However, it is not clear how an
end-to-end approach compares to a traditional pipeline-based approach when one
has access to transcriptions. Comparing different strategies, we find that the
pipeline approach works better when enough text is available. With low-resource
languages in mind, we also show that translations can be effectively used in
place of transcriptions but more data is needed to obtain similar results.
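The abstract contrasts two strategies for grounding speech in images when transcriptions are available. As a minimal sketch of that contrast (not the authors' implementation; all encoder names and the toy vectors are illustrative placeholders, where real systems would use neural networks), an end-to-end model maps speech and image directly into a shared semantic space, while a pipeline first transcribes the speech and then grounds the text:

```python
# Toy sketch of the two strategies compared in the paper. All encoders
# below are hypothetical stand-ins; real models would be neural networks.

def cosine(u, v):
    # Cosine similarity between two vectors in the shared semantic space.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def speech_encoder(speech):
    # Placeholder: raw speech -> semantic vector (end-to-end path).
    return [len(speech) % 7, 1.0, 2.0]

def image_encoder(image):
    # Placeholder: image -> semantic vector.
    return [3.0, 1.0, 2.0]

def asr(speech):
    # Placeholder ASR step, used only by the pipeline approach.
    return "a dog chasing a ball"

def text_encoder(text):
    # Placeholder: transcript -> semantic vector.
    return [float(len(text.split())), 1.0, 2.0]

def end_to_end_score(speech, image):
    # End-to-end: score speech against the image directly.
    return cosine(speech_encoder(speech), image_encoder(image))

def pipeline_score(speech, image):
    # Pipeline: transcribe first, then ground the text in the image.
    return cosine(text_encoder(asr(speech)), image_encoder(image))
```

The paper's finding is that the pipeline variant wins when enough text is available; the sketch only makes the structural difference between the two scoring paths concrete.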
Related papers
- TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data [50.40191599304911]
We propose Transliterate transliteration-Merge (TransMI), which can create a strong baseline well-suited for data that is transliterated into a common script.
Results show a consistent improvement of 3% to 34%, varying across different models and tasks.
arXiv Detail & Related papers (2024-05-16T09:08:09Z)
- Deepfake audio as a data augmentation technique for training automatic speech to text transcription models [55.2480439325792]
We propose a data augmentation framework based on deepfake audio.
A dataset of English speech produced by Indian speakers was selected, ensuring the presence of a single accent.
arXiv Detail & Related papers (2023-09-22T11:33:03Z)
- Optimal Transport Posterior Alignment for Cross-lingual Semantic Parsing [68.47787275021567]
Cross-lingual semantic parsing transfers parsing capability from a high-resource language (e.g., English) to low-resource languages with scarce training data.
We propose a new approach to cross-lingual semantic parsing by explicitly minimizing cross-lingual divergence between latent variables using Optimal Transport.
arXiv Detail & Related papers (2023-07-09T04:52:31Z)
- Investigating Lexical Sharing in Multilingual Machine Translation for Indian Languages [8.858671209228536]
We investigate lexical sharing in multilingual machine translation from Hindi, Gujarati, Nepali into English.
We find that transliteration does not give pronounced improvements.
Our analysis suggests that our multilingual MT models trained on original scripts seem to already be robust to cross-script differences.
arXiv Detail & Related papers (2023-05-04T23:35:15Z) - TS-Net: OCR Trained to Switch Between Text Transcription Styles [0.0]
We propose to extend existing text recognition networks with a Transcription Style Block (TSB).
TSB can learn from data to switch between multiple transcription styles without any explicit knowledge of transcription rules.
We show that TSB is able to learn completely different transcription styles in controlled experiments on artificial data.
arXiv Detail & Related papers (2021-03-09T15:21:40Z)
- Enabling Interactive Transcription in an Indigenous Community [23.53585157238112]
We propose a novel transcription workflow which combines spoken term detection with a human-in-the-loop approach.
We show that in the early stages of transcription, when the available data is insufficient to train a robust ASR system, it is possible to take advantage of the transcription of a small number of isolated words.
arXiv Detail & Related papers (2020-11-12T04:41:35Z)
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z)
- Consistent Transcription and Translation of Speech [13.652411093089947]
We explore the task of jointly transcribing and translating speech.
While high accuracy of both transcript and translation is crucial, even highly accurate systems can suffer from inconsistencies between the two outputs.
We find that direct models are poorly suited to the joint transcription/translation task, but that end-to-end models that feature a coupled inference procedure are able to achieve strong consistency.
arXiv Detail & Related papers (2020-07-24T19:17:26Z)
- Self-Supervised Representations Improve End-to-End Speech Translation [57.641761472372814]
We show that self-supervised pre-trained features can consistently improve the translation performance.
Cross-lingual transfer allows extending to a variety of languages with little or no tuning.
arXiv Detail & Related papers (2020-06-22T10:28:38Z)