Contextualizing ASR Lattice Rescoring with Hybrid Pointer Network
Language Model
- URL: http://arxiv.org/abs/2005.07394v1
- Date: Fri, 15 May 2020 07:47:33 GMT
- Title: Contextualizing ASR Lattice Rescoring with Hybrid Pointer Network
Language Model
- Authors: Da-Rong Liu, Chunxi Liu, Frank Zhang, Gabriel Synnaeve, Yatharth
Saraf, Geoffrey Zweig
- Abstract summary: In building automatic speech recognition systems, we can exploit the contextual information provided by video metadata.
We first use an attention based method to extract contextual vector representations of video metadata, and use these representations as part of the inputs to a neural language model.
Secondly, we propose a hybrid pointer network approach to explicitly interpolate the word probabilities of the word occurrences in metadata.
- Score: 26.78064626111014
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Videos uploaded on social media are often accompanied with textual
descriptions. In building automatic speech recognition (ASR) systems for
videos, we can exploit the contextual information provided by such video
metadata. In this paper, we explore ASR lattice rescoring by selectively
attending to the video descriptions. We first use an attention based method to
extract contextual vector representations of video metadata, and use these
representations as part of the inputs to a neural language model during lattice
rescoring. Secondly, we propose a hybrid pointer network approach to explicitly
interpolate the word probabilities of the word occurrences in metadata. We
perform experimental evaluations on both language modeling and ASR tasks, and
demonstrate that both proposed methods provide performance improvements by
selectively leveraging the video metadata.
Related papers
- Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset [4.452729255042396]
A more robust and holistic language-video representation is the key to pushing video understanding forward.
The current plain and simple text descriptions and the visual-only focus for the language-video tasks result in a limited capacity in real-world natural language video retrieval tasks.
This paper introduces a method to automatically enhance video-language datasets, making them more modality and context-aware.
arXiv Detail & Related papers (2024-06-19T20:16:17Z) - Improved Contextual Recognition In Automatic Speech Recognition Systems
By Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing.
Our solution consists of using Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) along with Deep Neural Networks (DNN) models for better accuracy.
We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
arXiv Detail & Related papers (2023-10-14T23:16:05Z) - DeVAn: Dense Video Annotation for Video-Language Models [68.70692422636313]
We present a novel human annotated dataset for evaluating the ability for visual-language models to generate descriptions for real-world video clips.
The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests.
arXiv Detail & Related papers (2023-10-08T08:02:43Z) - Video-Teller: Enhancing Cross-Modal Generation with Fusion and
Decoupling [79.49128866877922]
Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment.
Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules.
It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
arXiv Detail & Related papers (2023-10-08T03:35:27Z) - Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection
to Image-Text Pre-Training [70.83385449872495]
The correlation between the vision and text is essential for video moment retrieval (VMR)
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z) - Towards Fast Adaptation of Pretrained Contrastive Models for
Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval require models to understand information from different channels.
contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is not a clear way to quickly adapt these two lines to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z) - Controllable Video Captioning with an Exemplar Sentence [89.78812365216983]
We propose a novel Syntax Modulated Caption Generator (SMCG) incorporated in an encoder-decoder-reconstructor architecture.
SMCG takes video semantic representation as an input, and conditionally modulates the gates and cells of long short-term memory network.
We conduct experiments by collecting auxiliary exemplar sentences for two public video captioning datasets.
arXiv Detail & Related papers (2021-12-02T09:24:45Z) - Open-book Video Captioning with Retrieve-Copy-Generate Network [42.374461018847114]
In this paper, we convert traditional video captioning task into a new paradigm, ie, Open-book Video Captioning.
We propose a novel Retrieve-Copy-Generate network, where a pluggable video-to-text retriever is constructed to retrieve sentences as hints from the training corpus effectively.
Our framework coordinates the conventional retrieval-based methods with orthodox encoder-decoder methods, which can not only draw on the diverse expressions in the retrieved sentences but also generate natural and accurate content of the video.
arXiv Detail & Related papers (2021-03-09T08:17:17Z) - Neuro-Symbolic Representations for Video Captioning: A Case for
Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z) - Contextual RNN-T For Open Domain ASR [41.83409885125617]
End-to-end (E2E) systems for automatic speech recognition (ASR) blend the individual components of a traditional hybrid ASR system into a single neural network.
This has some nice advantages, it limits the system to be trained using only paired audio and text.
Because of this, E2E models tend to have difficulties with correctly recognizing rare words that are not frequently seen during training, such as entity names.
We propose modifications to the RNN-T model that allow the model to utilize additional metadata text with the objective of improving performance on these named entity words.
arXiv Detail & Related papers (2020-06-04T04:37:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.