Punctuation Restoration
- URL: http://arxiv.org/abs/2202.09695v1
- Date: Sat, 19 Feb 2022 23:12:57 GMT
- Title: Punctuation Restoration
- Authors: Viet Dac Lai, Amir Pouran Ben Veyseh, Franck Dernoncourt, Thien Huu Nguyen
- Abstract summary: This work presents a new human-annotated corpus, called BehancePR, for punctuation restoration in livestreaming video transcripts.
Our experiments on BehancePR demonstrate the challenges of punctuation restoration for this domain.
- Score: 69.97278287534157
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Given the increasing number of livestreaming videos, automatic speech
recognition and post-processing for livestreaming video transcripts are crucial
for efficient data management as well as knowledge mining. A key step in this
process is punctuation restoration, which restores fundamental text structures
such as phrase and sentence boundaries from the video transcripts. This work
presents a new human-annotated corpus, called BehancePR, for punctuation
restoration in livestreaming video transcripts. Our experiments on BehancePR
demonstrate the challenges of punctuation restoration for this domain.
Furthermore, we show that popular natural language processing toolkits are
incapable of detecting sentence boundaries in non-punctuated transcripts of
livestreaming videos, calling for more research effort to develop robust models
for this area.
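To make the task concrete, the following is a minimal sketch, not the authors' BehancePR code: it frames punctuation restoration as per-word classification, where each word is labeled with the mark (if any) that should follow it. The transcript, the four-label scheme, and the gold labels are invented for illustration; the NLTK call also demonstrates why an off-the-shelf sentence splitter finds no boundaries in non-punctuated input.

```python
# Minimal illustrative sketch, not the BehancePR authors' code.
import nltk

nltk.download("punkt", quiet=True)  # splitter models; newer NLTK may need "punkt_tab"

# A non-punctuated, livestream-style transcript (invented example).
transcript = "okay so today i will show you how i shade this sketch then we can take questions"

# Off-the-shelf splitters key on punctuation, so the whole transcript
# comes back as a single "sentence" -- the failure mode noted above.
print(nltk.sent_tokenize(transcript))  # one segment, no boundaries found

# Assumed label scheme: the mark (if any) that should FOLLOW each word.
PUNCT = {"O": "", "COMMA": ",", "PERIOD": ".", "QUESTION": "?"}

def restore(words, labels):
    """Apply per-word punctuation labels and recover basic capitalization."""
    pieces, capitalize_next = [], True
    for word, label in zip(words, labels):
        if word == "i":          # special-case the pronoun "I"
            word = "I"
        if capitalize_next:
            word = word.capitalize()
        pieces.append(word + PUNCT[label])
        capitalize_next = label in ("PERIOD", "QUESTION")
    return " ".join(pieces)

words = transcript.split()
# Labels a human annotator (or a trained tagger) might assign.
labels = ["COMMA", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "PERIOD",
          "O", "O", "O", "O", "PERIOD"]
print(restore(words, labels))
# -> Okay, so today I will show you how I shade this sketch. Then we can take questions.
```

A real system would predict these labels with a trained sequence tagger; the sketch only pins down the task's input-output contract.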
Related papers
- Recent Advances in Speech Language Models: A Survey [45.968078636811356]
Speech Language Models (SpeechLMs) are end-to-end models that generate speech without converting from text.
This paper provides the first comprehensive overview of recent methodologies for constructing SpeechLMs.
arXiv Detail & Related papers (2024-10-01T21:48:12Z)
- Speech Editing -- a Summary [8.713498822221222]
This paper explores text-based speech editing methods that modify audio via text transcripts without manual waveform editing.
The aim is to highlight ongoing issues and inspire further research and innovation in speech editing.
arXiv Detail & Related papers (2024-07-24T11:22:57Z)
- Towards Accurate Lip-to-Speech Synthesis in-the-Wild [31.289366690147556]
We introduce a novel approach to address the task of synthesizing speech from silent videos of any in-the-wild speaker solely based on lip movements.
The traditional approach of generating speech directly from lip videos struggles because a robust language model cannot be learned from speech alone.
We propose incorporating noisy text supervision using a state-of-the-art lip-to-text network that instills language information into our model.
arXiv Detail & Related papers (2024-03-02T04:07:24Z)
- TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale [59.01246141215051]
We analyze the factor that leads to degradation from the perspective of language supervision.
We propose a tuning-free pre-training strategy to retain the generalization ability of the text encoder.
We produce a series of models, dubbed TVTSv2, with up to one billion parameters.
arXiv Detail & Related papers (2023-05-23T15:44:56Z)
- SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model [56.49878599920353]
SpeechCLIP is a novel framework bridging speech and text through images to enhance speech models without transcriptions.
We leverage state-of-the-art pre-trained HuBERT and CLIP, aligning them via paired images and spoken captions with minimal fine-tuning.
arXiv Detail & Related papers (2022-10-03T04:15:36Z)
- Transcribing Natural Languages for The Deaf via Neural Editing Programs [84.0592111546958]
We study the task of glossification, which aims to transcribe natural spoken language sentences into ordered sign language glosses for the Deaf (hard-of-hearing) community.
Previous sequence-to-sequence language models often fail to capture the rich connections between the two distinct languages, leading to unsatisfactory transcriptions.
We observe that despite their different grammars, glosses effectively simplify sentences for ease of communication in the Deaf community, while sharing a large portion of vocabulary with the sentences.
arXiv Detail & Related papers (2021-12-17T16:21:49Z)
- StreamHover: Livestream Transcript Summarization and Annotation [54.41877742041611]
We present StreamHover, a framework for annotating and summarizing livestream transcripts.
With a total of over 500 hours of videos annotated with both extractive and abstractive summaries, our benchmark dataset is significantly larger than currently existing annotated corpora.
We show that our model generalizes better and improves performance over strong baselines.
arXiv Detail & Related papers (2021-09-11T02:19:37Z)
- Towards Automatic Speech to Sign Language Generation [35.22004819666906]
We propose a multi-language transformer network trained to generate a signer's poses from speech segments.
Our model learns to generate continuous sign pose sequences in an end-to-end manner.
arXiv Detail & Related papers (2021-06-24T06:44:19Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)