Abstractive Summarization of Spoken and Written Instructions with BERT
- URL: http://arxiv.org/abs/2008.09676v3
- Date: Wed, 26 Aug 2020 20:46:23 GMT
- Title: Abstractive Summarization of Spoken and Written Instructions with BERT
- Authors: Alexandra Savelieva, Bryan Au-Yeung, and Vasanth Ramani
- Abstract summary: We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this being integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
- Score: 66.14755043607776
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Summarization of speech is a difficult problem due to the spontaneity of the
flow, disfluencies, and other issues that are not usually encountered in
written texts. Our work presents the first application of the BERTSum model to
conversational language. We generate abstractive summaries of narrated
instructional videos across a wide variety of topics, from gardening and
cooking to software configuration and sports. To enrich the vocabulary, we use
transfer learning, pretraining the model on several large cross-domain datasets
in both written and spoken English. We also preprocess transcripts to restore
sentence segmentation and punctuation in the output of an ASR system. The
results are evaluated with ROUGE and
Content-F1 scoring for the How2 and WikiHow datasets. We engage human judges to
score a set of summaries randomly selected from a dataset curated from
HowTo100M and YouTube. Based on blind evaluation, we achieve a level of textual
fluency and utility close to that of summaries written by human content
creators. The model beats current SOTA when applied to WikiHow articles that
vary widely in style and topic, while showing no performance regression on the
canonical CNN/DailyMail dataset. Due to the high generalizability of the model
across different styles and domains, it has great potential to improve
accessibility and discoverability of internet content. We envision this
integrated as a feature in intelligent virtual assistants, enabling them to
summarize both written and spoken instructional content upon request.
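As a minimal sketch of the workflow the abstract describes (ASR transcript cleanup, abstractive summarization with a BERT-family model, and ROUGE evaluation), the Python snippet below strings the three steps together. Note the assumptions: the filler-removal regex is an illustrative stand-in for the paper's segmentation and punctuation restoration, and the facebook/bart-large-cnn checkpoint is a placeholder, since BERTSum itself is not distributed through this interface.

```python
# Sketch of the pipeline from the abstract: clean an ASR transcript,
# summarize it abstractively, and score the result with ROUGE.
# The cleanup heuristics and the BART checkpoint are illustrative
# stand-ins, not the authors' BERTSum artifacts.
import re

from transformers import pipeline      # pip install transformers
from rouge_score import rouge_scorer   # pip install rouge-score

def clean_transcript(text: str) -> str:
    """Toy stand-in for the paper's segmentation/punctuation restoration:
    drop common spoken-language fillers and collapse whitespace."""
    text = re.sub(r"\b(um+|uh+|you know)\b", "", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()

# Placeholder seq2seq checkpoint; BERTSum is not released via this API.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

transcript = ("um so today I'm going to show you how to repot a plant "
              "you know first pick a pot slightly larger than the old one")
reference = "The video shows how to repot a plant into a slightly larger pot."

summary = summarizer(clean_transcript(transcript),
                     max_length=40, min_length=10)[0]["summary_text"]

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)
print(summary)
print(scorer.score(reference, summary))
```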
Related papers
- Personalized Video Summarization using Text-Based Queries and Conditional Modeling [3.4447129363520337]
This thesis explores enhancing video summarization by integrating text-based queries and conditional modeling.
Evaluation metrics such as accuracy and F1-score are used to assess the quality of the generated summaries.
arXiv Detail & Related papers (2024-08-27T02:43:40Z)
- WAVER: Writing-style Agnostic Text-Video Retrieval via Distilling Vision-Language Models Through Open-Vocabulary Knowledge [12.034917651508524]
WAVER is a cross-domain knowledge distillation framework built on vision-language models.
WAVER capitalizes on the open-vocabulary properties of pre-trained vision-language models.
It can achieve state-of-the-art performance in text-video retrieval task while handling writing-style variations.
arXiv Detail & Related papers (2023-12-15T03:17:37Z) - On Context Utilization in Summarization with Large Language Models [83.84459732796302]
Large language models (LLMs) excel in abstractive summarization tasks, delivering fluent and pertinent summaries.
Recent advancements have extended their capabilities to handle long-input contexts, exceeding 100k tokens.
We conduct the first comprehensive study on context utilization and position bias in summarization.
arXiv Detail & Related papers (2023-10-16T16:45:12Z) - Towards End-to-end Speech-to-text Summarization [0.0]
Speech-to-text (S2T) summarization is a time-saving technique for filtering and keeping up with broadcast news uploaded online daily.
End-to-end (E2E) modelling of S2T abstractive summarization is a promising approach that offers the possibility of generating rich latent representations.
We model S2T summarization both with a cascade and an E2E system for a corpus of broadcast news in French.
arXiv Detail & Related papers (2023-06-06T15:22:16Z)
- Learning a Grammar Inducer from Massive Uncurated Instructional Videos [118.7279072358029]
Video-aided grammar induction aims to leverage video information for finding more accurate syntactic grammars for accompanying text.
We build a new model that can better learn video-span correlation without manually designed features.
Our model yields higher F1 scores than the previous state-of-the-art systems trained on in-domain data.
arXiv Detail & Related papers (2022-10-22T00:22:55Z)
- Leveraging Natural Supervision for Language Representation Learning and Generation [8.083109555490475]
We describe three lines of work that seek to improve the training and evaluation of neural models using naturally-occurring supervision.
We first investigate self-supervised training losses to help enhance the performance of pretrained language models for various NLP tasks.
We propose a framework that uses paraphrase pairs to disentangle semantics and syntax in sentence representations.
arXiv Detail & Related papers (2022-07-21T17:26:03Z)
- CLIP-It! Language-Guided Video Summarization [96.69415453447166]
This work introduces CLIP-It, a single framework for addressing both generic and query-focused video summarization.
We propose a language-guided multimodal transformer that learns to score frames in a video based on their importance relative to one another.
Our model can be extended to the unsupervised setting by training without ground-truth supervision.
arXiv Detail & Related papers (2021-07-01T17:59:27Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, 'ApartmenTour', that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Topic Adaptation and Prototype Encoding for Few-Shot Visual Storytelling [81.33107307509718]
We propose a topic adaptive storyteller to model the ability of inter-topic generalization.
We also propose a prototype encoding structure to model the ability of intra-topic derivation.
Experimental results show that topic adaptation and prototype encoding structure mutually bring benefit to the few-shot model.
arXiv Detail & Related papers (2020-08-11T03:55:11Z)