Efficient Pre-training for Localized Instruction Generation of Videos
- URL: http://arxiv.org/abs/2311.15964v4
- Date: Sat, 20 Jul 2024 17:55:37 GMT
- Title: Efficient Pre-training for Localized Instruction Generation of Videos
- Authors: Anil Batra, Davide Moltisanti, Laura Sevilla-Lara, Marcus Rohrbach, Frank Keller
- Abstract summary: Procedural videos are instrumental in conveying step-by-step instructions.
Procedure Transformer (ProcX) is a model for end-to-end step localization and instruction generation for procedural videos.
- Score: 32.13509517228516
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Procedural videos, exemplified by recipe demonstrations, are instrumental in conveying step-by-step instructions. However, understanding such videos is challenging as it involves the precise localization of steps and the generation of textual instructions. Manually annotating steps and writing instructions is costly, which limits the size of current datasets and hinders effective learning. Leveraging large but noisy video-transcript datasets for pre-training can boost performance but demands significant computational resources. Furthermore, transcripts contain irrelevant content and differ in style from human-written instructions. To mitigate these issues, we propose a novel technique, Sieve-&-Swap, to automatically generate high-quality training data for the recipe domain: (i) Sieve: filters irrelevant transcripts and (ii) Swap: acquires high-quality text by replacing transcripts with human-written instruction from a text-only recipe dataset. The resulting dataset is three orders of magnitude smaller than current web-scale datasets but enables efficient training of large-scale models. Alongside Sieve-&-Swap, we propose Procedure Transformer (ProcX), a model for end-to-end step localization and instruction generation for procedural videos. When pre-trained on our curated dataset, this model achieves state-of-the-art performance on YouCook2 and Tasty while using a fraction of the training data. We have released code and dataset.
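The Sieve and Swap steps described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how relevance is scored or how instructions are retrieved, so the token-overlap (Jaccard) similarity, the `threshold` value, and every function name below are assumptions made purely for illustration.

```python
import re


def tokens(text):
    """Lowercase word set; a crude stand-in for a learned text embedding."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sieve_and_swap(transcript_segments, recipe_steps, threshold=0.3):
    """Return (segment_index, instruction) pairs.

    Sieve: drop segments whose best match scores below `threshold`.
    Swap: replace each kept segment with its best-matching instruction.
    The similarity measure and threshold are hypothetical choices.
    """
    if not recipe_steps:
        return []
    step_tokens = [tokens(step) for step in recipe_steps]
    pairs = []
    for index, segment in enumerate(transcript_segments):
        segment_tokens = tokens(segment)
        scores = [jaccard(segment_tokens, st) for st in step_tokens]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            pairs.append((index, recipe_steps[best]))
    return pairs


# Made-up example data: one chit-chat segment and two on-topic segments.
segments = [
    "hey guys welcome back to my channel",
    "now we chop the onions finely",
    "then fry the onions in butter until golden",
]
steps = [
    "Chop the onions finely.",
    "Fry the onions in butter until golden.",
]
print(sieve_and_swap(segments, steps))
```

In this example the greeting segment shares no words with either instruction and is sieved out, while the two cooking segments are swapped for their closest human-written instructions. A realistic pipeline would presumably use a stronger text similarity model than word overlap.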
Related papers
- REInstruct: Building Instruction Data from Unlabeled Corpus [49.82314244648043]
We propose REInstruct, a method to automatically build instruction data from an unlabeled corpus.
By training Llama-7b on a combination of 3k seed data and 32k synthetic data from REInstruct, the fine-tuned model achieves a 65.41% win rate on the AlpacaEval leaderboard against text-davinci-003.
arXiv Detail & Related papers (2024-08-20T09:05:03Z)
- Multi-Sentence Grounding for Long-term Instructional Video [63.27905419718045]
We aim to establish an automatic, scalable pipeline for denoising a large-scale instructional dataset.
We construct a high-quality video-text dataset with multiple descriptive steps supervision, named HowToStep.
arXiv Detail & Related papers (2023-12-21T17:28:09Z)
- Learning to Ground Instructional Articles in Videos through Narrations [50.3463147014498]
We present an approach for localizing steps of procedural activities in narrated how-to videos.
We source the step descriptions from a language knowledge base (wikiHow) containing instructional articles.
Our model learns to temporally ground the steps of procedural articles in how-to videos by matching three modalities.
arXiv Detail & Related papers (2023-06-06T15:45:53Z)
- MSdocTr-Lite: A Lite Transformer for Full Page Multi-script Handwriting Recognition [3.0682439731292592]
We propose a lite transformer architecture for full-page multi-script handwriting recognition.
The proposed model comes with three advantages.
It can learn the reading order at page-level thanks to a curriculum learning strategy.
It can be easily adapted to other scripts by applying a simple transfer-learning process.
arXiv Detail & Related papers (2023-03-24T11:40:50Z)
- One Embedder, Any Task: Instruction-Finetuned Text Embeddings [105.82772523968961]
INSTRUCTOR is a new method for computing text embeddings given task instructions.
Every text input is embedded together with instructions explaining the use case.
We evaluate INSTRUCTOR on 70 embedding evaluation tasks.
arXiv Detail & Related papers (2022-12-19T18:57:05Z)
- Curriculum-Based Self-Training Makes Better Few-Shot Learners for Data-to-Text Generation [56.98033565736974]
We propose Curriculum-Based Self-Training (CBST) to leverage unlabeled data in a rearranged order determined by the difficulty of text generation.
Our method can outperform fine-tuning and task-adaptive pre-training methods, and achieve state-of-the-art performance in the few-shot setting of data-to-text generation.
arXiv Detail & Related papers (2022-06-06T16:11:58Z)
- Actuarial Applications of Natural Language Processing Using Transformers: Case Studies for Using Text Features in an Actuarial Context [0.0]
This tutorial demonstrates how to incorporate text data into actuarial classification and regression tasks.
The main focus is on methods employing transformer-based models.
The case studies tackle challenges related to a multi-lingual setting and long input sequences.
arXiv Detail & Related papers (2022-06-04T15:39:30Z)
- CUPID: Adaptive Curation of Pre-training Data for Video-and-Language Representation Learning [49.18591896085498]
We propose CUPID to bridge the domain gap between source and target data.
CUPID yields new state-of-the-art performance across multiple video-language and video tasks.
arXiv Detail & Related papers (2021-04-01T06:42:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.