Related papers: Multi-Sentence Grounding for Long-term Instructional Video

URL: http://arxiv.org/abs/2312.14055v2
Date: Mon, 22 Jul 2024 03:17:29 GMT
Title: Multi-Sentence Grounding for Long-term Instructional Video
Authors: Zeqian Li, Qirui Chen, Tengda Han, Ya Zhang, Yanfeng Wang, Weidi Xie
Abstract summary: We aim to establish an automatic, scalable pipeline for denoising a large-scale instructional dataset. We construct a high-quality video-text dataset with multiple descriptive steps supervision, named HowToStep.
Score: 63.27905419718045
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we aim to establish an automatic, scalable pipeline for denoising a large-scale instructional dataset and to construct a high-quality video-text dataset with multiple descriptive steps supervision, named HowToStep. We make the following contributions: (i) improving the quality of sentences in the dataset by upgrading ASR systems to reduce speech recognition errors and prompting a large language model to transform noisy ASR transcripts into descriptive steps; (ii) proposing a Transformer-based architecture with all texts as queries, iteratively attending to the visual features, to temporally align the generated steps to the corresponding video segments. To measure the quality of our curated dataset, we train models on it for the task of multi-sentence grounding, i.e., given a long-form video and multiple associated sentences, determining their corresponding timestamps in the video simultaneously. As a result, the model shows superior performance on a series of multi-sentence grounding tasks, surpassing existing state-of-the-art methods by a significant margin on three public benchmarks, namely, 9.0% on HT-Step, 5.1% on HTM-Align and 1.9% on CrossTask. All code, models, and the resulting dataset have been publicly released.
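
The multi-sentence grounding task described in the abstract (several sentences, one long video, one timestamp per sentence) can be illustrated with a toy sketch. This is not the paper's Transformer-based model: it assumes precomputed sentence and frame embeddings and simply assigns each sentence to its most similar frame by cosine similarity. The function names, the `seconds_per_frame` parameter, and the embedding values are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def ground_sentences(sentence_embs, frame_embs, seconds_per_frame=1.0):
    """Return one timestamp (in seconds) per sentence.

    Each sentence is assigned to the frame with the highest cosine
    similarity. This nearest-frame rule is a stand-in for a learned
    alignment model, used here only to show the task's input and output.
    """
    timestamps = []
    for s in sentence_embs:
        scores = [cosine(s, f) for f in frame_embs]
        best = max(range(len(scores)), key=scores.__getitem__)
        timestamps.append(best * seconds_per_frame)
    return timestamps

# Toy 2-D embeddings (hypothetical values, one frame per second).
frames = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
sentences = [[0.0, 2.0], [3.0, 0.1]]
print(ground_sentences(sentences, frames))  # [2.0, 0.0]
```

In this toy case the first sentence lands on the third frame and the second sentence on the first frame; a real system would replace the similarity rule with a trained model and would typically predict segments rather than single frames.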

Related papers

CI-VID: A Coherent Interleaved Text-Video Dataset [23.93099552431937]
CI-VID is a dataset that moves beyond isolated text-to-video (T2V) generation toward text-and-video-to-video (TV2V) generation. It contains over 340,000 samples, each featuring a coherent sequence of video clips with text captions. We show that models trained on CI-VID exhibit significant improvements in both accuracy and content consistency when generating video sequences.
arXiv Detail & Related papers (2025-07-02T17:48:01Z)
Everything Can Be Described in Words: A Simple Unified Multi-Modal Framework with Semantic and Temporal Alignment [0.0]
Long Video Question Answering (LVQA) is challenging due to the need for temporal reasoning and large-scale multimodal data processing. We introduce UMaT, a retrieval-augmented generation framework that efficiently processes extremely long videos. We show that UMaT outperforms existing methods in multimodal integration, long-form video understanding, and sparse information retrieval.
arXiv Detail & Related papers (2025-03-12T05:28:24Z)
Raccoon: Multi-stage Diffusion Training with Coarse-to-Fine Curating Videos [15.781862060265519]
CFC-VIDS-1M is a high-quality video dataset constructed through a systematic coarse-to-fine curation pipeline. We develop RACCOON, a transformer-based architecture with decoupled spatial-temporal attention mechanisms.
arXiv Detail & Related papers (2025-02-28T18:56:35Z)
Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis [13.702423348269155]
We propose a new task -- generating speech from videos of people and their transcripts (VTTS) -- to motivate new techniques for multimodal speech generation. We present a decoder-only multimodal model for this task, which we call Visatronic. It embeds vision, text and speech directly into the common subspace of a transformer model and uses an autoregressive loss to learn a generative model of discretized mel-spectrograms conditioned on speaker videos and transcripts of their speech.
arXiv Detail & Related papers (2024-11-26T18:57:29Z)
xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations [120.52120919834988]
xGen-VideoSyn-1 is a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens. The DiT model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios.
arXiv Detail & Related papers (2024-08-22T17:55:22Z)
Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions. Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs. We employ an LLM to paraphrase a congruent caption for each audio, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
Learning to Ground Instructional Articles in Videos through Narrations [50.3463147014498]
We present an approach for localizing steps of procedural activities in narrated how-to videos. We source the step descriptions from a language knowledge base (wikiHow) containing instructional articles. Our model learns to temporally ground the steps of procedural articles in how-to videos by matching three modalities.
arXiv Detail & Related papers (2023-06-06T15:45:53Z)
Hierarchical3D Adapters for Long Video-to-text Summarization [79.01926022762093]
Our experiments demonstrate that multimodal information offers superior performance over more memory-heavy and fully fine-tuned textual summarization methods.
arXiv Detail & Related papers (2022-10-10T16:44:36Z)
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions [31.4943447481144]
We study joint video and language (VL) pre-training to enable cross-modality learning and benefit plentiful downstream tasks. Our model achieves new state-of-the-art results in 10 understanding tasks and 2 more novel text-to-visual generation tasks.
arXiv Detail & Related papers (2021-11-19T17:36:01Z)
Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning [62.47593143542552]
We describe a subset optimization approach for automatic dataset curation. We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our data, despite being automatically constructed, achieve similar downstream performances to existing video datasets with similar scales.
arXiv Detail & Related papers (2021-01-26T14:27:47Z)
Multiresolution and Multimodal Speech Recognition with Transformers [22.995102995029576]
This paper presents an audio-visual automatic speech recognition (AV-ASR) system using a Transformer-based architecture. We focus on the scene context provided by the visual information to ground the ASR. Our results are comparable to state-of-the-art Listen, Attend and Spell-based architectures.
arXiv Detail & Related papers (2020-04-29T09:32:11Z)

This list is automatically generated from the titles and abstracts of the papers on this site.