DynaStride: Dynamic Stride Windowing with MMCoT for Instructional Multi-Scene Captioning
- URL: http://arxiv.org/abs/2510.23907v1
- Date: Mon, 27 Oct 2025 22:29:08 GMT
- Title: DynaStride: Dynamic Stride Windowing with MMCoT for Instructional Multi-Scene Captioning
- Authors: Eddison Pham, Prisha Priyadarshini, Adrian Maliackel, Kanishk Bandi, Cristian Meo, Kevin Zhu
- Abstract summary: Scene-level captioning in instructional videos can enhance learning by requiring an understanding of both visual cues and temporal structure. We introduce DynaStride, a pipeline to generate coherent, scene-level captions without requiring manual scene segmentation. We show that DynaStride produces captions that are more temporally coherent and informative, suggesting a promising direction for improving AI-powered instructional content generation.
- Score: 3.47287766500271
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Scene-level captioning in instructional videos can enhance learning by requiring an understanding of both visual cues and temporal structure. By aligning visual cues with textual guidance, this understanding supports procedural learning and multimodal reasoning, providing a richer context for skill acquisition. However, captions that fail to capture this structure may lack coherence and quality, which can create confusion and undermine the video's educational intent. To address this gap, we introduce DynaStride, a pipeline to generate coherent, scene-level captions without requiring manual scene segmentation. Using the YouCookII dataset's scene annotations, DynaStride performs adaptive frame sampling and multimodal windowing to capture key transitions within each scene. It then employs a multimodal chain-of-thought process to produce multiple action-object pairs, which are refined and fused using a dynamic stride window selection algorithm that adaptively balances temporal context and redundancy. The final scene-level caption integrates visual semantics and temporal reasoning in a single instructional caption. Empirical evaluations against strong baselines, including VLLaMA3 and GPT-4o, demonstrate consistent gains on both N-gram-based metrics (BLEU, METEOR) and semantic similarity measures (BERTScore, CLIPScore). Qualitative analyses further show that DynaStride produces captions that are more temporally coherent and informative, suggesting a promising direction for improving AI-powered instructional content generation.
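To make the dynamic stride idea concrete, below is a minimal Python sketch of one plausible reading of the window selection step described in the abstract: grow the stride while adjacent multimodal windows stay redundant, shrink it when content changes, so that temporal context is covered without repeating information. The function names, the cosine-similarity redundancy threshold, and the embedding inputs are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of a dynamic stride window selection step, assuming windows are
# represented by embedding vectors and redundancy is measured by cosine similarity.
# All names and thresholds here are hypothetical, not DynaStride's actual code.
from typing import List
import math


def cosine_sim(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def select_dynamic_stride(window_embeddings: List[List[float]],
                          min_stride: int = 1,
                          max_stride: int = 4,
                          redundancy_threshold: float = 0.5) -> List[int]:
    """Pick which multimodal windows to keep for a scene.

    The stride grows while consecutive windows remain highly similar
    (redundant) and resets when the content changes, trading off temporal
    coverage against repeated information.
    """
    selected = [0]
    i = 0
    while i < len(window_embeddings) - 1:
        stride = min_stride
        # Extend the stride as long as the next window is redundant with the
        # current one, up to max_stride.
        while (stride < max_stride
               and i + stride + 1 < len(window_embeddings)
               and cosine_sim(window_embeddings[i],
                              window_embeddings[i + stride]) > redundancy_threshold):
            stride += 1
        i += stride
        if i < len(window_embeddings):
            selected.append(i)
    return selected


if __name__ == "__main__":
    # Toy embeddings: three near-duplicate windows followed by a content change.
    embs = [[1.0, 0.0], [0.99, 0.05], [0.98, 0.1], [0.0, 1.0], [0.05, 0.99]]
    print(select_dynamic_stride(embs))  # e.g. [0, 3, 4]
```

In this reading, the selected windows would then feed the multimodal chain-of-thought stage that produces and fuses action-object pairs into the final scene-level caption.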
Related papers
- Explicit Temporal-Semantic Modeling for Dense Video Captioning via Context-Aware Cross-Modal Interaction [33.79474114703357]
We propose an explicit temporal-semantic modeling framework called Context-Aware Cross-Modal Interaction (CACMI). Our model consists of two core components: Cross-modal Frame Aggregation and Context-aware Feature Enhancement. Experiments on the ActivityNet Captions and YouCook2 datasets demonstrate that CACMI achieves state-of-the-art performance on the dense video captioning task.
arXiv Detail & Related papers (2025-11-13T09:48:12Z) - Progress-Aware Video Frame Captioning [55.23366888264651]
We propose ProgressCaptioner, a captioning model designed to capture the fine-grained temporal dynamics within an action sequence. We develop the FrameCap dataset to support training and the FrameCapEval benchmark to assess caption quality. Results demonstrate that ProgressCaptioner significantly surpasses leading captioning models.
arXiv Detail & Related papers (2024-12-03T01:21:28Z) - Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z) - Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator [59.589919015669274]
This study focuses on zero-shot text-to-video generation with data and cost efficiency in mind.
We propose a novel Free-Bloom pipeline that harnesses large language models (LLMs) as the director to generate a semantic-coherence prompt sequence.
We also propose a series of annotative modifications to adapt LDMs in the reverse process, including joint noise sampling, step-aware attention shift, and dual-path.
arXiv Detail & Related papers (2023-09-25T19:42:16Z) - Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z) - Diverse Video Captioning by Adaptive Spatio-temporal Attention [7.96569366755701]
Our end-to-end encoder-decoder video captioning framework incorporates two transformer-based architectures.
We introduce an adaptive frame selection scheme to reduce the number of required incoming frames.
We estimate semantic concepts relevant for video captioning by aggregating all ground truth captions of each sample.
arXiv Detail & Related papers (2022-08-19T11:21:59Z) - Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled keyword prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z) - DVCFlow: Modeling Information Flow Towards Human-like Video Captioning [163.71539565491113]
Existing methods mainly generate captions from individual video segments, lacking adaptation to the global visual context.
We introduce the concept of information flow to model the progressive information changing across video sequence and captions.
Our method significantly outperforms competitive baselines and generates more human-like text according to subjective and objective tests.
arXiv Detail & Related papers (2021-11-19T10:46:45Z) - Visual-aware Attention Dual-stream Decoder for Video Captioning [12.139806877591212]
The attention mechanism in current video captioning methods learns to assign a weight to each frame, guiding the decoder dynamically.
However, this may not explicitly model the correlation and temporal coherence of the visual features extracted across sequential frames.
We propose a new Visual-aware Attention (VA) model, which unifies changes of temporal sequence frames with the words at the previous moment.
The effectiveness of the proposed Visual-aware Attention Dual-stream Decoder (VADD) is demonstrated.
arXiv Detail & Related papers (2021-10-16T14:08:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.